
WO2025016547A1 - Packing of video data - Google Patents

Packing of video data

Info

Publication number
WO2025016547A1
Authority
WO
WIPO (PCT)
Prior art keywords
depth image
picture
selection map
map
channels
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/EP2023/070214
Other languages
French (fr)
Inventor
Martin Pettersson
Ali El Essaili
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Telefonaktiebolaget LM Ericsson AB
Original Assignee
Telefonaktiebolaget LM Ericsson AB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget LM Ericsson AB filed Critical Telefonaktiebolaget LM Ericsson AB
Priority to PCT/EP2023/070214 priority Critical patent/WO2025016547A1/en
Publication of WO2025016547A1 publication Critical patent/WO2025016547A1/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/20Contour coding, e.g. using detection of edges
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • H04N13/161Encoding, multiplexing or demultiplexing different image signal components
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/194Transmission of image signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/271Image signal generators wherein the generated image signals comprise depth maps or disparity maps

Definitions

  • the present disclosure relates generally to image processing techniques, and more particularly, to devices and methods configured to pack video data to send to a receiving device.
  • VVC Versatile Video Coding
  • HEVC High Efficiency Video Coding
  • AVC Advanced Video Coding
  • ITU-T International Telecommunication Union Telecommunication Standardization Sector
  • MPEG Moving Picture Experts Group
  • the VVC, HEVC, and AVC codecs are block-based.
  • the codecs utilize both temporal and spatial prediction, where spatial prediction is achieved using intra (I) prediction from within the current picture and temporal prediction is achieved using unidirectional (P) or bi-directional inter (B) prediction on the block level from previously decoded reference pictures.
  • the difference between the original pixel data and the predicted pixel data is transformed into the frequency domain, quantized, and then entropy coded before being transmitted to a decoder together with necessary prediction parameters.
  • Such parameters include, for example, prediction mode and motion vectors, and are also entropy coded before being transmitted to the decoder.
  • Upon receipt, the decoder performs entropy decoding, inverse quantization, and inverse transformation to obtain the residual, and then adds the residual to an intra or inter prediction to reconstruct a picture. Transforming the residual data to the frequency domain before quantization improves the efficiency of the compression and better masks artifacts for the human eye when the quantization is high.
  • QoE Quality of Experience
  • the embodiments described herein provide a method for packing video data, such as depth images and selection maps (e.g., occupancy maps), for example, to send to a receiving device in the form of point clouds and/or meshes.
  • Upon receipt, the receiving device processes the video data for volumetric video consumption, for example.
  • Volumetric video is a 3D representation of an object, such as a person (e.g., a person’s face, head, and/or upper torso), and can be described as a point cloud, a mesh, or other, similar format.
  • Volumetric video can be generated for applications without real-time constraints and low-complexity coding demands.
  • volumetric video can also be generated for other, more complex applications having more stringent requirements, such as conversational video applications.
  • the present embodiments provide low-complexity processing techniques for real-time conversational video applications, such as holographic video communications, immersive video applications, 3D video calls, and the like.
  • Example uses of such applications include, but are not limited to, those used to enable communications between a consultant and a therapist (e.g., in the e-health arena), families communicating with each other in an immersive manner, a social media influencer communicating with his/her followers, and 3D calls with a remotely located subject matter expert.
  • the present disclosure provides a method, implemented by a sending device, for processing video data.
  • the method comprises the sending device obtaining a depth image from a depth image source, wherein the depth image comprises information associated with a distance of an object in a picture from a viewpoint.
  • the method also comprises the sending device obtaining a selection map, wherein the selection map represents the picture partitioned into a plurality of areas.
  • the method comprises the sending device packing the depth image and the selection map into a first picture having a color format with a plurality of channels, wherein the depth image and the selection map are packed in first and second channels, respectively, with at least one of the first and second channels being a color channel.
  • the present disclosure provides a method, implemented by a receiving device, for processing video data.
  • the method comprises the receiving device obtaining a first picture having a color format with a plurality of channels, wherein a depth image and a selection map are packed into first and second channels, respectively, with at least one of the first and second channels being a color channel. The method further comprises the receiving device extracting the depth image from the first channel and extracting the selection map from the second channel.
  • the present disclosure provides a sending device for processing video data.
  • the sending device is configured to obtain a depth image from a depth image source, wherein the depth image comprises information associated with a distance of an object in a picture from a viewpoint, obtain a selection map, wherein the selection map represents the picture partitioned into a plurality of areas, and pack the depth image and the selection map into a first picture having a color format with a plurality of channels, wherein the depth image and the selection map are packed in first and second channels, respectively, with at least one of the first and second channels being a color channel.
  • the present disclosure provides a sending device for processing video data.
  • the sending device comprises communication circuitry for communicating with a receiving device, and processing circuitry.
  • the processing circuitry in this aspect is configured to obtain a depth image from a depth image source, wherein the depth image comprises information associated with a distance of an object in a picture from a viewpoint, obtain a selection map, wherein the selection map represents the picture partitioned into a plurality of areas, and pack the depth image and the selection map into a first picture having a color format with a plurality of channels, wherein the depth image and the selection map are packed in first and second channels, respectively, with at least one of the first and second channels being a color channel.
  • the present disclosure provides a computer program comprising executable instructions that, when executed by a processing circuit in a sending device, causes the sending device to perform the method of the first aspect.
  • the present disclosure provides a carrier containing a computer program of the fifth aspect, wherein the carrier is one of an electronic signal, optical signal, radio signal, or computer readable storage medium.
  • the present disclosure provides a non-transitory computer-readable storage medium containing a computer program comprising executable instructions that, when executed by a processing circuit in a sending device, causes the sending device to perform the method of the first aspect.
  • the present disclosure provides a receiving device for processing video data.
  • the receiving device is configured to obtain a first picture having a color format with a plurality of channels, wherein a depth image and a selection map are packed into first and second channels, respectively, with at least one of the first and second channels being a color channel, extract the depth image from the first channel, and extract the selection map from the second channel.
  • the present disclosure provides a receiving device for processing video data.
  • the receiving device comprises communication circuitry for communicating with one or more devices via a network, and processing circuitry configured to obtain a first picture having a color format with a plurality of channels, wherein a depth image and a selection map are packed into first and second channels, respectively, with at least one of the first and second channels being a color channel, extract the depth image from the first channel, and extract the selection map from the second channel.
  • the present disclosure provides a computer program comprising executable instructions that, when executed by a processing circuit in a receiving device, causes the receiving device to perform the method of the eighth aspect.
  • the present disclosure provides a carrier containing a computer program of the tenth aspect, wherein the carrier is one of an electronic signal, optical signal, radio signal, or computer readable storage medium.
  • the present disclosure provides a non-transitory computer-readable storage medium containing a computer program comprising executable instructions that, when executed by a processing circuit in a receiving device, causes the receiving device to perform the method of the eighth aspect.
  • the present disclosure provides a communication system for processing video data.
  • the communication system comprises a sending device and a receiving device, the sending device is configured to obtain a depth image from a depth image source, wherein the depth image comprises information associated with a distance of an object in a picture from a viewpoint, obtain a selection map, wherein the selection map represents the picture partitioned into a plurality of areas, and pack the depth image and the selection map into a first picture having a color format with a plurality of channels, wherein the depth image and the selection map are packed in first and second channels, respectively, with at least one of the first and second channels being a color channel.
  • the receiving device is configured to obtain the first picture having the color format with the plurality of channels, wherein the depth image and the selection map are packed into the first and second channels, respectively, with at least one of the first and second channels being the color channel, extract the depth image from the first channel, and extract the selection map from the second channel.
  • Figure 1 illustrates a mesh created using polygonal faces.
  • Figure 2 illustrates an example of a texture image, selection map, and depth image in separate streams, according to one embodiment of the present disclosure.
  • Figure 3 is a functional block diagram illustrating a system for implementing the present disclosure according to a first embodiment.
  • Figure 4A is a flowchart illustrating a method, implemented at a sending device, for packing and encoding video data to send to a receiving device via a network according to one embodiment of the present disclosure.
  • Figure 4B is a flowchart illustrating a method, implemented at a receiving device, for extracting and decoding the packed video data received from sending device according to one embodiment of the present disclosure.
  • Figure 5 illustrates an example of packing a selection map and a depth image into different channels of a picture according to one embodiment.
  • Figure 6 illustrates an example texture image and an example selection map having the same resolution, and a depth image that has a different resolution than the texture image and the selection map, according to one embodiment.
  • Figure 7 illustrates an example texture image having a resolution that is higher than the resolutions of both the selection map and the depth image.
  • Figure 8 is a functional block diagram illustrating a system for implementing the present disclosure according to another embodiment.
  • Figure 9 is a flow chart illustrating a method for processing video data at a sender device according to one embodiment of the present disclosure.
  • Figure 10 is a flow chart illustrating an example method for obtaining a selection map according to embodiments of the present disclosure.
  • Figure 11 is a flow chart illustrating an example method for deriving the selection map from the depth image according to one embodiment.
  • Figure 12 is a flow chart illustrating a method for processing video data at a receiving device according to one embodiment.
  • Figure 13 is a flow chart illustrating a method for processing video data at a receiving device according to another embodiment of the present disclosure.
  • Figure 14 is a functional block diagram illustrating some example components of a sending device configured according to the present embodiments.
  • Figure 15 is a functional block diagram illustrating some example components of a receiving device configured according to the present embodiments.
  • Figure 16 is a functional block diagram illustrating some example components of a network-based receiving device configured according to the present embodiments.
  • Figure 17 is a functional block diagram illustrating an example of a communication system in accordance with some embodiments.
  • Figure 18 is a functional block diagram illustrating an example User Equipment (UE) in accordance with some embodiments.
  • Figure 19 is a functional block diagram illustrating an example network node in accordance with some embodiments.
  • Figure 20 is a functional block diagram of a host, which may be an embodiment of the host illustrated in Figure 17, in accordance with some embodiments described herein.
  • Figure 21 is a functional block diagram illustrating a virtualization environment in which functions implemented by some embodiments of the present disclosure may be virtualized.
  • Figure 22 is a functional block diagram illustrating a communication diagram of a host communicating via a network node with a UE over a partially wireless connection in accordance with some embodiments.
  • the embodiments described herein provide a method for packing video data that may be used, for example, to generate volumetric video.
  • the present embodiments utilize the fact that a depth image and a selection map (e.g., occupancy map) are correlated to pack the depth image and the selection map into different channels of the same picture prior to compression.
  • a sending device configured according to the present embodiments obtains a depth image and a selection map.
  • the depth image comprises information associated with the distance of an object in a picture from a particular viewpoint (e.g., a camera that captured the picture), while the selection map represents the picture partitioned into a plurality of areas.
  • the sending device packs the depth image and the selection map into first and second channels of a first picture, respectively, and then encodes the first picture to send to a receiving device.
  • the receiving device decodes the first picture and extracts both the depth image and the selection map from the first and second channels before generating volumetric video based on the depth image and the selection map.
  • a video sequence consists of a series of pictures with each picture consisting of one or more components.
  • components are sometimes referred to as “color components,” and other times as “channels.”
  • a picture in a video sequence is sometimes denoted “image” or “frame.”
  • the terms picture, frame, and image are used interchangeably.
  • the terms “encoding” and “compressing” are used interchangeably, as are the terms “decoding” and “decompressing,” and the terms “pixel” and “sample.”
  • Each component in a picture can be described as a two-dimensional rectangular array of sample values.
  • Uncompressed pictures that come from a camera and uncompressed pictures that are rendered in a display are often in red, green, blue (RGB) format. This is because the camera sensors and the sub-pixels of the pixels in displays are historically compatible with RGB format.
  • Video coding is different. Particularly, the human eye is more sensitive to luminance than to chrominance. Therefore, a picture in a video sequence more commonly consists of three components; one luma component “Y,” where the sample values are luma values, and two chroma components “Cb” and “Cr,” where the sample values are chroma values. This is sometimes referred to as “YCbCr” color format or “YUV” color format where the chroma components are "U” and “V”. To optimize perceived quality, more bits can be spent on the luminance component than on the chrominance components. For the same reason, the dimensions of the chroma components are typically smaller than the luma components by a factor of two in each dimension.
  • with the chroma components having been subsampled compared to the luma component, the size of the luma component Y of an HD picture would be 1920x1080 and the chroma components Cb and Cr would each have the dimension 960x540.
  • a YUV format in which the chroma components have been subsampled in both the vertical and horizontal directions is often referred to as the “YUV420” format and understood to have 4:2:0 chroma subsampling.
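  • As a concrete illustration of the 4:2:0 arithmetic described above, the short Python/NumPy sketch below computes the plane dimensions of an HD picture and allocates the three planes; the helper name is illustrative only and not part of any codec API:

```python
import numpy as np

def yuv420_plane_shapes(width: int, height: int):
    """Return the (height, width) of the Y, Cb and Cr planes for 4:2:0 subsampling."""
    luma = (height, width)
    chroma = (height // 2, width // 2)  # chroma is halved in both dimensions
    return luma, chroma, chroma

luma, cb, cr = yuv420_plane_shapes(1920, 1080)
print(luma, cb, cr)   # (1080, 1920) (540, 960) (540, 960)

# Allocating the three planes of one uncompressed 8-bit YUV420 picture:
y_plane = np.zeros(luma, dtype=np.uint8)
cb_plane = np.zeros(cb, dtype=np.uint8)
cr_plane = np.zeros(cr, dtype=np.uint8)
```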
  • color channels are channels that carry color information, such as Cb/U, Cr/V, R, G, and B.
  • Luma channels, in contrast, do not carry color information, and thus are not considered “color” channels in the context of this disclosure.
  • each component is split into blocks, with each block being a two-dimensional array of samples.
  • a coded video bitstream therefore, consists of a series of coded blocks.
  • the picture is split into coding units that cover a specific area of the picture.
  • Each coding unit consists of all blocks from all components that make up that specific area and each block belongs fully to one coding unit.
  • the coding units are referred to as Coding Units (CUs).
  • Other video codecs refer to such coding units by another name.
  • the H.264 coding standard refers to such coding units as “macroblocks.”
  • a “residual block” consists of samples that represent sample value differences between the sample values of the original source blocks and the sample values of the prediction blocks.
  • the residual block is typically processed using a spatial transform.
  • the encoder quantizes transform coefficients according to a quantization parameter (QP), which controls the precision of the quantized coefficients.
  • the quantized coefficients can be referred to as “residual coefficients.” High QP values result in low precision of the coefficients, and therefore, a low fidelity/quality of the residual block.
  • Upon receiving the residual coefficients, the decoder applies inverse quantization and inverse transform to derive the residual block.
  • VVC supports subpictures.
  • a subpicture is defined as a rectangular region of one or more slices within a picture.
  • a slice comprises multiple coding units (CUs).
  • This means a subpicture comprises one or more slices that collectively cover a rectangular region of a picture.
  • a subpicture can be encoded/decoded independently of other subpictures in a picture. That is, prediction from other spatially located subpictures is not used when encoding/decoding subpictures.
  • VVC also supports bitstream extraction and merge operations. For example, consider a situation with multiple bitstreams. With the extraction operations, one or more subpictures may be extracted from a first bitstream and one or more subpictures from a second bitstream. With the merge operations, the extracted subpictures can be merged into a new third bitstream.
  • SEI Supplementary Enhancement Information
  • VCL Video Coding Layer
  • NAL Network Abstraction Layer
  • the VVC specifications have inherited the overall concept of SEI messages, and many of the SEI messages themselves, from the H.264 and HEVC specifications.
  • an SEI Raw Byte Sequence Payload (RBSP) contains one or more SEI messages.
  • VVC references SEI messages in the Versatile Supplemental Enhancement Information (VSEI) specification.
  • the VVC specification comprises more generalized SEI messages, which may be referenced by future video codec specifications.
  • SEI messages assist in processes related to decoding, display, and other purposes.
  • SEI messages are not required for constructing the luma or chroma samples by the decoding process.
  • some SEI messages are required for checking bitstream conformance and for output timing decoder conformance, while other SEI messages are not required for checking bitstream conformance.
  • a decoder is not required to support all SEI messages. When encountering an unsupported SEI message, a decoder will generally discard the message.
  • a profile in HEVC and VVC is defined as a specified subset of the syntax of the specification. More specifically, a profile defines the particular codec tools, bit depth, color format, and color subsampling that a decoder conforming to the profile should support, as well as what information is to be included in a bitstream that conforms to the profile. Practically, a codec specification may include a number of different profiles targeting different use cases, devices, and platforms, which can range from mainstream decoding in mobile devices to professional capturing and editing in high-end devices.
  • HEVC version 1 comprises the Main and Main 10 profiles with support for 8- and 10-bit bit depths with 4:2:0 chroma subsampling. Later versions of HEVC include additional profiles to support higher bit depths, 4:4:4 chroma subsampling, and additional coding tools.
  • VVC version 1 comprises the Main 10 profile, a Main 10 Still Picture profile, a Main 4:4:4 10 profile, a Main 4:4:4 10 Still Picture profile, a Multilayer Main 10 profile, and a Multilayer Main 10 4:4:4 profile, with support for higher bit depths in later versions.
  • a level is defined as a set of constraints on the values assigned to the syntax elements and variables that are identified in the specification document. For example, a level specifies the maximum throughput a decoder must be able to handle (e.g., the combination of resolution and framerate that the decoder is required to decode). Generally, the same set of levels is defined for all profiles, and most aspects defined for each level are common across different profiles. However, in some cases, individual implementations may, within specified constraints, support a different level for each supported profile.
  • mainstream video codecs and profiles are used throughout the disclosure. In the context of the present disclosure, these terms simply mean video codecs and profiles that are widely adopted in the market, or will be widely adopted in the market, and are supported (e.g., in hardware) by relevant devices, such as mobile devices, Head Mounted Devices (HMDs), computers, and TV sets, for example. According to this definition, such profiles may include, but are not limited to, the following video codec profiles: • AVC constrained baseline profile;
  • a depth image is an image that contains information relating to the distance of the surfaces of the objects in a scene from a viewpoint of the camera (i.e., “depth”).
  • Depth images are also sometimes referred to as “depth maps,” “depth pictures,” “z-buffer,” and “z-depth.”
  • Each sample value of a depth image corresponds to the depth of that sample. Normally, only one component is used to represent the depth.
  • a higher value (lighter luminance) in the depth image corresponds to a point that is close to the camera and a lower value (dark luminance) corresponds to a point that is distanced further away from the camera. In some applications, however, the opposite notation is used. That is, a low value in a depth image means that a given point in the image is close to the camera, while a high value in the depth image means that the point is distanced further away from the camera.
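  • The two notations can be made concrete with the following hedged sketch (the helper and its parameters are hypothetical, not taken from this disclosure), which quantizes metric depth to 8-bit samples under either convention:

```python
import numpy as np

def depth_to_samples(depth_m: np.ndarray, near: float, far: float,
                     near_is_bright: bool = True) -> np.ndarray:
    """Quantize metric depth (in meters) to 8-bit depth-image samples.

    near_is_bright=True  -> higher sample value means closer to the camera.
    near_is_bright=False -> the opposite notation described above.
    """
    d = np.clip(depth_m, near, far)
    t = (d - near) / (far - near)      # 0.0 at 'near', 1.0 at 'far'
    if near_is_bright:
        t = 1.0 - t                    # invert so nearby points become bright
    return np.round(t * 255).astype(np.uint8)

depth = np.array([[0.5, 1.0], [2.0, 4.0]])   # meters, illustrative values
print(depth_to_samples(depth, near=0.5, far=4.0))
```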
  • Depth images may be used for many different applications including, but not limited to, 3D object reconstruction and video conferencing (holographic communication). Depth images can correspond to certain parts of the head (e.g., a person’s face) and/or body (e.g., a person’s upper body/torso) that can represent a participant in a video conferencing session, and further, may be computer generated or captured by a depth image camera. Such depth image cameras may, for example, be a stand-alone camera or a camera that is configured to produce both texture and depth images as output, such as a Red Green Blue Depth (RGBD) camera, for example. Depth images can also be acquired, for example, by Lidar sensors (e.g., such as those used with Microsoft Kinect® and Intel Real-Sense®), true-depth sensors (e.g., iPhones®), or other devices.
  • a point cloud is a discrete set of data points in space, in which the points may represent a 3D shape or object.
  • a point cloud is represented by vertices and a set of attributes (e.g., color).
  • V-PCC video-based point cloud compression
  • MPEG-I part 5 i.e., ISO/IEC 23090-5:2021
  • 3D patches of point clouds are projected onto 2D patches, which are then packed into frames for three types of images: texture images, geometry maps, and occupancy maps.
  • the geometry map stores the distance between the missing coordinates of points of a 3D position and the projection surface in the 3D bounding box used in the 3D to 2D projection.
  • the occupancy map is used to specify which samples in the geometry map are used for the 3D reconstruction and which are not.
  • the texture images, geometry maps, and occupancy maps are coded as sub-bitstreams using HEVC. These sub-bitstreams are then multiplexed into a V-PCC bitstream together with an atlas metadata sub-bitstream that comprises information detailing how to reconstruct the point cloud from the 2D patches.
  • a 3D object or person may also be represented using polygon meshes.
  • Such meshes describe a 3D object or person in terms of connected polygons, which are typically triangles.
  • Objects created with polygon meshes store different types of elements, including vertices, edges, faces, and surfaces.
  • a vertex is a position in space and includes other information such as color and normal vector.
  • An edge is a connection between two vertices, and a face is a closed set of vertices (e.g., a triangle face has three vertices).
  • Surfaces are not required for meshes but may be used to group smooth regions of the mesh.
  • An example of a mesh 10 created using triangular faces is seen in Figure 1.
  • Meshes may, for example, be generated from point clouds, depth images, and/or texture images.
  • knowledge of intrinsic and extrinsic camera parameters used in capturing an image may be necessary to properly place the mesh in 3D space.
  • Intrinsic camera parameters specify the camera image format including focal length, image resolution, and camera principal point, while extrinsic camera parameters define the camera pose with position and orientation.
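  • As an illustration of how the intrinsic parameters are used, the sketch below back-projects one depth sample to a 3D point with the standard pinhole camera model; the parameter names (fx, fy, cx, cy) and the example values are generic assumptions, not values from this disclosure:

```python
import numpy as np

def backproject(u: int, v: int, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Map pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Illustrative intrinsics for a 640x480 depth sensor.
point = backproject(u=320, v=240, depth=1.2, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(point)   # approximately [0.0011, 0.0011, 1.2]
```

  • Extrinsic parameters (a rotation and a translation) would then place such camera-frame points in a common world coordinate system.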
  • Figure 2 illustrates an example picture 20 having a YUV format (also known as “YCbCr” format).
  • picture 20 has a plurality of channels. These are a luma or “Y” channel 22, and two chroma or “Cb” and “Cr” channels 24, 26, respectively.
  • the selection map 28 and the depth image 30 use only the luma channel 22 of the YCbCr/YUV format of picture 20.
  • such approaches can be problematic.
  • signaling requires the use of multiple encoding/decoding instances and/or the use of multilayer codecs.
  • Multilayer codecs have not been widely adopted for use in the market, and therefore, cannot be considered to be a mainstream codec.
  • many devices are only configured to support a single hardware-accelerated encoding/decoding instance at a time. Therefore, other video streams are left to be encoded/decoded using best-effort software, which undesirably increases battery consumption.
  • synchronization also becomes problematic when multiple streams are involved, and the compression process does not exploit the redundancy between the depth image and the occupancy map.
  • the present embodiments exploit the fact that the depth image and selection map (e.g., the occupancy map) are correlated by packing the depth image and the selection map in different channels of the same picture prior to compression. This may be done, for example, for a picture using a YCbCr/YUV or RGB color format, which may advantageously be compressed using a conventional mainstream video codec and video coding profile.
  • the depth image and the selection map are packed into the same picture.
  • a texture image is also packed spatially with both the depth image and the selection map in the same picture.
  • Embodiments of the present disclosure provide benefits and advantages that conventional methods and techniques cannot or do not provide.
  • by using a deployed and supported mainstream codec, and by packing the depth image and selection map (and possibly the texture image) in a color format supported by a mainstream profile of the codec, existing hardware-acceleration implementations of the mainstream codec can be used for each of the depth image, the selection map, and, when included, the texture image.
  • not only does this speed up encoding/decoding processing, but it also conserves energy and battery life, and accelerates scheduling the encoding and decoding processes for time-synchronized throughput and output while maintaining high resolution.
  • slim devices can typically only support the hardware acceleration of one encoder/decoder instance at a time. Therefore, instead of using multiple streams, including the depth image, the selection map, and the texture image in the same picture allows for full hardware acceleration of encoding and/or decoding the video data. Synchronization between the depth image, the selection map, and the texture image is also made much easier since all of the data is in the same picture.
  • packing the depth image and the selection map in different color channels of the picture allows for the use of higher resolution in the depth image, the selection map, and/or the texture image than is otherwise possible with a given codec profile. Packing the depth image and selection map in different color channels of the picture also allows for utilizing prediction between the selection map and the depth image. For instance, VVC provides tools for cross-component prediction, which in turn reduces the bitrate.
  • embodiments of the present disclosure use a deployed and supported mainstream video codec for the described implementation, thereby accelerating the adoption of holographic communication use-cases on the market.
  • Other advantages include, but are not limited to, the reduction of transport overhead as the texture images, depth images, and selection maps can be packed and compressed in a single bitstream.
  • embodiments of the present disclosure utilize a conventional mainstream codec for compressing a depth image and a selection map in the same picture while maintaining the high resolution for both.
  • the present embodiments pack the depth image and the selection map into different channels of the same picture prior to compression.
  • the picture is, for instance, in the YCbCr/YUV or RGB color format.
  • system 40 comprises a communication network 42 communicatively interconnecting a sending device 50 and a receiving device 70.
  • the communication network 42 may comprise any number of wired or wireless networks, network nodes, base stations, controllers, wireless devices, relay stations, and/or any other components or systems that may facilitate or participate in the communication of video data and/or signals between the sending device 50 and the receiving device 70 whether via wired or wireless connections.
  • sending device 50 pre-processes and encodes/compresses input video data for volumetric video into one or more bitstreams 44 and sends the bitstreams 44 to receiving device 70 via communication network 42.
  • receiving device 70 decompresses the one or more bitstreams 44 into decoded video data, which is then post-processed.
  • the decoded video data may be used for a variety of purposes including, but not limited to, generating a volumetric video stream in which the volumetric video may comprise a mesh or a point cloud.
  • the input video data received/obtained by sending device 50 comprises a video stream of one or more texture images and a video stream of one or more depth images.
  • the decoded video data received by receiving device 70 comprises one or more decoded texture images, one or more decoded depth images, and one or more decoded occupancy maps.
  • the input video data comprises multiple video streams of texture and/or multiple video streams of depth images.
  • the decoded video data may, according to the present disclosure, comprise multiple video streams of texture, and/or multiple video streams of depth images, and/or multiple video streams of occupancy maps.
  • Sending device 50 comprises various components described in more detail below. These components work together in order to pre-process the video data and convey that video data to the receiving device 70. As seen in the embodiment of Figure 3, these components comprise a video source 52, a depth source 54, pre-processing circuitry 56, one or more encoders 58, and a transmitter 60.
  • the sending device 50 may, for instance, be a mobile device, a computer, a head mounted display (HMD), such as Virtual Reality (VR), Mixed Reality (MR), or Augmented Reality (AR) glasses, another type of wearable device that comprises camera sensors, or the like.
  • sending device 50 may be implemented in another form including, but not limited to, a native operating system (e.g., iOS or Android device), device middleware, and a web-browser.
  • components of the sending device 50 and/or the receiving device 70 may be implemented in separate physical devices linked by wired or wireless data links.
  • the video source 52, the depth source 54 and the pre-processor 56 may be implemented in a first device having a wired or wireless data link to one or more separate devices implementing the one or more encoders 58 and/or the transmitter 60.
  • the video source 52, the depth source 54, the pre-processor 56, and the one or more encoders 58 may be implemented in the same physical device, and an encoded bitstream may be output by the one or more encoders 58 for storage or sending by the transmitter 60 implemented in a separate physical device.
  • Other implementations and distributions of functionality between physical devices, as would be apparent to a skilled person in the relevant art, may be selected as appropriate to a particular application.
  • Video source 52 and depth source 54 provide the stream of texture images and the stream of depth images, respectively.
  • the video source 52 in one embodiment, is a video camera configured to provide a stream of texture images in a particular format (e.g., RGB format).
  • video source 52 is a file that comprises the stream of texture images and is stored in memory (e.g., memory that is part of, or is at least accessible to) sending device 50.
  • depth source 54 is a depth camera configured to provide a stream of depth images to the sending device 50.
  • depth source 54 is a file comprising the stream(s) of depth images.
  • similarly, the file comprising the stream of depth images is stored in memory that is part of, or accessible to, sending device 50.
  • video source 52 and the depth source 54 comprise a single entity, such as a camera that provides both the texture and depth video in RGBD format, for example, or a single file comprising both the texture and depth video data.
  • the pre-processing circuitry 56 is configured to perform a variety of pre-processing functions on the video data; however, according to the present disclosure, the pre-processing functions comprise one or more of:
  • the one or more encoder(s) 58 are encoder instances configured to compress the texture images, the depth images, and the occupancy maps into one or more bitstreams. According to the present disclosure, encoders 58 advantageously comprise what those of ordinary skill in the art would consider to be “mainstream” video codecs that are widely adopted for use in the market and are supported by relevant devices. Examples of mainstream encoders 58 include, but are not limited to, those that operate according to well-known standards such as VVC, AVC, and HEVC.
  • Transmitter 60 comprises circuitry for transmitting the one or more bitstreams generated by the sending device 50 to receiving device 70.
  • transmitter 60 is further configured to convey metadata associated with the texture and depth images to receiving device 70.
  • receiving device 70 uses the metadata along with the video data received from sending device 50 to generate the volumetric video (e.g., a mesh or point cloud).
  • the receiving device 70 may be a mobile device, a computer, an HMD such as VR, MR or AR glasses, or a network entity (e.g., a cloud server or edge computing server) and comprises a receiver 72, one or more decoders 74, post-processing circuitry 76, and optionally, a display device 78.
  • the receiver 72 comprises circuitry configured to receive the one or more bitstreams from sending device 50 via network 42.
  • receiver 72 is also configured to receive the metadata from sending device 50 in cases where the metadata is not conveyed by other means.
  • Decoder(s) 74 comprise one or more decoder instances configured to decode the one or more bitstreams received from sending device 50 into decoded texture images, decoded depth images, and decoded occupancy maps. Similar to encoders 58, decoders 74 comprise mainstream video codecs, such as those that operate according to well-known standards such as VVC, AVC, and HEVC.
  • the post-processor 76 is configured to perform various post-processing functions on the decoded texture images, decoded depth images, and decoded occupancy maps.
  • post-processing functions may be any functions needed or desired; however, according to the present disclosure, the post-processing functions in one or more embodiments may comprise one or more of:
  • volumetric video may comprise, for instance, a mesh or point cloud.
  • Figure 4A is a flowchart illustrating a method 80 for processing video data for sending.
  • the processing of video data comprises, in a first part, obtaining at least a depth image and a selection map and packing the obtained depth image and selection map into a first picture.
  • the first picture has a color format with a plurality of channels such that the depth image and selection map are packed into separate channels, e.g. first and second of the plurality of channels, of the same picture.
  • the first picture is encoded for sending in a bitstream to a receiving device 70 via network 42.
  • the first and second parts of the processing of the video data are performed by a sending device 50 comprising a single physical device or one or more linked components, as discussed above.
  • the sending device 50 may, for instance, be a mobile device, a computer, a head mounted display (HMD) such as Virtual Reality (VR) glasses, Mixed Reality (MR) glasses, Augmented Reality (AR) glasses, or some other type of wearable device that comprises, for example, camera sensors or the like with all components integrated into the same physical device or distributed among two or more separate but linked or linkable devices.
  • the sending device first obtains a depth image (box 82), a selection map (box 84), and optionally, a texture image (box 86). Once the sending device has obtained this data, the depth image and the selection map are both packed into a first picture having a color format comprising a plurality of channels (box 88). In embodiments where the sending device obtained the texture image, the sending device spatially packs the texture image, the depth image, and the selection map into the same first picture (box 90). Then, using a mainstream encoder, the sending device encodes the first picture into the bitstream (box 92). As will be described later in more detail, the sending device may, in some embodiments, optionally encode information about the packing into the bitstream (box 94) before the bitstream is sent to the receiving device (box 96).
  • FIG. 4B is a flowchart illustrating a method 100, implemented at a receiving device 70, for extracting and decoding the packed video data received from sending device 50 according to one embodiment of the present disclosure.
  • the receiving device 70 may be any device equipped with a “mainstream” decoder.
  • receiving device 70 may be a mobile device, a computer, a head mounted display (HMD) such as Virtual Reality (VR) glasses, Mixed Reality (MR) glasses, Augmented Reality (AR) glasses, some other type of wearable device that comprises, for example, camera sensors or the like, or a network node, such as a cloud server or edge computing server, for example.
  • the receiving device first receives a bitstream from the sending device (box 102) and decodes a picture that was encoded into the bitstream (box 104).
  • the receiving device 70 decodes that information (box 106).
  • receiving device 70 extracts the depth image and the selection map from respective channels in the picture (box 108), and if included, the texture image (box 110).
  • Receiving device 70 then generates a volumetric video using the extracted depth image and selection map (and, if included, the texture image) (box 112) and renders the volumetric video to a display device for a user (box 114).
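  • A minimal sketch of the extraction in boxes 108-110 is shown below, assuming the decoded picture is available as separate Y and Cb planes, that the selection map was packed into the luma channel and the depth image into the Cb channel (one of the packing arrangements described below), and that the binary map was scaled to 0/255 before encoding; decode_picture() is a hypothetical stand-in for whatever mainstream decoder the device provides:

```python
import numpy as np

def extract_from_decoded_picture(y_plane: np.ndarray,
                                 cb_plane: np.ndarray,
                                 threshold: int = 128):
    """Undo the packing assumed above: selection map from luma, depth image from Cb.

    The selection map is re-binarized because lossy coding may have perturbed it.
    """
    selection_map = (y_plane >= threshold).astype(np.uint8)   # back to 0/1 values
    depth_image = cb_plane                                    # chroma-resolution depth
    return selection_map, depth_image

# y_dec, cb_dec, cr_dec = decode_picture(bitstream)   # hypothetical decoder call
# sel, depth = extract_from_decoded_picture(y_dec, cb_dec)
```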
  • the texture and depth image may be obtained from a video source and the depth source, respectively, which as described above may be an RGB camera and a depth camera located on a mobile device, for example.
  • one or both of the texture image and the depth image may be obtained from one or more files in a storage device.
  • the texture image and the depth image are obtained from a single camera that is configured to provide both the texture image and the depth image as video in RGBD format, for example, or from a file that comprises both the texture image and the depth image as video data.
  • the texture image, the depth image, and the selection map represent the same content.
  • their resolutions may be the same or different.
  • the resolution of the depth image in one embodiment, is smaller than the resolution of the texture image. In another embodiment, however, the resolutions of the depth image and the texture image are the same.
  • the selection map in the context of the present disclosure, is a representation of a picture divided into two or more areas.
  • a selection map may, for instance, be an occupancy map that indicates or describes what part(s) of an image is/are selected for output and/or further processing by a decoder.
  • the selection map is a binary map in which each sample in the map can have or be assigned only one of two possible values. For example, a first value (e.g., 1) may be used to indicate that a particular sample is within a selected area, while a second value (e.g., 0) may be used to indicate that a sample is not within the selected area.
  • the binary map may be stored with only a single bit per sample, or with more than one bit (e.g., 8 bits) per sample. Even in this latter case, however, a binary selection map according to the present disclosure would still only be configured to allow each sample to have one of two different possible values.
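  • The storage trade-off mentioned above can be illustrated with NumPy bit packing; this concerns only the in-memory representation of the binary map, not how it is coded in the bitstream:

```python
import numpy as np

binary_map = np.random.randint(0, 2, size=(1080, 1920), dtype=np.uint8)  # values 0 or 1

as_bytes = binary_map               # 8 bits per sample: ~2.07 MB for an HD map
as_bits = np.packbits(binary_map)   # 1 bit per sample:  ~0.26 MB for an HD map

restored = np.unpackbits(as_bits)[:binary_map.size].reshape(binary_map.shape)
assert np.array_equal(restored, binary_map)
```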
  • the selection map may be obtained from various places.
  • the selection map is obtained from a specific sensor.
  • the selection map is obtained from a file in a storage device.
  • the selection map is derived from the depth image.
  • the sending device may first divide the depth image into at least two areas based on the values of the depth image. The sending device may then assign a value to each sample in the selection map, where the assigned value, which is based on the values in the depth image, corresponds to the area of the depth image to which the sample belongs.
  • For example, objects in the foreground area (e.g., a subject’s head and/or torso) could be distinguished from the background, with the samples in the foreground area assigned a first value (e.g., ‘1’) and the samples in the background area of the depth image assigned a second value (e.g., ‘0’).
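  • A minimal sketch of this threshold-based derivation, assuming the notation in which higher depth-image values mean closer to the camera and using an illustrative threshold value:

```python
import numpy as np

def derive_selection_map(depth_image: np.ndarray, threshold: int) -> np.ndarray:
    """Label samples whose depth value indicates a nearby (foreground) object with 1,
    and all other samples (background) with 0."""
    return (depth_image >= threshold).astype(np.uint8)

depth_image = np.array([[200, 190, 10],
                        [210,  15,  5],
                        [ 20,  12,  8]], dtype=np.uint8)
print(derive_selection_map(depth_image, threshold=128))
# [[1 1 0]
#  [1 0 0]
#  [0 0 0]]
```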
  • a selection map may comprise a multi-threshold map configured to allow for more than one selected area. For instance, consider a multi-threshold selection map divided into four or more areas. A map sample in the multi-threshold selection map having a value of 1 might indicate that the sample belongs to a first selection. Similarly, values of 2 and 3 could indicate that the corresponding sample belongs to a second or third selection, respectively. A value of 0, however, could be used to indicate that the corresponding map sample does not belong to any of the selected selections of the depth map.
  • the values of the samples in the selection map are spread throughout a range of possible depth values.
  • Consider, for example, the depth image above divided into three selectable sections or areas (i.e., a first section, a second section, and a third section). To spread the values used to indicate the different selection states (i.e., no selection area, 1st selection area, 2nd selection area, and 3rd selection area) across the range of possible sample values, the values assigned to the samples in the selection map could be one of 0, 85, 170, and 255, depending on the area to which the samples belong.
  • After compression, each sample value in the selection map can be rounded to the nearest of these representation values. For example, consider a sample value of 94 after compression. In this case, the sending device could round the value to 85 and determine that it belongs to the first selection area.
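  • The spreading and rounding described above could, for example, be realized as follows; the representation values 0, 85, 170, and 255 are taken from the example above, while the helper functions themselves are illustrative:

```python
import numpy as np

REPRESENTATION_VALUES = np.array([0, 85, 170, 255])   # no selection, 1st, 2nd, 3rd area

def spread(area_index: np.ndarray) -> np.ndarray:
    """Map area indices 0..3 to sample values spread across the 8-bit range."""
    return REPRESENTATION_VALUES[area_index].astype(np.uint8)

def snap_to_nearest(decoded: np.ndarray) -> np.ndarray:
    """After lossy compression, round each sample back to the nearest representation value."""
    distances = np.abs(decoded.astype(np.int16)[..., None] - REPRESENTATION_VALUES)
    return REPRESENTATION_VALUES[np.argmin(distances, axis=-1)].astype(np.uint8)

print(snap_to_nearest(np.array([94, 3, 181, 250])))   # rounds to 85, 0, 170, 255
```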
  • the color format has at least two channels. In one embodiment, however, the first picture has exactly three channels. Regardless of the number of channels, though, one embodiment of the present disclosure packs the depth image and the selection map into first and second channels, respectively. In such embodiments, the first and second channels are different channels.
  • the color format can be either a RGB color format or a YUV/YCbCr color format.
  • at least one of the first and second channels is a color channel.
  • at least one of the first and second channels may be a chroma channel (e.g., a U/Cb or V/Cr channel) of an YUV/YCbCr color format.
  • at least one of the first and second channels may be an R, G, or B channel of an RGB color format. Regardless, though, the channels may or may not have the same resolution.
  • each color channel may have half of the total resolution in both the vertical and horizontal directions.
  • the channels have the same resolution (e.g., as for 4:4:4 subsampling or RGB888).
  • the channel into which the selection map is packed can precede or follow the channel into which the depth image is packed (e.g., the first channel) in stream and/or processing order.
  • the selection map may be packed into the luma channel of a YCbCr/YUV color format while the depth image is packed into the Cb/U channel of the YCbCr/YUV color format.
  • the spatial positions of the depth image and selection map may be collocated such that a spatial position in the depth image would correspond to the same spatial position of the selection map.
  • Packing the selection map and/or the depth image into a color channel, rather than spatially packing them side-by-side, is advantageous.
  • One advantage is that it enables the present embodiments to achieve a higher overall resolution given a certain resolution of the picture.
  • the maximum possible picture resolution is determined by the allowed codec level.
  • the present embodiments also utilize cross-component coding tools to compress the depth image and/or the selection map.
  • the present embodiments take advantage of the fact that the selection map and the depth image are collocated in different channels of a picture.
  • the cross-component coding tools of the codec can take advantage of the redundancies between the channels and be used to predict between the selection map and the depth image to improve compression efficiency.
  • VVC includes the cross-component adaptive loop filter (CC-ALF) and the cross-component linear model (CCLM) coding tools.
  • Figure 5 illustrates an example of packing the selection map 28 and the depth image 30 into different channels of picture 20 according to one embodiment.
  • a YCbCr/YUV color format with 4:2:0 color subsampling has been used.
  • the Cb and Cr channels 24, 26 have half the vertical and horizontal resolution compared to the luma channel 22.
  • the selection map 28 is packed into the luma channel 22, while the depth image 30 is packed into the Cb channel 24.
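  • A sketch of the packing shown in Figure 5, assuming an 8-bit YUV 4:2:0 layout, a binary selection map scaled to the full 8-bit range so it survives lossy coding, and a depth image already resampled to the chroma resolution; the unused Cr channel is simply filled with a neutral value:

```python
import numpy as np

def pack_picture(selection_map: np.ndarray, depth_image: np.ndarray):
    """Pack the selection map into the luma (Y) channel and the depth image into
    the Cb channel of a YUV 4:2:0 picture; the Cr channel is left neutral.

    selection_map: (H, W) array of 0/1 values.
    depth_image:   (H//2, W//2) array of 8-bit depth samples.
    """
    h, w = selection_map.shape
    assert depth_image.shape == (h // 2, w // 2), "depth must match chroma resolution"

    y_plane = (selection_map * 255).astype(np.uint8)            # map 0/1 -> 0/255
    cb_plane = depth_image.astype(np.uint8)
    cr_plane = np.full((h // 2, w // 2), 128, dtype=np.uint8)   # unused channel, mid-gray
    return y_plane, cb_plane, cr_plane

# The three planes would then be handed to an encoder conforming to a mainstream
# profile (e.g., AVC, HEVC, or VVC), which is outside the scope of this sketch.
```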
  • the color format may have three channels.
  • some parts of the selection map and/or the depth image may be packed into the third channel. This would be in addition to packing other parts of the selection map and the depth image into the first and second channels.
  • one embodiment of the present disclosure configures a sending device to pack an additional attribute of the content into the third channel (e.g., a second color channel in the YCbCr/YUV color format).
  • the additional attribute may be, for instance, a different type of selection map, an occlusion map indicating objects that are occluded in picture 20, an alpha channel indicating a transparency of a scene, a map associated with ambient light, and/or reflection or other material properties of a 3D object that is in the picture or scene.
  • the first picture is decoded from a bitstream.
  • the depth image and the selection map are then extracted from the first and second channels of the first picture.
  • the first and second channels are different channels in at least one embodiment.
  • at least one of the first and second channels is a color channel, and the two channels may or may not have the same resolution.
  • the depth image and selection map may correspond to the same content. Further, the spatial positions of both the depth image and the selection map may be collocated such that a spatial position in the depth image would correspond to the same spatial position of the selection map.
  • Obtaining a texture image is optional.
  • the present embodiments spatially pack that texture image into the first picture along with the depth image and selection map.
  • Figure 6 illustrates an embodiment in which a texture image 32 and selection map 28 have the same resolution (e.g., half of the full resolution), while the depth image 30 has a quarter resolution.
  • the present embodiments are not limited to any particular resolution ratio. Rather, according to the present disclosure, other ratios for the resolutions may be used.
  • Figure 7, for example illustrates an embodiment in which the resolution of the texture image 32 is higher than the resolutions of both the selection map 28 and the depth image 30.
  • the original aspect ratios of the texture image, the selection map, and/or the depth image may not necessarily be maintained once they are packed into the first picture.
  • for example, the vertical resolution of one of the images or the selection map may be decreased by a factor of 2, while the horizontal resolution is decreased by a factor of 4.
  • in embodiments where the texture image, the depth image, and/or the selection map do not have the same bit depth, they would have to be aligned prior to being packed into the same picture. This may be accomplished, for example, by converting the texture image, the depth image, and the selection map to a predetermined number of bits (e.g., 10 bits).
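  • The bit-depth alignment could, for instance, be a simple shift to the common 10-bit representation mentioned above; the helper below is an illustrative sketch, not a normative conversion:

```python
import numpy as np

def align_to_10_bits(samples: np.ndarray, source_bit_depth: int) -> np.ndarray:
    """Scale samples of an arbitrary bit depth to a 10-bit representation by shifting."""
    if source_bit_depth < 10:
        return samples.astype(np.uint16) << (10 - source_bit_depth)            # e.g., 8 -> 10 bits
    return (samples.astype(np.uint32) >> (source_bit_depth - 10)).astype(np.uint16)  # e.g., 16 -> 10

texture_8bit = np.array([0, 128, 255], dtype=np.uint8)
depth_16bit = np.array([0, 32768, 65535], dtype=np.uint16)
print(align_to_10_bits(texture_8bit, 8))    # [0, 512, 1020]
print(align_to_10_bits(depth_16bit, 16))    # [0, 512, 1023]
```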
  • each of the texture image, the depth image, and the selection map can be packed into the same picture.
  • This picture can then be compressed using a mainstream video codec with hardware acceleration, even on devices that only support hardware acceleration of a limited number of encoding/decoding instances.
  • This provides a benefit over many conventional devices, which typically only support hardware encoding/decoding of a single video stream at a time.
  • parallel video streams generally use what is commonly known as “best-effort” software encoding/decoding.
  • this functionality greatly decreases battery life, making scheduling, for example, more difficult and more complex.
  • the present embodiments are configured to compress all three of the texture image, the depth image, and the selection map into one stream. This beneficially avoids synchronization issues due to jitter on the receiving device. Additionally, in one embodiment of the present disclosure, the selection map and the depth image are packed to the left of or above the texture image (i.e., opposite to the packing illustrated in Figure 7). This enables the selection map and depth image to be decoded and post-processed earlier than they would be in the embodiment of Figure 6. One advantage of the earlier decoding and post-processing is that the overall latency may be reduced.
• decoding and extracting the parts of a picture may be realized using subpictures in VVC, for example.
• Subpictures are part of the VVC Main 10 profile.
  • the selection map and the depth image could be packed into separate components of a first subpicture, while the texture image could be mapped into a second subpicture.
• the first subpicture is positioned to the left of or on top of the second subpicture such that the first subpicture precedes the second subpicture in the bitstream.
  • the sending device may, in some embodiments, obtain a texture image to send to the receiving device in addition to the depth image and the selection map.
  • the texture image would be spatially packed with the depth image and selection map into the first picture.
  • the receiving device would decode the first picture before extracting the texture image from the first picture.
  • the texture image, the depth image, and selection map may correspond to the same content.
  • the spatial positions of the depth image, the selection map, and/or the texture image may be collocated such that a spatial position in the depth image would correspond to the same spatial position of the selection map and/or the texture image.
  • the sending device is beneficially configured to encode the first picture to a bitstream using an encoder that conforms to a mainstream profile of a mainstream codec.
  • the receiving device is configured to decode the first picture from the bitstream using a decoder that conforms to a mainstream profile of a mainstream codec.
  • one embodiment of the present disclosure configures the sending device to encode the first picture into the bitstream in conformance with a mainstream profile of a mainstream codec.
  • the first picture may be part of a video stream.
  • the mainstream codec may be a video codec.
  • the first picture may be a still image.
  • the mainstream codec could be a still image codec.
• packing the selection map and the depth image in different channels advantageously enables cross-component prediction. That is, the present embodiments can beneficially use the cross-component coding tools of the codec (e.g., CC-ALF and CCLM of VVC) to improve compression efficiency.
  • the depth image and the selection map need not be encoded with the same quality.
  • the quality settings of the encoding are set differently for the selection map and the depth image.
• HEVC- and VVC-compatible codecs include individual delta QP signaling for the luma and chroma channels.
  • these quality settings are adjustable, thereby beneficially allowing for the simple and efficient tuning of the quality between the luma channel and the chroma channels.
  • the selection map is encoded with a higher quality (e.g., a lower QP and/or higher resolution), than the depth image. Additionally, in some embodiments, the selection map is losslessly encoded, while the depth image is lossy encoded.
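• One non-normative way to realize such a per-channel quality difference with existing tooling is via chroma QP offsets. The sketch below assumes the packed sequence is stored in a file named packed_10bit.y4m, with the depth image in the luma plane and the selection map in a chroma plane, and drives the x265 encoder through ffmpeg; the file name, plane assignment, and the cbqpoffs/crqpoffs parameters are assumptions about the toolchain made only for illustration, not part of the signaling described herein.

```python
import subprocess

# Negative chroma QP offsets give the chroma channel(s) carrying the
# selection map a higher quality than the luma channel carrying the depth
# image (hypothetical layout and parameter values).
cmd = [
    "ffmpeg", "-i", "packed_10bit.y4m",
    "-c:v", "libx265",
    "-x265-params", "qp=32:cbqpoffs=-6:crqpoffs=-6",
    "packed.hevc",
]
subprocess.run(cmd, check=True)
```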
  • the receiving device can filter the depth image using the higher quality selection map, thereby better preserving the edges of a foreground object of the depth image.
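• A minimal sketch of this receiver-side filtering, assuming a binary selection map in which the value 1 marks the foreground object and 0 marks the background (a multi-valued selection map would use one mask per area):

```python
import numpy as np

def filter_depth_with_selection(depth: np.ndarray,
                                selection: np.ndarray,
                                foreground_value: int = 1) -> np.ndarray:
    """Keep depth samples only where the higher-quality selection map marks
    foreground, restoring a crisp object edge even if the depth edges were
    smoothed by compression. depth and selection are assumed collocated
    and of equal size."""
    mask = (selection == foreground_value)
    return np.where(mask, depth, 0)   # 0 = "no depth" outside the object
```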
• Another advantage is that the lower quality of a compressed depth image matters less in subsequent processing. For example, the smoothing effects associated with low-bitrate compression of a foreground object in the depth image are typically less noticeable after rendering.
  • the sending device is configured to convey information about the structure of the packed depth image, the selection map, and/or the texture image to the receiving device.
  • the sending device is configured to convey this information to the receiving device in a message signaled in the bitstream. In other embodiments, however, the information is conveyed to the receiving device in some other predetermined manner.
  • the information conveyed to the receiving device may be any information needed or desired.
• that information includes, but is not limited to, one or more of:
  • the sending device may be configured in one embodiment to signal this information to the receiving device in a message over the bitstream.
  • the message may be, for example, a Supplemental Enhancement Information (SEI) message.
  • the following tables illustrate example structure, syntax, and semantics for an SEI message according to at least one embodiment of the present disclosure.
• the information carried in the SEI message in at least one embodiment includes, but is not limited to, information about the component(s) (i.e., channel(s)), positions, heights, widths, and orientations of the images and maps packed into the first picture.
  • the structures in the following tables refer to both the images (i.e., the depth image and the texture image) and the selection map as “maps.”
  • “num_maps” indicates the number of depth images, texture images, and selection maps present in the packed structure.
  • the term “map” in the following tables is not associated solely with the selection map.
  • num_maps indicates the number of maps and/or images in the packed structure.
•	map_interpretation[ i ] indicates the interpretation of the map and/or image [ i ].
•	the value of map_interpretation[ i ] shall be in the range of 0 to N-1, inclusive, and has the interpretation specified in Table 2.
•	map_component[ i ] indicates the component into which map and/or image [ i ] is packed.
•	map_component[ i ] shall be in the range of 0 to 3, inclusive, and has the following interpretation:
•	map_pos_x[ i ] indicates the horizontal position of the top left corner of the map and/or image [ i ] in terms of luma samples.
•	map_pos_y[ i ] indicates the vertical position of the top left corner of the map and/or image [ i ] in terms of luma samples.
•	map_width[ i ] indicates the width of the map and/or image [ i ] in terms of luma samples.
•	map_height[ i ] indicates the height of the map and/or image [ i ] in terms of luma samples.
•	map_orientation[ i ] indicates the orientation of the map and/or image [ i ].
  • the value of map_orientation [i] shall be in the range of 0 to 3, inclusive, and has the following interpretation:
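• To make the structure above concrete, the following sketch serializes the listed fields into a raw payload. The fixed field widths (u(8)/u(16)) and the omission of the surrounding SEI/NAL framing are assumptions made purely for illustration; this is not the normative syntax of the SEI message.

```python
import struct
from dataclasses import dataclass
from typing import List

@dataclass
class PackedMap:
    interpretation: int  # e.g., 0 = depth image, 1 = selection map, 2 = texture (assumed coding)
    component: int       # 0..3, channel into which the map/image is packed
    pos_x: int           # top-left corner, in luma samples
    pos_y: int
    width: int           # in luma samples
    height: int
    orientation: int     # 0..3

def write_packing_payload(maps: List[PackedMap]) -> bytes:
    """Serialize the packing description using illustrative fixed-width fields."""
    payload = struct.pack(">B", len(maps))               # num_maps, u(8)
    for m in maps:
        payload += struct.pack(">BBHHHHB",
                               m.interpretation, m.component,
                               m.pos_x, m.pos_y,
                               m.width, m.height,
                               m.orientation)
    return payload

# e.g., texture in component 0, depth image in component 1, selection map in component 2
payload = write_packing_payload([
    PackedMap(0, 1, 0, 0, 640, 360, 0),
    PackedMap(1, 2, 0, 0, 640, 360, 0),
    PackedMap(2, 0, 640, 0, 1280, 720, 0),
])
```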
  • the sending device can be configured to signal this information using systems level signaling, such as with MPEG-DASH, the ISO base media file format or other standards based on the ISO base media file format, or by using WebSockets or HTTP signaling such as in WebRTC.
  • the receiving device is, in some embodiments, configured to use the depth image, selection map, and the texture to generate and render a volumetric video (e.g., a mesh or a point cloud). Additionally, however, the receiving device may also use metadata to generate and render the volumetric video. Such metadata includes, for example, parameters that are intrinsic to the camera that captured the images and may be provided to the receiving device by the sending device.
  • the sending device sends the bitstream comprising the encoded picture to the receiving device via a network, such as network 42.
•	Upon receipt of the bitstream, the receiving device performs further processing, including, as described above, decoding the picture from the bitstream and extracting the selection map, the depth image, and the texture image from the decoded picture.
  • the receiving device is a user device, for example, such as a mobile device, a tablet computer, or a HMD.
  • the receiving device is a node in the network (e.g., a cloud or edge computer server).
  • a network node functioning as the receiving device may, for example, be used to relay, transcode, or perform split rendering of a bitstream sent by the sending device.
  • the network may, for example, comprise a wired or wireless network (e.g., a 5G network).
  • the video data sent by the sending device may be transcoded by the receiving device (i.e., functioning as a network node) as part of a cloud service, for example.
  • the network node is configured to recompress (i.e., reencode) a representation of the volumetric video comprised in the video data and send the reencoded representation to another receiving device (e.g., a user’s mobile device or HMD).
  • the representation of the volumetric video comprises a full volumetric video.
  • the representation of the volumetric video is a reduced representation (e.g., a 2D representation of a viewport of the volumetric video).
  • the network node may be configured to encode the reduced representation to a 2D video stream and transmit that video stream to an end-user with a device capable of decoding a 2D video stream. This process is often referred to as “split rendering.”
  • the present disclosure configures a sending device to perform all or a subset of the following functions:
  • Obtain a selection map (e.g., an occupancy map);
•	the texture, depth image, and selection map may correspond to the same content. That is, together they may represent a volumetric image or video object;
•	Pack the depth image and the selection map into a first picture with a color format having at least two channels.
  • the depth image is packed in a first channel and the selection map is packed in a second channel that is different from the first channel.
  • at least one of the first and second channels is a color channel.
  • the channels may or may not have the same resolution, and the spatial positions of the depth image and selection map may be collocated such that a spatial position in the depth image corresponds to the same spatial position of the selection map;
  • the sending device may encode the first picture into a bitstream in conformance with a mainstream profile of a mainstream codec.
  • the sending device may encode the first picture such that the selection map is encoded with a higher quality, a lower QP, and/or a higher resolution than the depth image;
  • the receiving device may be a user device (e.g., a mobile device or HMD) or a computer node in the network.
  • the present disclosure configures a receiving device to perform all or a subset of the following functions:
•	Receive a bitstream. The bitstream, which may be a video bitstream, for example, may conform to a mainstream profile of a mainstream codec;
•	Decode a first picture from the bitstream. Decoding may be performed by a decoder that conforms to a mainstream profile of a mainstream codec;
  • the first picture has a color format having two or more channels (e.g., a YCbCr/YUV or RGB color format);
  • the second channel in at least one embodiment, is different from the first channel.
  • the selection map may have been encoded with a higher quality, a lower QP, and/or a higher resolution than the depth image.
  • At least one of the first and second channels may be a color channel, and the channels may or may not have the same resolution.
  • the spatial positions of the depth image and the selection map may be collocated such that a spatial position in the depth image would correspond to the same spatial position of the selection map.
  • the sending device spatially packed the texture image with the depth image and selection map.
  • the texture image, the depth image, and the selection map may correspond to the same content.
•	Metadata, such as intrinsic camera parameters, may also be provided for this purpose.
  • one embodiment of the present disclosure configures a network node operating in network 44 to transcode the bitstream received from sending device 50 before sending a representation of a volumetric video to receiving device 70.
  • the network node is considered to be a first receiving device 120 while receiving device 70 is considered to be a second receiving device.
  • the first receiving device 120 transcodes the video data received from sending device 50 in bitstream 44 as a part of a cloud service, for example.
  • a representation of the volumetric video comprised in the video data is then reencoded/recompressed by the first receiving device 120 and sent to the receiving device 70.
  • the representation comprises the full volumetric video.
  • the representation is a reduced representation (e.g., a 2D representation of a viewport of the volumetric video).
  • the reduced representation may be encoded to a 2D video stream and transmitted to an end-user device capable of decoding the 2D video stream.
•	first receiving device 120 comprises a network node such as a cloud server or edge server located in network 42. As seen in Figure 8, first receiving device 120 comprises a receiver 122, one or more decoders 124, post-processing circuitry 126, one or more encoders 128, and a transmitter 130. Each of these components is similar to those described previously with respect to sending device 50 and receiving device 70, and thus, is not described in detail here.
•	first receiving device 120 receives one or more bitstreams 44 from sending device 50, decodes the one or more bitstreams, and generates the volumetric video as previously described. However, in this embodiment, first receiving device 120 does not render the volumetric video to a display. Instead, the volumetric video is re-encoded by the one or more encoders 128 and sent to receiving device 70 in one or more bitstreams 132. In these embodiments, the re-encoded volumetric video may comprise the full volumetric video or a reduced representation of the volumetric video. In one embodiment, where the reduced representation of the volumetric video is generated and sent to receiving device 70, a viewport is first selected for the reduced representation of the volumetric video (e.g., a 2D video picture).
  • the reduced representation of the volumetric video is then encoded using a conventional mainstream video codec, for example, and sent to receiving device 70.
  • the receiving device 70 may be capable of decoding and displaying a 2D video stream but may not be capable of decoding multiple streams and rendering a volumetric video.
  • a receiving device 120 may perform all or a subset of the following functions according to one or more embodiments of the present disclosure, as previously described.
•	Generate a volumetric video (e.g., a mesh or a point cloud);
  • the representation may comprise the full volumetric video or a reduced representation of the volumetric video (e.g., a viewport of the volumetric video encoded as a 2D video); and
  • FIG. 9 is a flow diagram illustrating a method 140 for processing video data, and more specifically, video data for use in generating volumetric video, for example.
  • Method 140 is implemented by sending device 50 and comprises, in one embodiment, sending device 50 obtaining a depth image from a depth image source (box 142), and obtaining a selection map (box 144).
  • the depth image comprises information associated with a distance of an object in a picture from a viewpoint, and the selection map represents the picture partitioned into a plurality of areas.
  • method 140 also comprises sending device 50 obtaining a texture image from a texture image source (box 146).
  • a texture image is not required for this embodiment of the present disclosure, it can be utilized at a receiving device to generate volumetric video, as will be described later in more detail.
•	Method 140 determines whether the depth image, the selection map, and, when obtained, the texture image all have the same bit depth (box 148). If so, sending device 50 packs the depth image and the selection map into a first picture having a color format with a plurality of channels (box 152). In this embodiment, the depth image and the selection map are packed into first and second channels, respectively. At least one of the first and second channels is a color channel. If the bit depths are not the same, however, method 140 calls for sending device 50 to first align the depth image, the selection map, and, when obtained, the texture image (box 150) and then perform the packing (box 152). In embodiments where sending device 50 has obtained a texture image, method 140 then calls for sending device 50 to additionally spatially pack the texture image into the first picture (box 154).
  • method 140 also calls for sending device 50 to send information associated with the packing of one or more of the depth image, the selection map, and the texture image to the receiving device over the network (box 156), as well as metadata (box 158).
•	the metadata, in one embodiment, comprises parameters intrinsic to a camera that captured one or both of the depth image and the texture image.
  • method 140 calls for sending device 50 encoding the first picture to send to the receiving device (box 160) and sending the encoded first picture in a bitstream to the receiving device over a network (box 162).
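• A minimal sender-side sketch of the packing step of method 140 (boxes 148-154), assuming one possible layout in which the texture image fills the luma plane while the depth image and the selection map fill the two chroma planes of a 4:2:0 picture; the disclosure equally covers spatial packing of the texture image next to the depth image and selection map (cf. Figures 6 and 7), so this layout is only an example.

```python
import numpy as np

def pack_first_picture(texture: np.ndarray,
                       depth: np.ndarray,
                       selection: np.ndarray):
    """Pack texture, depth image, and selection map into the planes of a
    single picture (texture -> Y, depth -> Cb, selection map -> Cr).
    All inputs are assumed to already share one bit depth (box 150),
    with the chroma planes half the luma resolution (4:2:0)."""
    h, w = texture.shape
    assert depth.shape == (h // 2, w // 2) == selection.shape
    y_plane = texture.copy()
    cb_plane = depth.copy()
    cr_plane = selection.copy()
    return y_plane, cb_plane, cr_plane

# Example: a 1280x720 texture with quarter-size (640x360) depth and selection map
texture = np.zeros((720, 1280), dtype=np.uint16)
depth = np.zeros((360, 640), dtype=np.uint16)
selection = np.zeros((360, 640), dtype=np.uint16)
planes = pack_first_picture(texture, depth, selection)
```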
  • the depth image is obtained from a depth image source, such as a camera, for example.
  • Figure 10 is a flow diagram illustrating a method 170 for obtaining the selection map.
  • the selection map is obtained from a sensor (box 172).
  • the sensor may be, for example, the camera that captured the depth image.
  • the selection map is obtained from a file in a storage device (box 174). Additionally, in the same or different embodiment, the selection map may be derived from the depth image obtained by sending device 50 (box 176).
  • FIG 11 is a flow diagram illustrating a method 180 for deriving the selection map from the depth image.
  • sending device 50 partitions the depth image into a plurality of areas based on depth values of the depth image (box 182). So partitioned, sending device 50 then generates the selection map to comprise a plurality of unique sample values (box 184), each of which corresponds to one of the plurality of areas of the partitioned depth image.
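• A minimal sketch of boxes 182-184, assuming the areas are formed by thresholding the depth values; the threshold values are illustrative and would in practice depend on the scene:

```python
import numpy as np

def derive_selection_map(depth: np.ndarray, thresholds=(1500,)) -> np.ndarray:
    """Partition the depth image into areas based on its depth values and
    assign a unique sample value to each area. A single threshold yields a
    binary map; more thresholds yield more areas."""
    selection = np.zeros(depth.shape, dtype=np.uint8)
    for area_value, t in enumerate(sorted(thresholds), start=1):
        selection[depth >= t] = area_value
    return selection

# Example usage with a dummy depth image
depth = np.random.randint(0, 4000, size=(360, 640), dtype=np.uint16)
selection = derive_selection_map(depth, thresholds=(1500,))
```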
  • the second channel precedes the first channel in one or both of a bitstream and processing order.
  • the second channel follows the first channel in one or both of a bitstream and processing order.
  • the plurality of channels comprises a third channel.
  • the third channel comprises a second color channel, and one or more content attributes are packed into the third channel.
  • the one or more content attributes comprise information associated with one or more of:
  • an ambient lighting map comprising information associated with ambient light
  • the depth image and the selection map are packed into the first picture such that the depth image and the selection map are decoded and post-processed prior to the texture image.
  • the first picture comprises a plurality of subpictures.
  • the depth image and the selection map are packed into separate components of a first subpicture and the texture image is packed into a second subpicture.
  • the information associated with the packing signaled to the receiving device comprises any information needed or desired.
  • such information identifies one or more of:
  • method 190 calls for the receiving device to obtain a first picture having a color format with a plurality of channels (box 192).
  • a depth image and a selection map are packed into first and second channels of the first picture, respectively, and at least one of the first and second channels is a color channel.
  • the receiving device also receives information associated with the packing of the depth image, the selection map, and in some embodiments, the texture image (box 194).
  • method 190 calls for the receiving device to decode the first picture from a first bitstream (box 196), and when provided, the information associated with the packing from the first bitstream (box 198).
  • receiving device may also receive metadata (e.g., the intrinsic camera parameters) from the sending device (box 200), although this is not required.
  • method 190 calls for the receiving device to extract both the depth image (box 202) and the selection map (box 204) from the first and second channels of the first picture, respectively.
  • the receiving device is a user device (e.g., when receiving device 70 is a mobile device or a HMD)
  • method 190 calls for receiving device 70 to generate volumetric video based on the depth image and the selection map (box 208).
  • the generated volumetric video is rendered to the user’s associated display device (box 210).
  • a texture image is spatially packed into the first picture with the depth image and the selection map.
  • the receiving device extracts the texture image from the first picture, and uses the texture image to generate the volumetric video.
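• As a minimal sketch of how the extracted depth image, selection map, texture image, and intrinsic camera parameters might be combined into a point cloud at the receiving device (a pinhole camera model is assumed; mesh generation and rendering are outside the scope of this example):

```python
import numpy as np

def depth_to_point_cloud(depth: np.ndarray,
                         selection: np.ndarray,
                         texture: np.ndarray,
                         fx: float, fy: float, cx: float, cy: float):
    """Back-project the depth image into a colored point cloud, using the
    selection map to keep only foreground samples and the intrinsic camera
    parameters (fx, fy, cx, cy) provided as metadata. depth, selection, and
    texture are assumed collocated and equally sized; for simplicity the
    texture carries one color value per sample."""
    v, u = np.nonzero(selection)          # foreground sample positions
    z = depth[v, u].astype(np.float32)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=1)  # N x 3 positions
    colors = texture[v, u]                # N color values
    return points, colors

# Example with dummy, collocated planes and assumed intrinsics
depth = np.ones((360, 640), dtype=np.float32)
selection = np.zeros((360, 640), dtype=np.uint8)
selection[100:200, 200:400] = 1
texture = np.full((360, 640), 128, dtype=np.uint8)
points, colors = depth_to_point_cloud(depth, selection, texture,
                                      fx=500.0, fy=500.0, cx=320.0, cy=180.0)
```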
  • the first picture is received in a first bitstream from the sending device over a network.
  • method 190 calls for the receiving device to decode the first picture from the first bitstream.
  • the receiving device further receives information from the sending device over the network.
  • the information is associated with the packing of one or more of the depth image, the selection map, and the texture image and identifies one or more of:
  • the selection map is encoded with a lower quantization parameter (QP) and/or a higher resolution than the depth image.
  • the plurality of channels comprises exactly three channels.
  • the color format of the first picture is one of a Red Green Blue (RGB) color format and a YUV color format.
  • the depth image and the selection map are packed into the first picture for encoding using a multi-component color format comprising a YUV format.
  • the depth image is signaled in the luminance component of the YUV format, and the selection map is signaled in one of the color components of the YUV format;
  • the selection map is signaled in the luminance component of the YUV format, and the depth image is signaled in one of the color components of the YUV format;
•	the selection map is signaled in a first color component of the YUV format, and the depth image is signaled in a second color component of the YUV format.
  • At least two channels of the plurality of channels have different resolutions.
  • spatial positions of the depth image and the selection map are collocated in the first picture such that a spatial position of the depth image corresponds to a spatial position of the selection map.
  • the selection map comprises a plurality of sample map values and is one of:
  • each sample map value comprises one of two possible sample map values
  • each sample map value comprises one of three or more possible sample map values
  • the volumetric video is one of a mesh and a point cloud.
  • the sending device comprises one of:
•	a Head Mounted Display (HMD) for Virtual Reality (VR) or augmented reality (AR); or
  • a wearable device comprising one or more sensors.
  • the receiving device comprises one of:
•	a Head Mounted Display (HMD) for Virtual Reality (VR) or augmented reality (AR); or
  • a wearable device comprising one or more sensors.
  • information about the packing of the depth image and the selection map into the first picture is signaled to the receiving device.
  • a first receiving device 120 in the communications network 42 implements method 220.
  • the first receiving device 120 is configured to perform the same functions described with respect to method 190 of Figure 12, as well as some additional functions, which are illustrated in Figure 13.
•	method 220 of Figure 13 is largely the same as method 190 previously described with respect to Figure 12, except for the generating and rendering functions (boxes 208, 210). Therefore, functions that are the same in Figures 12 and 13 are indicated using the same reference numbers and, further, are not detailed here.
•	As seen in Figure 13, once the first receiving device 120 has extracted the depth image, the selection map, and, when provided, the texture image (boxes 202, 204, 206), the first receiving device 120 generates a second bitstream comprising a representation of either the mesh or the point cloud (box 222). So generated, the first receiving device 120 sends the second bitstream to a second receiving device, which in one embodiment is receiving device 70 (box 224).
  • an apparatus can perform any of the methods herein described by implementing any functional means, modules, units, or circuitry.
  • the apparatuses comprise respective circuits or circuitry configured to perform the steps shown in the method figures.
  • the circuits or circuitry in this regard may comprise circuits dedicated to performing certain functional processing and/or one or more microprocessors in conjunction with memory.
•	the circuitry may include one or more microprocessors or microcontrollers, as well as other digital hardware, which may include Digital Signal Processors (DSPs), special-purpose digital logic, and the like.
  • the processing circuitry may be configured to execute program code stored in memory, which may include one or several types of memory such as read-only memory (ROM), random-access memory, cache memory, flash memory devices, optical storage devices, etc.
  • Program code stored in memory may include program instructions for executing one or more telecommunications and/or data communications protocols as well as instructions for carrying out one or more of the techniques described herein, in several embodiments.
  • the memory stores program code that, when executed by the one or more processors, carries out the techniques described herein.
  • Figure 14 illustrates an example sending device 50 configured for processing video data, and more specifically, for packing video data to send to a receiving device.
  • sending device 50 comprises processing circuitry 400, a memory 402, and communication circuitry 406.
  • the communication circuitry 406 comprises the hardware required for communicating with a receiving device 70, 120 via network 42.
  • sending device 50 is capable of wireless communications, and thus, comprises the radio frequency (RF) circuitry needed for transmitting and receiving signals over a wireless communication channel.
•	sending device 50 may also be coupled to one or more antennas (not shown).
  • sending device 50 is configured to communicate via a wire interface.
  • communication circuitry 406 may comprise an ETHERNET or similar interface.
  • the processing circuitry 400 controls the overall operation of the sending device 50 and processes the signals and data that is transmitted to or received by the sending device 50.
  • the processing circuitry 400 may comprise one or more microprocessors, hardware, firmware, or a combination thereof.
•	the processing circuitry 400 in one embodiment is configured to implement any of the methods of any of Figures 9-11.
  • Memory 402 comprises both volatile and non-volatile memory for storing computer program code and data needed by the processing circuit 400 for operation.
  • Memory 402 may comprise any tangible, non-transitory computer-readable storage medium for storing data including electronic, magnetic, optical, electromagnetic, or semiconductor data storage.
  • Memory 402 stores a computer program 404 comprising executable instructions that configure the processing circuitry 400 to implement any of the methods of any of Figures 9-11.
  • a computer program 404 in this regard may comprise one or more code modules corresponding to the means or units described above.
  • computer program instructions and configuration information are stored in a non-volatile memory, such as a ROM, erasable programmable read only memory (EPROM) or flash memory.
  • Temporary data generated during operation may be stored in a volatile memory, such as a random access memory (RAM).
  • computer program 404 for configuring the processing circuit 400 as herein described may be stored in a removable memory, such as a portable compact disc, portable digital video disc, or other removable media.
  • the computer program 404 may also be embodied in a carrier such as an electronic signal, optical signal, radio signal, or computer readable storage medium.
  • FIG 15 illustrates an example receiving device 70 configured for processing video data received from sending device 50.
  • receiving device 70 comprises processing circuitry 500, a memory 502, and communication circuitry 506.
  • the communication circuitry 506 comprises the hardware required for communicating with sending device 50 and/or the first receiving device 120 via network 42.
  • receiving device 70 is capable of wireless communications, and thus, comprises the radio frequency (RF) circuitry needed for transmitting and receiving signals over a wireless communication channel.
•	receiving device 70 may also be coupled to one or more antennas (not shown).
  • receiving device 70 is configured to communicate via a wire interface.
  • communication circuitry 506 may comprise an ETHERNET or similar interface.
  • the processing circuitry 500 controls the overall operation of the receiving device 70 and processes the signals and data that is transmitted to or received by the receiving device 70.
  • the processing circuitry 500 may comprise one or more microprocessors, hardware, firmware, or a combination thereof.
  • the processing circuitry 500 in one embodiment is configured to implement any of the methods of any of Figures 12-13.
  • Memory 502 comprises both volatile and non-volatile memory for storing computer program code and data needed by the processing circuit 500 for operation.
  • Memory 502 may comprise any tangible, non-transitory computer-readable storage medium for storing data including electronic, magnetic, optical, electromagnetic, or semiconductor data storage.
  • Memory 502 stores a computer program 504 comprising executable instructions that configure the processing circuitry 500 to implement any of the methods of any of Figures 12-13.
  • a computer program 504 in this regard may comprise one or more code modules corresponding to the means or units described above.
  • computer program instructions and configuration information are stored in a non-volatile memory, such as a ROM, erasable programmable read only memory (EPROM) or flash memory.
  • Temporary data generated during operation may be stored in a volatile memory, such as a random-access memory (RAM).
  • computer program 504 for configuring the processing circuit 500 as herein described may be stored in a removable memory, such as a portable compact disc, portable digital video disc, or other removable media.
  • the computer program 504 may also be embodied in a carrier such as an electronic signal, optical signal, radio signal, or computer readable storage medium.
  • FIG 16 illustrates an example first receiving device 120 configured for processing video data received from sending device 50, and further, for transcoding the video data to send to a second receiving device, such as receiving device 70, for example.
  • receiving device 120 comprises processing circuitry 600, a memory 602, and communication circuitry 606.
  • the communication circuitry 606 comprises a network interface having the hardware required for communicating with sending device 50 and/or the second receiving device 70 via network 42.
  • receiving device 120 is capable of wireless communications, and thus, comprises the radio frequency (RF) circuitry needed for transmitting and receiving signals over a wireless communication channel.
•	receiving device 120 may also be coupled to one or more antennas (not shown).
  • receiving device 120 is configured to communicate via a wire interface.
  • communication circuitry 606 may comprise an ETHERNET or similar interface.
  • the processing circuitry 600 controls the overall operation of the receiving device 120 and processes the signals and data that is transmitted to or received by the receiving device 120.
  • the processing circuitry 600 may comprise one or more microprocessors, hardware, firmware, or a combination thereof.
  • the processing circuitry 600 in one embodiment is configured to implement method 220 of Figure 13.
  • Memory 602 comprises both volatile and non-volatile memory for storing computer program code and data needed by the processing circuit 600 for operation.
  • Memory 602 may comprise any tangible, non-transitory computer-readable storage medium for storing data including electronic, magnetic, optical, electromagnetic, or semiconductor data storage.
  • Memory 602 stores a computer program 604 comprising executable instructions that configure the processing circuitry 600 to implement method 220 of Figure 13.
  • a computer program 604 in this regard may comprise one or more code modules corresponding to the means or units described above.
  • computer program instructions and configuration information are stored in a non-volatile memory, such as a ROM, erasable programmable read only memory (EPROM) or flash memory.
  • Temporary data generated during operation may be stored in a volatile memory, such as a random-access memory (RAM).
  • computer program 604 for configuring the processing circuit 600 as herein described may be stored in a removable memory, such as a portable compact disc, portable digital video disc, or other removable media.
  • the computer program 604 may also be embodied in a carrier such as an electronic signal, optical signal, radio signal, or computer readable storage medium.
  • a computer program comprises instructions which, when executed on at least one processor of an apparatus, cause the apparatus to carry out any of the respective processes described above.
  • a computer program in this regard may comprise one or more code modules corresponding to the means or units described above.
  • Embodiments further include a carrier containing such a computer program.
  • This carrier may comprise one of an electronic signal, optical signal, radio signal, or computer readable storage medium.
  • embodiments herein also include a computer program product stored on a non-transitory computer readable (storage or recording) medium and comprising instructions that, when executed by a processor of an apparatus, cause the apparatus to perform as described above.
  • Embodiments further include a computer program product comprising program code portions for performing the steps of any of the embodiments herein when the computer program product is executed by a computing device.
  • This computer program product may be stored on a computer readable recording medium. Additional embodiments will now be described. At least some of these embodiments may be described as applicable in certain contexts and/or wireless network types for illustrative purposes, but the embodiments are similarly applicable in other contexts and/or wireless network types not explicitly described.
  • Figure 17 shows an example of a communication system 1100 in accordance with some embodiments.
  • the communication system 1100 includes a telecommunication network 1102 that includes an access network 1104, such as a radio access network (RAN), and a core network 1106, which includes one or more core network nodes 1108.
  • the access network 1104 includes one or more access network nodes, such as network nodes 1110a and 1110b (one or more of which may be generally referred to as network nodes 1110), or any other similar 3rd Generation Partnership Project (3GPP) access nodes or non-3GPP access points.
  • a network node is not necessarily limited to an implementation in which a radio portion and a baseband portion are supplied and integrated by a single vendor.
  • the telecommunication network 1102 includes one or more Open-RAN (ORAN) network nodes.
  • An ORAN network node is a node in the telecommunication network 1102 that supports an ORAN specification (e.g., a specification published by the O-RAN Alliance, or any similar organization) and may operate alone or together with other nodes to implement one or more functionalities of any node in the telecommunication network 1102, including one or more network nodes 1110 and/or core network nodes 1108.
•	Examples of an ORAN network node include an open radio unit (O-RU), an open distributed unit (O-DU), an open central unit (O-CU), including an O-CU control plane (O-CU-CP) or an O-CU user plane (O-CU-UP), a RAN intelligent controller (near-real time or non-real time) hosting software or software plug-ins, such as a near-real time control application (e.g., xApp) or a non-real time control application (e.g., rApp), or any combination thereof (the adjective “open” designating support of an ORAN specification).
•	the network node may support a specification by, for example, supporting an interface defined by the ORAN specification, such as an A1, F1, W1, E1, E2, X2, Xn interface, an open fronthaul user plane interface, or an open fronthaul management plane interface.
  • an ORAN access node may be a logical node in a physical node.
  • an ORAN network node may be implemented in a virtualization environment (described further below) in which one or more network functions are virtualized.
•	the virtualization environment may include an O-Cloud computing platform orchestrated by a Service Management and Orchestration Framework via an O2 interface defined by the O-RAN Alliance or comparable technologies.
  • the network nodes 1110 facilitate direct or indirect connection of user equipment (UE), such as by connecting UEs 1112A, 1112B, 1112C, and 1112D (one or more of which may be generally referred to as UEs 1112) to the core network 1106 over one or more wireless connections.
  • Example wireless communications over a wireless connection include transmitting and/or receiving wireless signals using electromagnetic waves, radio waves, infrared waves, and/or other types of signals suitable for conveying information without the use of wires, cables, or other material conductors.
  • the communication system 1100 may include any number of wired or wireless networks, network nodes, UEs, and/or any other components or systems that may facilitate or participate in the communication of data and/or signals whether via wired or wireless connections.
  • the communication system 1100 may include and/or interface with any type of communication, telecommunication, data, cellular, radio network, and/or other similar type of system.
  • the UEs 1112 may be any of a wide variety of communication devices, including wireless devices arranged, configured, and/or operable to communicate wirelessly with the network nodes 1110 and other communication devices.
  • the network nodes 1110 are arranged, capable, configured, and/or operable to communicate directly or indirectly with the UEs 1112 and/or with other network nodes or equipment in the telecommunication network 1102 to enable and/or provide network access, such as wireless network access, and/or to perform other functions, such as administration in the telecommunication network 1102.
  • the core network 1106 connects the network nodes 1110 to one or more hosts, such as host 1116. These connections may be direct or indirect via one or more intermediary networks or devices. In other examples, network nodes may be directly coupled to hosts.
•	the core network 1106 includes one or more core network nodes (e.g., core network node 1108) that are structured with hardware and software components. Features of these components may be substantially similar to those described with respect to the UEs, network nodes, and/or hosts, such that the descriptions thereof are generally applicable to the corresponding components of the core network node 1108.
  • Example core network nodes include functions of one or more of a Mobile Switching Center (MSC), Mobility Management Entity (MME), Home Subscriber Server (HSS), Access and Mobility Management Function (AMF), Session Management Function (SMF), Authentication Server Function (AUSF), Subscription Identifier De-concealing function (SIDF), Unified Data Management (UDM), Security Edge Protection Proxy (SEPP), Network Exposure Function (NEF), and/or a User Plane Function (UPF).
  • the host 1116 may be under the ownership or control of a service provider other than an operator or provider of the access network 1104 and/or the telecommunication network 1102, and may be operated by the service provider or on behalf of the service provider.
  • the host 1116 may host a variety of applications to provide one or more service. Examples of such applications include live and pre-recorded audio/video content, data collection services such as retrieving and compiling data on various ambient conditions detected by a plurality of UEs, analytics functionality, social media, functions for controlling or otherwise interacting with remote devices, functions for an alarm and surveillance center, or any other such function performed by a server.
  • the communication system 1100 of Figure 17 enables connectivity between the UEs, network nodes, and hosts.
•	the communication system may be configured to operate according to predefined rules or procedures, such as specific standards that include, but are not limited to: Global System for Mobile Communications (GSM); Universal Mobile Telecommunications System (UMTS); Long Term Evolution (LTE), and/or other suitable 2G, 3G, 4G, 5G standards, or any applicable future generation standard (e.g., 6G); wireless local area network (WLAN) standards, such as the Institute of Electrical and Electronics Engineers (IEEE) 802.11 standards (WiFi); and/or any other appropriate wireless communication standard, such as the Worldwide Interoperability for Microwave Access (WiMax), Bluetooth, Z-Wave, Near Field Communication (NFC), ZigBee, LiFi, and/or any low-power wide-area network (LPWAN) standards such as LoRa and Sigfox.
•	the telecommunication network 1102 is a cellular network that implements 3GPP standardized features. Accordingly, the telecommunications network 1102 may support network slicing to provide different logical networks to different devices that are connected to the telecommunication network 1102. For example, the telecommunications network 1102 may provide Ultra Reliable Low Latency Communication (URLLC) services to some UEs, while providing Enhanced Mobile Broadband (eMBB) services to other UEs, and/or Massive Machine Type Communication (mMTC)/Massive IoT services to yet further UEs.
  • the UEs 1112 are configured to transmit and/or receive information without direct human interaction.
  • a UE may be designed to transmit information to the access network 1104 on a predetermined schedule, when triggered by an internal or external event, or in response to requests from the access network 1104.
  • a UE may be configured for operating in single- or multi-RAT or multi-standard mode.
  • a UE may operate with any one or combination of Wi-Fi, NR (New Radio) and LTE, i.e. being configured for multi-radio dual connectivity (MR-DC), such as E-UTRAN (Evolved-UMTS Terrestrial Radio Access Network) New Radio - Dual Connectivity (EN-DC).
  • the hub 1114 communicates with the access network 1104 to facilitate indirect communication between one or more UEs (e.g., UE 1112c and/or 1112D) and network nodes (e.g., network node 1110B).
  • the hub 1114 may be a controller, router, content source and analytics, or any of the other communication devices described herein regarding UEs.
  • the hub 1114 may be a broadband router enabling access to the core network 1106 for the UEs.
  • the hub 1114 may be a controller that sends commands or instructions to one or more actuators in the UEs.
  • the hub 1114 may be a data collector that acts as temporary storage for UE data and, in some embodiments, may perform analysis or other processing of the data.
  • the hub 1114 may be a content source. For example, for a UE that is a VR headset, display, loudspeaker or other media delivery device, the hub 1114 may retrieve VR assets, video, audio, or other media or data related to sensory information via a network node, which the hub 1114 then provides to the UE either directly, after performing local processing, and/or after adding additional local content.
•	the hub 1114 acts as a proxy server or orchestrator for the UEs, in particular if one or more of the UEs are low energy IoT devices.
  • the hub 1114 may have a constant/persistent or intermittent connection to the network node 1110B.
  • the hub 1114 may also allow for a different communication scheme and/or schedule between the hub 1114 and UEs (e.g., UE 1112C and/or 1112D), and between the hub 1114 and the core network 1106.
  • the hub 1114 is connected to the core network 1106 and/or one or more UEs via a wired connection.
  • the hub 1114 may be configured to connect to an M2M service provider over the access network 1104 and/or to another UE over a direct connection.
  • UEs may establish a wireless connection with the network nodes 1110 while still connected via the hub 1114 via a wired or wireless connection.
  • the hub 1114 may be a dedicated hub - that is, a hub whose primary function is to route communications to/from the UEs from/to the network node 1110B.
  • the hub 1114 may be a non-dedicated hub - that is, a device which is capable of operating to route communications between the UEs and network node 1110B, but which is additionally capable of operating as a communication start and/or end point for certain data channels.
  • a UE refers to a device capable, configured, arranged and/or operable to communicate wirelessly with network nodes and/or other UEs.
•	Examples of a UE include, but are not limited to, a smart phone, mobile phone, cell phone, voice over IP (VoIP) phone, wireless local loop phone, desktop computer, personal digital assistant (PDA), wireless camera, gaming console or device, music storage device, playback appliance, wearable terminal device, wireless endpoint, mobile station, tablet, laptop, laptop-embedded equipment (LEE), laptop-mounted equipment (LME), smart device, wireless customer-premise equipment (CPE), vehicle, vehicle-mounted or vehicle embedded/integrated wireless device, etc.
•	Other examples include any UE identified by the 3rd Generation Partnership Project (3GPP), including a narrow band internet of things (NB-IoT) UE, a machine type communication (MTC) UE, and/or an enhanced MTC (eMTC) UE.
  • a UE may support device-to-device (D2D) communication, for example by implementing a 3GPP standard for sidelink communication, Dedicated Short-Range Communication (DSRC), vehicle-to-vehicle (V2V), vehicle-to-infrastructure (V2I), or vehicle-to-everything (V2X).
  • a UE may not necessarily have a user in the sense of a human user who owns and/or operates the relevant device.
  • a UE may represent a device that is intended for sale to, or operation by, a human user but which may not, or which may not initially, be associated with a specific human user (e.g., a smart sprinkler controller).
  • a UE may represent a device that is not intended for sale
  • the UE 1200 includes processing circuitry 1202 that is operatively coupled via a bus 1204 to an input/output interface 1206, a power source 1208, a memory 1210, a communication interface 1212, and/or any other component, or any combination thereof.
  • Certain UEs may utilize all or a subset of the components shown in Figure 18. The level of integration between the components may vary from one UE to another UE. Further, certain UEs may contain multiple instances of a component, such as multiple processors, memories, transceivers, transmitters, receivers, etc.
  • the processing circuitry 1202 is configured to process instructions and data and may be configured to implement any sequential state machine operative to execute instructions stored as machine-readable computer programs in the memory 1210.
  • the processing circuitry 1202 may be implemented as one or more hardware-implemented state machines (e.g., in discrete logic, field-programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), etc.); programmable logic together with appropriate firmware; one or more stored computer programs, general-purpose processors, such as a microprocessor or digital signal processor (DSP), together with appropriate software; or any combination of the above.
  • the processing circuitry 1202 may include multiple central processing units (CPUs).
  • the input/output interface 1206 may be configured to provide an interface or interfaces to an input device, output device, or one or more input and/or output devices.
  • Examples of an output device include a speaker, a sound card, a video card, a display, a monitor, a printer, an actuator, an emitter, a smartcard, another output device, or any combination thereof.
  • An input device may allow a user to capture information into the UE 1200.
  • Examples of an input device include a touch-sensitive or presence-sensitive display, a camera (e.g., a digital camera, a digital video camera, a web camera, etc.), a microphone, a sensor, a mouse, a trackball, a directional pad, a trackpad, a scroll wheel, a smartcard, and the like.
  • the presence-sensitive display may include a capacitive or resistive touch sensor to sense input from a user.
  • a sensor may be, for instance, an accelerometer, a gyroscope, a tilt sensor, a force sensor, a magnetometer, an optical sensor, a proximity sensor, a biometric sensor, etc., or any combination thereof.
  • An output device may use the same type of interface port as an input device. For example, a Universal Serial Bus (USB) port may be used to provide an input device and an output device.
  • the power source 1208 is structured as a battery or battery pack. Other types of power sources, such as an external power source (e.g., an electricity outlet), photovoltaic device, or power cell, may be used.
  • the power source 1208 may further include power circuitry for delivering power from the power source 1208 itself, and/or an external power source, to the various parts of the UE 1200 via input circuitry or an interface such as an electrical power cable. Delivering power may be, for example, for charging of the power source 1208.
  • Power circuitry may perform any formatting, converting, or other modification to the power from the power source 1208 to make the power suitable for the respective components of the UE 1200 to which power is supplied.
  • the memory 1210 may be or be configured to include memory such as random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic disks, optical disks, hard disks, removable cartridges, flash drives, and so forth.
  • the memory 1210 includes one or more application programs 1214, such as an operating system, web browser application, a widget, gadget engine, or other application, and corresponding data 1216.
  • the memory 1210 may store, for use by the UE 1200, any of a variety of various operating systems or combinations of operating systems.
  • the memory 1210 may be configured to include a number of physical drive units, such as redundant array of independent disks (RAID), flash memory, USB flash drive, external hard disk drive, thumb drive, pen drive, key drive, high-density digital versatile disc (HD-DVD) optical disc drive, internal hard disk drive, Blu-Ray optical disc drive, holographic digital data storage (HDDS) optical disc drive, external mini-dual in-line memory module (DIMM), synchronous dynamic random access memory (SDRAM), external micro-DIMM SDRAM, smartcard memory such as tamper resistant module in the form of a universal integrated circuit card (UICC) including one or more subscriber identity modules (SIMs), such as a USIM and/or ISIM, other memory, or any combination thereof.
•	the UICC may for example be an embedded UICC (eUICC), integrated UICC (iUICC) or a removable UICC commonly known as ‘SIM card.’
  • the memory 1210 may allow the UE 1200 to access instructions, application programs and the like, stored on transitory or non-transitory memory media, to off-load data, or to upload data.
  • An article of manufacture, such as one utilizing a communication system may be tangibly embodied as or in the memory 1210, which may be or comprise a device-readable storage medium.
  • the processing circuitry 1202 may be configured to communicate with an access network or other network using the communication interface 1212.
  • the communication interface 1212 may comprise one or more communication subsystems and may include or be communicatively coupled to an antenna 1222.
  • the communication interface 1212 may include one or more transceivers used to communicate, such as by communicating with one or more remote transceivers of another device capable of wireless communication (e.g., another UE or a network node in an access network).
  • Each transceiver may include a transmitter 1218 and/or a receiver 1220 appropriate to provide network communications (e.g., optical, electrical, frequency allocations, and so forth).
  • the transmitter 1218 and receiver 1220 may be coupled to one or more antennas (e.g., antenna 1222) and may share circuit components, software or firmware, or alternatively be implemented separately.
  • communication functions of the communication interface 1212 may include cellular communication, Wi-Fi communication, LPWAN communication, data communication, voice communication, multimedia communication, short-range communications such as Bluetooth, near-field communication, location-based communication such as the use of the global positioning system (GPS) to determine a location, another like communication function, or any combination thereof.
•	Communications may be implemented in accordance with one or more communication protocols and/or standards, such as IEEE 802.11, Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), GSM, LTE, New Radio (NR), UMTS, WiMax, Ethernet, transmission control protocol/internet protocol (TCP/IP), synchronous optical networking (SONET), Asynchronous Transfer Mode (ATM), QUIC, Hypertext Transfer Protocol (HTTP), and so forth.
  • a UE may provide an output of data captured by its sensors, through its communication interface 1212, via a wireless connection to a network node.
  • Data captured by sensors of a UE can be communicated through a wireless connection to a network node via another UE.
  • the output may be periodic (e.g., once every 15 minutes if it reports the sensed temperature), random (e.g., to even out the load from reporting from several sensors), in response to a triggering event (e.g., when moisture is detected an alert is sent), in response to a request (e.g., a user initiated request), or a continuous stream (e.g., a live video feed of a patient).
  • a UE comprises a mobile device, such as a phone or tablet having a depth sensor, Augmented Reality (AR) glasses, a Mixed Reality (MR) device, a Virtual Reality (VR) device, an actuator, a motor, or a switch, related to a communication interface configured to receive wireless input from a network node via a wireless connection.
  • the states of the actuator, the motor, or the switch may change.
  • the UE may comprise a motor that adjusts the control surfaces or rotors of a drone in flight according to the received input or to a robotic arm performing a medical procedure according to the received input.
  • a UE, when in the form of an Internet of Things (IoT) device, may be a device for use in one or more application domains, these domains comprising, but not limited to, city wearable technology, extended industrial application and healthcare.
  • Examples of an IoT device are a device which is, or which is embedded in: a connected refrigerator or freezer, a TV, a connected lighting device, an electricity meter, a robot vacuum cleaner, a voice controlled smart speaker, a home security camera, a motion detector, a thermostat, a smoke detector, a door/window sensor, a flood/moisture sensor, an electrical door lock, a connected doorbell, an air conditioning system like a heat pump, an autonomous vehicle, a surveillance system, a weather monitoring device, a vehicle parking monitoring device, an electric vehicle charging station, a smart watch, a fitness tracker, a head-mounted display for Augmented Reality (AR) or Virtual Reality (VR), a wearable for tactile augmentation or sensory enhancement, a water sprinkler, an animal- or item-tracking device, and so forth.
  • a UE may represent a machine or other device that performs monitoring and/or measurements, and transmits the results of such monitoring and/or measurements to another UE and/or a network node.
  • the UE may in this case be an M2M device, which may in a 3GPP context be referred to as an MTC device.
  • the UE may implement the 3GPP NB-IoT standard.
  • a UE may represent a vehicle, such as a car, a bus, a truck, a ship and an airplane, or other equipment that is capable of monitoring and/or reporting on its operational status or other functions associated with its operation.
  • any number of UEs may be used together with respect to a single use case.
  • a first UE might be or be integrated in a drone and provide the drone’s speed information (obtained through a speed sensor) to a second UE that is a remote controller operating the drone.
  • the first UE may adjust the throttle on the drone (e.g. by controlling an actuator) to increase or decrease the drone’s speed.
  • the first and/or the second UE can also include more than one of the functionalities described above.
  • a UE might comprise the sensor and the actuator, and handle communication of data for both the speed sensor and the actuators.
  • Figure 19 shows a network node 1300 in accordance with some embodiments.
  • network node refers to equipment capable, configured, arranged and/or operable to communicate directly or indirectly with a UE and/or with other network nodes or equipment, in a telecommunication network.
  • network nodes include, but are not limited to, access points (APs) (e.g., radio access points), base stations (BSs) (e.g., radio base stations, Node Bs, evolved Node Bs (eNBs) and NR NodeBs (gNBs)), O-RAN nodes or components of an O-RAN node (e.g., O-RU, O-DU, O-CU).
  • Base stations may be categorized based on the amount of coverage they provide (or, stated differently, their transmit power level) and so, depending on the provided amount of coverage, may be referred to as femto base stations, pico base stations, micro base stations, or macro base stations.
  • a base station may be a relay node or a relay donor node controlling a relay.
  • a network node may also include one or more (or all) parts of a distributed radio base station such as centralized digital units, distributed units (e.g., in an O-RAN access node) and/or remote radio units (RRUs), sometimes referred to as Remote Radio Heads (RRHs). Such remote radio units may or may not be integrated with an antenna as an antenna integrated radio.
  • Parts of a distributed radio base station may also be referred to as nodes in a distributed antenna system (DAS).
  • network nodes include multiple transmission point (multi-TRP) 5G access nodes, multi-standard radio (MSR) equipment such as MSR BSs, network controllers such as radio network controllers (RNCs) or base station controllers (BSCs), base transceiver stations (BTSs), transmission points, transmission nodes, multi-cell/multicast coordination entities (MCEs), Operation and Maintenance (O&M) nodes, Operations Support System (OSS) nodes, Self-Organizing Network (SON) nodes, positioning nodes (e.g., Evolved Serving Mobile Location Centers (E-SMLCs)), and/or Minimization of Drive Tests (MDTs).
  • the network node 1300 includes a processing circuitry 1302, a memory 1304, a communication interface 1306, and a power source 1308.
  • the network node 1300 may be composed of multiple physically separate components (e.g., a NodeB component and a RNC component, or a BTS component and a BSC component, etc.), which may each have their own respective components.
  • When the network node 1300 comprises multiple separate components (e.g., BTS and BSC components), one or more of the separate components may be shared among several network nodes.
  • a single RNC may control multiple NodeBs.
  • each unique NodeB and RNC pair may in some instances be considered a single separate network node.
  • the network node 1300 may be configured to support multiple radio access technologies (RATs).
  • some components may be duplicated (e.g., separate memory 1304 for different RATs) and some components may be reused (e.g., a same antenna 1310 may be shared by different RATs).
  • the network node 1300 may also include multiple sets of the various illustrated components for different wireless technologies integrated into network node 1300, for example GSM, WCDMA, LTE, NR, WiFi, Zigbee, Z-wave, LoRaWAN, Radio Frequency Identification (RFID) or Bluetooth wireless technologies. These wireless technologies may be integrated into the same or different chip or set of chips and other components within network node 1300.
  • the processing circuitry 1302 may comprise a combination of one or more of a microprocessor, controller, microcontroller, central processing unit, digital signal processor, application-specific integrated circuit, field programmable gate array, or any other suitable computing device, resource, or combination of hardware, software and/or encoded logic operable, either alone or in conjunction with other network node 1300 components, such as the memory 1304, to provide network node 1300 functionality.
  • the processing circuitry 1302 includes a system on a chip (SOC). In some embodiments, the processing circuitry 1302 includes one or more of radio frequency (RF) transceiver circuitry 1312 and baseband processing circuitry 1314. In some embodiments, the radio frequency (RF) transceiver circuitry 1312 and the baseband processing circuitry 1314 may be on separate chips (or sets of chips), boards, or units, such as radio units and digital units. In alternative embodiments, part or all of RF transceiver circuitry 1312 and baseband processing circuitry 1314 may be on the same chip or set of chips, boards, or units.
  • the memory 1304 may comprise any form of volatile or non-volatile computer-readable memory including, without limitation, persistent storage, solid-state memory, remotely mounted memory, magnetic media, optical media, random access memory (RAM), read-only memory (ROM), mass storage media (for example, a hard disk), removable storage media (for example, a flash drive, a Compact Disk (CD) or a Digital Video Disk (DVD)), and/or any other volatile or non-volatile, non-transitory device-readable and/or computer-executable memory devices that store information, data, and/or instructions that may be used by the processing circuitry 1302.
  • the memory 1304 may store any suitable instructions, data, or information, including a computer program, software, an application including one or more of logic, rules, code, tables, and/or other instructions capable of being executed by the processing circuitry 1302 and utilized by the network node 1300.
  • the memory 1304 may be used to store any calculations made by the processing circuitry 1302 and/or any data received via the communication interface 1306.
  • the processing circuitry 1302 and memory 1304 are integrated.
  • the communication interface 1306 is used in wired or wireless communication of signaling and/or data between a network node, access network, and/or UE. As illustrated, the communication interface 1306 comprises port(s)/terminal(s) 1316 to send and receive data, for example to and from a network over a wired connection.
  • the communication interface 1306 also includes radio front-end circuitry 1318 that may be coupled to, or in certain embodiments a part of, the antenna 1310. Radio front-end circuitry 1318 comprises filters 1320 and amplifiers 1322.
  • the radio front-end circuitry 1318 may be connected to an antenna 1310 and processing circuitry 1302.
  • the radio front-end circuitry may be configured to condition signals communicated between antenna 1310 and processing circuitry 1302.
  • the radio front-end circuitry 1318 may receive digital data that is to be sent out to other network nodes or UEs via a wireless connection.
  • the radio front-end circuitry 1318 may convert the digital data into a radio signal having the appropriate channel and bandwidth parameters using a combination of filters 1320 and/or amplifiers 1322.
  • the radio signal may then be transmitted via the antenna 1310.
  • the antenna 1310 may collect radio signals which are then converted into digital data by the radio front-end circuitry 1318.
  • the digital data may be passed to the processing circuitry 1302.
  • the communication interface may comprise different components and/or different combinations of components.
  • the network node 1300 does not include separate radio front-end circuitry 1318; instead, the processing circuitry 1302 includes radio front-end circuitry and is connected to the antenna 1310.
  • all or some of the RF transceiver circuitry 1312 is part of the communication interface 1306.
  • the communication interface 1306 includes one or more ports or terminals 1316, the radio front-end circuitry 1318, and the RF transceiver circuitry 1312, as part of a radio unit (not shown), and the communication interface 1306 communicates with the baseband processing circuitry 1314, which is part of a digital unit (not shown).
  • the antenna 1310 may include one or more antennas, or antenna arrays, configured to send and/or receive wireless signals.
  • the antenna 1310 may be coupled to the radio front-end circuitry 1318 and may be any type of antenna capable of transmitting and receiving data and/or signals wirelessly.
  • the antenna 1310 is separate from the network node 1300 and connectable to the network node 1300 through an interface or port.
  • the antenna 1310, communication interface 1306, and/or the processing circuitry 1302 may be configured to perform any receiving operations and/or certain obtaining operations described herein as being performed by the network node. Any information, data and/or signals may be received from a UE, another network node and/or any other network equipment. Similarly, the antenna 1310, the communication interface 1306, and/or the processing circuitry 1302 may be configured to perform any transmitting operations described herein as being performed by the network node. Any information, data and/or signals may be transmitted to a UE, another network node and/or any other network equipment.
  • the power source 1308 provides power to the various components of network node 1300 in a form suitable for the respective components (e.g., at a voltage and current level needed for each respective component).
  • the power source 1308 may further comprise, or be coupled to, power management circuitry to supply the components of the network node 1300 with power for performing the functionality described herein.
  • the network node 1300 may be connectable to an external power source (e.g., the power grid, an electricity outlet) via an input circuitry or interface such as an electrical cable, whereby the external power source supplies power to power circuitry of the power source 1308.
  • the power source 1308 may comprise a source of power in the form of a battery or battery pack which is connected to, or integrated in, power circuitry. The battery may provide backup power should the external power source fail.
  • Embodiments of the network node 1300 may include additional components beyond those shown in Figure 19 for providing certain aspects of the network node’s functionality, including any of the functionality described herein and/or any functionality necessary to support the subject matter described herein.
  • the network node 1300 may include user interface equipment to allow input of information into the network node 1300 and to allow output of information from the network node 1300. This may allow a user to perform diagnostic, maintenance, repair, and other administrative functions for the network node 1300.
  • Figure 20 is a block diagram of a host 1400, which may be an embodiment of the host 1116 of Figure 17, in accordance with various aspects described herein.
  • the host 1400 may be or comprise various combinations of hardware and/or software, including a standalone server, a blade server, a cloud-implemented server, a distributed server, a virtual machine, container, or processing resources in a server farm.
  • the host 1400 may provide one or more services to one or more UEs.
  • the host 1400 includes processing circuitry 1402 that is operatively coupled via a bus 1404 to an input/output interface 1406, a network interface 1408, a power source 1410, and a memory 1412.
  • Other components may be included in other embodiments. Features of these components may be substantially similar to those described with respect to the devices of previous figures, such as Figures 18 and 19, such that the descriptions thereof are generally applicable to the corresponding components of host 1400.
  • the memory 1412 may include one or more computer programs including one or more host application programs 1414 and data 1416, which may include user data, e.g., data generated by a UE for the host 1400 or data generated by the host 1400 for a UE.
  • Embodiments of the host 1400 may utilize only a subset or all of the components shown.
  • the host application programs 1414 may be implemented in a container-based architecture and may provide support for video codecs (e.g., Versatile Video Coding (VVC), High Efficiency Video Coding (HEVC), Advanced Video Coding (AVC), MPEG, VP9) and audio codecs (e.g., FLAC, Advanced Audio Coding (AAC), MPEG, G.711), including transcoding for multiple different classes, types, or implementations of UEs (e.g., handsets, desktop computers, wearable display systems, heads-up display systems).
  • the host application programs 1414 may also provide for user authentication and licensing checks and may periodically report health, routes, and content availability to a central node, such as a device in or on the edge of a core network.
  • the host 1400 may select and/or indicate a different host for over-the-top services for a UE.
  • the host application programs 1414 may support various protocols, such as the HTTP Live Streaming (HLS) protocol, Real-Time Messaging Protocol (RTMP), Real-Time Streaming Protocol (RTSP), Dynamic Adaptive Streaming over HTTP (MPEG-DASH), etc.
  • Figure 21 is a block diagram illustrating a virtualization environment 1500 in which functions implemented by some embodiments may be virtualized.
  • virtualizing means creating virtual versions of apparatuses or devices which may include virtualizing hardware platforms, storage devices and networking resources.
  • virtualization can be applied to any device described herein, or components thereof, and relates to an implementation in which at least a portion of the functionality is implemented as one or more virtual components.
  • Some or all of the functions described herein may be implemented as virtual components executed by one or more virtual machines (VMs) implemented in one or more virtual environments 1500 hosted by one or more of hardware nodes, such as a hardware computing device that operates as a network node, UE, core network node, or host.
  • the virtualization environment 1500 includes components defined by the O-RAN Alliance, such as an O-Cloud environment orchestrated by a Service Management and Orchestration Framework via an O2 interface.
  • Applications 1502 (which may alternatively be called software instances, virtual appliances, network functions, virtual nodes, virtual network functions, etc.) are run in the virtualization environment 1500 to implement some of the features, functions, and/or benefits of some of the embodiments disclosed herein.
  • Hardware 1504 includes processing circuitry, memory that stores software and/or instructions executable by hardware processing circuitry, and/or other hardware devices as described herein, such as a network interface, input/output interface, and so forth.
  • Software may be executed by the processing circuitry to instantiate one or more virtualization layers 1506 (also referred to as hypervisors or virtual machine monitors (VMMs)), provide VMs 1508A and 1508B (one or more of which may be generally referred to as VMs 1508), and/or perform any of the functions, features and/or benefits described in relation with some embodiments described herein.
  • the virtualization layer 1506 may present a virtual operating platform that appears like networking hardware to the VMs 1508.
  • the VMs 1508 comprise virtual processing, virtual memory, virtual networking or interface and virtual storage, and may be run by a corresponding virtualization layer 1506.
  • Different embodiments of the instance of a virtual appliance 1502 may be implemented on one or more of VMs 1508, and the implementations may be made in different ways.
  • Virtualization of the hardware is in some contexts referred to as network function virtualization (NFV). NFV may be used to consolidate many network equipment types onto industry standard high volume server hardware, physical switches, and physical storage, which can be located in data centers, and customer premise equipment.
  • a VM 1508 may be a software implementation of a physical machine that runs programs as if they were executing on a physical, non-virtualized machine.
  • Each of the VMs 1508, and that part of hardware 1504 that executes that VM be it hardware dedicated to that VM and/or hardware shared by that VM with others of the VMs, forms separate virtual network elements.
  • a virtual network function is responsible for handling specific network functions that run in one or more VMs 1508 on top of the hardware 1504 and corresponds to the application 1502.
  • Hardware 1504 may be implemented in a standalone network node with generic or specific components. Hardware 1504 may implement some functions via virtualization. Alternatively, hardware 1504 may be part of a larger cluster of hardware (e.g. such as in a data center or CPE) where many hardware nodes work together and are managed via management and orchestration 1510, which, among others, oversees lifecycle management of applications 1502.
  • hardware 1504 is coupled to one or more radio units that each include one or more transmitters and one or more receivers that may be coupled to one or more antennas. Radio units may communicate directly with other hardware nodes via one or more appropriate network interfaces and may be used in combination with the virtual components to provide a virtual node with radio capabilities, such as a radio access node or a base station.
  • some signaling can be provided with the use of a control system 1512 which may alternatively be used for communication between hardware nodes and radio units.
  • Figure 22 shows a communication diagram of a host 1602 communicating via a network node 1604 with a UE 1606 over a partially wireless connection in accordance with some embodiments.
  • Like host 1400, embodiments of host 1602 include hardware, such as a communication interface, processing circuitry, and memory.
  • the host 1602 also includes software, which is stored in or accessible by the host 1602 and executable by the processing circuitry.
  • the software includes a host application that may be operable to provide a service to a remote user, such as the UE 1606 connecting via an over-the-top (OTT) connection 1650 extending between the UE 1606 and host 1602.
  • the network node 1604 includes hardware enabling it to communicate with the host 1602 and UE 1606.
  • the connection 1660 may be direct or pass through a core network (like core network 1106 of Figure 17) and/or one or more other intermediate networks, such as one or more public, private, or hosted networks.
  • an intermediate network may be a backbone network or the Internet.
  • the UE 1606 includes hardware and software, which is stored in or accessible by UE 1606 and executable by the UE’s processing circuitry.
  • the software includes a client application, such as a web browser or operator-specific “app” that may be operable to provide a service to a human or non-human user via UE 1606 with the support of the host 1602.
  • an executing host application may communicate with the executing client application via the OTT connection 1650 terminating at the UE 1606 and host 1602.
  • the UE's client application may receive request data from the host's host application and provide user data in response to the request data.
  • the OTT connection 1650 may transfer both the request data and the user data.
  • the UE's client application may interact with the user to generate the user data that it provides to the host application through the OTT connection 1650.
  • the OTT connection 1650 may extend via a connection 1660 between the host 1602 and the network node 1604 and via a wireless connection 1670 between the network node 1604 and the UE 1606 to provide the connection between the host 1602 and the UE 1606.
  • the connection 1660 and wireless connection 1670, over which the OTT connection 1650 may be provided, have been drawn abstractly to illustrate the communication between the host 1602 and the UE 1606 via the network node 1604, without explicit reference to any intermediary devices and the precise routing of messages via these devices.
  • the host 1602 provides user data, which may be performed by executing a host application.
  • the user data is associated with a particular human user interacting with the UE 1606.
  • the user data is associated with a UE 1606 that shares data with the host 1602 without explicit human interaction.
  • the host 1602 initiates a transmission carrying the user data towards the UE 1606.
  • the host 1602 may initiate the transmission responsive to a request transmitted by the UE 1606.
  • the request may be caused by human interaction with the UE 1606 or by operation of the client application executing on the UE 1606.
  • the transmission may pass via the network node 1604, in accordance with the teachings of the embodiments described throughout this disclosure. Accordingly, in step 1612, the network node 1604 transmits to the UE 1606 the user data that was carried in the transmission that the host 1602 initiated, in accordance with the teachings of the embodiments described throughout this disclosure. In step 1614, the UE 1606 receives the user data carried in the transmission, which may be performed by a client application executed on the UE 1606 associated with the host application executed by the host 1602.
  • the UE 1606 executes a client application which provides user data to the host 1602.
  • the user data may be provided in reaction or response to the data received from the host 1602.
  • the UE 1606 may provide user data, which may be performed by executing the client application.
  • the client application may further consider user input received from the user via an input/output interface of the UE 1606. Regardless of the specific manner in which the user data was provided, the UE 1606 initiates, in step 1618, transmission of the user data towards the host 1602 via the network node 1604.
  • the network node 1604 receives user data from the UE 1606 and initiates transmission of the received user data towards the host 1602.
  • the host 1602 receives the user data carried in the transmission initiated by the UE 1606.
  • One or more of the various embodiments improve the performance of OTT services provided to the UE 1606 using the OTT connection 1650, in which the wireless connection 1670 forms the last segment. More precisely, the teachings of these embodiments may improve the low latency processing on sender and receiver devices, and thereby provide benefits such as low bitrate uplink transmission of depth and texture bitstreams.
  • factory status information may be collected and analyzed by the host 1602.
  • the host 1602 may process audio and video data which may have been retrieved from a UE for use in creating maps.
  • the host 1602 may collect and analyze real-time data to assist in controlling vehicle congestion (e.g., controlling traffic lights).
  • the host 1602 may store surveillance video uploaded by a UE.
  • the host 1602 may store or control access to media content such as video, audio, VR or AR which it can broadcast, multicast or unicast to UEs.
  • the host 1602 may be used for energy pricing, remote control of non-time critical electrical load to balance power generation needs, location services, presentation services (such as compiling diagrams etc. from data collected from remote devices), or any other function of collecting, retrieving, storing, analyzing and/or transmitting data.
  • a measurement procedure may be provided for the purpose of monitoring data rate, latency and other factors on which the one or more embodiments improve.
  • the measurement procedure and/or the network functionality for reconfiguring the OTT connection may be implemented in software and hardware of the host 1602 and/or UE 1606.
  • sensors (not shown) may be deployed in or in association with other devices through which the OTT connection 1650 passes; the sensors may participate in the measurement procedure by supplying values of the monitored quantities exemplified above, or supplying values of other physical quantities from which software may compute or estimate the monitored quantities.
  • the reconfiguring of the OTT connection 1650 may include message format, retransmission settings, preferred routing etc.; the reconfiguring need not directly alter the operation of the network node 1604. Such procedures and functionalities may be known and practiced in the art.
  • measurements may involve proprietary UE signaling that facilitates measurements of throughput, propagation times, latency and the like, by the host 1602.
  • the measurements may be implemented in that software causes messages to be transmitted, in particular empty or ‘dummy’ messages, using the OTT connection 1650 while monitoring propagation times, errors, etc.
  • While computing devices described herein may include the illustrated combination of hardware components, other embodiments may comprise computing devices with different combinations of components. It is to be understood that these computing devices may comprise any suitable combination of hardware and/or software needed to perform the tasks, features, functions and methods disclosed herein. Determining, calculating, obtaining or similar operations described herein may be performed by processing circuitry, which may process information by, for example, converting the obtained information into other information, comparing the obtained information or converted information to information stored in the network node, and/or performing one or more operations based on the obtained information or converted information, and as a result of said processing making a determination.
  • computing devices may comprise multiple different physical components that make up a single illustrated component, and functionality may be partitioned between separate components.
  • a communication interface may be configured to include any of the components described herein, and/or the functionality of the components may be partitioned between the processing circuitry and the communication interface.
  • non-computationally intensive functions of any of such components may be implemented in software or firmware and computationally intensive functions may be implemented in hardware.
  • Some or all of the functionality described herein may be provided by processing circuitry executing instructions stored in memory, which in certain embodiments may be a computer program product in the form of a non-transitory computer-readable storage medium.
  • some or all of the functionality may be provided by the processing circuitry without executing instructions stored on a separate or discrete device-readable storage medium, such as in a hard-wired manner.
  • the processing circuitry can be configured to perform the described functionality. The benefits provided by such functionality are not limited to the processing circuitry alone or to other components of the computing device, but are enjoyed by the computing device as a whole, and/or by end users and a wireless network generally.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A method for processing video data is provided. The video data may be used, for example, to generate volumetric video, and may include a depth image (30), a selection map (28), and optionally, a texture image (32). A sending device (50) packs (152) the depth image and the selection map into different channels (22, 24, 26) of the same picture (20), encodes (160) the picture into a bitstream (44), and sends the bitstream to a receiving device (70, 120). Upon receipt, the receiving device decodes (196) the bitstream, and extracts (202, 204) the depth image and the selection map.

Description

PACKING OF VIDEO DATA
TECHNICAL FIELD
The present disclosure relates generally to image processing techniques, and more particularly, to devices and methods configured to pack video data to send to a receiving device.
BACKGROUND
Versatile Video Coding (VVC), and its predecessors High Efficiency Video Coding (HEVC) and Advanced Video Coding (AVC), are video codecs standardized and developed jointly by the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) and the Moving Picture Experts Group (MPEG). Like many others, the VVC, HEVC, and AVC codecs are block-based. The codecs utilize both temporal and spatial prediction, where spatial prediction is achieved using intra (I) prediction from within the current picture and temporal prediction is achieved using unidirectional (P) or bi-directional inter (B) prediction on the block level from previously decoded reference pictures.
In the encoder, the difference between the original pixel data and the predicted pixel data, referred to as the “residual,” is transformed into the frequency domain, quantized, and then entropy coded before being transmitted to a decoder together with necessary prediction parameters. Such parameters include, for example, prediction mode and motion vectors, and are also entropy coded before being transmitted to the decoder. Upon receipt, the decoder performs entropy decoding, inverse quantization, and inverse transformation to obtain the residual, and then adds the residual to an intra or inter prediction to reconstruct a picture. Transforming the residual data to the frequency domain before quantization improves the efficiency of the compression and better masks artifacts for the human eye when the quantization is high. However, due to their complexity, conventional encoding/decoding techniques do not always provide an acceptable Quality of Experience (QoE) within typical latency bounds.
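As a rough, non-normative illustration of the reconstruction step just described, the sketch below (in Python with NumPy; the function name, the 10-bit sample range, and the assumption that entropy decoding, inverse quantization, and the inverse transform have already produced the residual are choices made for this example, not taken from any codec specification) adds the residual to the prediction and clips the result to the valid sample range:

```python
# Illustrative sketch only: decoder-side reconstruction of one block as
# prediction + residual, clipped to the valid sample range.
import numpy as np

def reconstruct_block(prediction, residual, bit_depth=10):
    max_val = (1 << bit_depth) - 1
    recon = prediction.astype(np.int32) + residual.astype(np.int32)
    return np.clip(recon, 0, max_val).astype(np.uint16)
```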
SUMMARY
The embodiments described herein provide a method for packing video data, such as depth images and selection maps (e.g., occupancy maps), for example, to send to a receiving device in the form of point clouds and/or meshes. Upon receipt, the receiving device processes the video data for volumetric video consumption, for example.
Volumetric video is a 3D representation of an object, such as a person (e.g., a person’s face, head, and/or upper torso), and can be described as a point cloud, a mesh, or other, similar format. Volumetric video can be generated for applications without real-time constraints and low-complexity coding demands. However, volumetric video can also be generated for other, more complex applications having more stringent requirements, such as conversational video applications. In more detail, the present embodiments provide low-complexity processing techniques for real-time conversational video applications, such as holographic video communications, immersive video applications, 3D video calls, and the like. Example uses of such applications include, but are not limited to, those used to enable communications between a consultant and a therapist (e.g., in the e-health arena), families communicating with each other in an immersive manner, social media influencers communicating with their followers, and 3D calls with a remotely located subject matter expert.
In a first aspect, the present disclosure provides a method, implemented by a sending device, for processing video data. In this aspect, the method comprises the sending device obtaining a depth image from a depth image source, wherein the depth image comprises information associated with a distance of an object in a picture from a viewpoint. The method also comprises the sending device obtaining a selection map, wherein the selection map represents the picture partitioned into a plurality of areas. Then, the method comprises the sending device packing the depth image and the selection map into a first picture having a color format with a plurality of channels, wherein the depth image and the selection map are packed in first and second channels, respectively, with at least one of the first and second channels being a color channel.
In a second aspect, the present disclosure provides a method, implemented by a receiving device, for processing video data. In this aspect, the method comprises the receiving device obtaining a first picture having a color format with a plurality of channels, wherein a depth image and a selection map are packed into first and second channels, respectively, with at least one of the first and second channels being a color channel. So obtained, the method comprises the receiving device extracting the depth image from the first channel and extracting the selection map from the second channel.
In a third aspect, the present disclosure provides a sending device for processing video data. In this aspect, the sending device is configured to obtain a depth image from a depth image source, wherein the depth image comprises information associated with a distance of an object in a picture from a viewpoint, obtain a selection map, wherein the selection map represents the picture partitioned into a plurality of areas, and pack the depth image and the selection map into a first picture having a color format with a plurality of channels, wherein the depth image and the selection map are packed in first and second channels, respectively, with at least one of the first and second channels being a color channel.
In a fourth aspect, the present disclosure provides a sending device for processing video data. In this aspect, the sending device comprises communication circuitry for communicating with a receiving device, and processing circuitry. The processing circuitry in this aspect is configured to obtain a depth image from a depth image source, wherein the depth image comprises information associated with a distance of an object in a picture from a viewpoint, obtain a selection map, wherein the selection map represents the picture partitioned into a plurality of areas, and pack the depth image and the selection map into a first picture having a color format with a plurality of channels, wherein the depth image and the selection map are packed in first and second channels, respectively, with at least one of the first and second channels being a color channel.
In a fifth aspect, the present disclosure provides a computer program comprising executable instructions that, when executed by a processing circuit in a sending device, causes the sending device to perform the method of the first aspect.
In a sixth aspect, the present disclosure provides a carrier containing a computer program of the fifth aspect, wherein the carrier is one of an electronic signal, optical signal, radio signal, or computer readable storage medium.
In a seventh aspect, the present disclosure provides a non-transitory computer-readable storage medium containing a computer program comprising executable instructions that, when executed by a processing circuit in a sending device, causes the sending device to perform the method of the first aspect.
In an eighth aspect, the present disclosure provides a receiving device for processing video data. In this aspect, the receiving device is configured to obtain a first picture having a color format with a plurality of channels, wherein a depth image and a selection map are packed into first and second channels, respectively, with at least one of the first and second channels being a color channel, extract the depth image from the first channel, and extract the selection map from the second channel.
In a ninth aspect, the present disclosure provides a receiving device for processing video data. In this aspect, the receiving device comprises communication circuitry for communicating with one or more devices via a network, and processing circuitry configured to obtain a first picture having a color format with a plurality of channels, wherein a depth image and a selection map are packed into first and second channels, respectively, with at least one of the first and second channels being a color channel, extract the depth image from the first channel, and extract the selection map from the second channel.
In a tenth aspect, the present disclosure provides a computer program comprising executable instructions that, when executed by a processing circuit in a receiving device, causes the receiving device to perform the method of the second aspect.
In an eleventh aspect, the present disclosure provides a carrier containing a computer program of the tenth aspect, wherein the carrier is one of an electronic signal, optical signal, radio signal, or computer readable storage medium.
In a twelfth aspect, the present disclosure provides a non-transitory computer-readable storage medium containing a computer program comprising executable instructions that, when executed by a processing circuit in a receiving device, causes the receiving device to perform the method of the second aspect.
In a thirteenth aspect, the present disclosure provides a communication system for processing video data. The communication system comprises a sending device and a receiving device. The sending device is configured to obtain a depth image from a depth image source, wherein the depth image comprises information associated with a distance of an object in a picture from a viewpoint, obtain a selection map, wherein the selection map represents the picture partitioned into a plurality of areas, and pack the depth image and the selection map into a first picture having a color format with a plurality of channels, wherein the depth image and the selection map are packed in first and second channels, respectively, with at least one of the first and second channels being a color channel. The receiving device is configured to obtain the first picture having the color format with the plurality of channels, wherein the depth image and the selection map are packed into the first and second channels, respectively, with at least one of the first and second channels being the color channel, extract the depth image from the first channel, and extract the selection map from the second channel.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 illustrates a mesh created using polygonal faces.
Figure 2 illustrates an example of a texture image, selection map, and depth image in separate streams, according to one embodiment of the present disclosure.
Figure 3 is a functional block diagram illustrating a system for implementing the present disclosure according to a first embodiment.
Figure 4A is a flowchart illustrating a method, implemented at a sending device, for packing and encoding video data to send to a receiving device via a network according to one embodiment of the present disclosure.
Figure 4B is a flowchart illustrating a method, implemented at a receiving device, for extracting and decoding the packed video data received from sending device according to one embodiment of the present disclosure.
Figure 5 illustrates an example of packing a selection map and a depth image into different channels of a picture according to one embodiment.
Figure 6 illustrates an example texture image and an example selection map having the same resolution, and a depth image that has a different resolution than the texture image and the selection map, according to one embodiment.
Figure 7 illustrates an example texture image having a resolution that is higher than the resolutions of both the selection map and the depth image.
Figure 8 is a functional block diagram illustrating a system for implementing the present disclosure according to another embodiment.
Figure 9 is a flow chart illustrating a method for processing video data at a sender device according to one embodiment of the present disclosure.
Figure 10 is a flow chart illustrating an example method for obtaining a selection map according to embodiments of the present disclosure.
Figure 11 is a flow chart illustrating an example method for deriving the selection map from the depth image according to one embodiment.
Figure 12 is a flow chart illustrating a method for processing video data at a receiving device according to one embodiment.
Figure 13 is a flow chart illustrating a method for processing video data at a receiving device according to another embodiment of the present disclosure.
Figure 14 is a functional block diagram illustrating some example components of a sending device configured according to the present embodiments.
Figure 15 is a functional block diagram illustrating some example components of a receiving device configured according to the present embodiments.
Figure 16 is a functional block diagram illustrating some example components of a network-based receiving device configured according to the present embodiments.
Figure 17 is a functional block diagram illustrating an example of a communication system in accordance with some embodiments.
Figure 18 is a functional block diagram illustrating an example User Equipment (UE) in accordance with some embodiments.
Figure 19 is a functional block diagram illustrating an example network node in accordance with some embodiments.
Figure 20 is a functional block diagram of a host, which may be an embodiment of the host illustrated in Figure 17, in accordance with some embodiments described herein.
Figure 21 is a functional block diagram illustrating a virtualization environment in which functions implemented by some embodiments of the present disclosure may be virtualized.
Figure 22 is a functional block diagram illustrating a communication diagram of a host communicating via a network node with a UE over a partially wireless connection in accordance with some embodiments.
DETAILED DESCRIPTION
The embodiments described herein provide a method for packing video data that may be used, for example, to generate volumetric video. In more detail, the present embodiments utilize the fact that a depth image and a selection map (e.g., occupancy map) are correlated to pack the depth image and the selection map into different channels of the same picture prior to compression.
As will be described more fully below, a sending device configured according to the present embodiments obtains a depth image and a selection map. The depth image comprises information associated with the distance of an object in a picture from a particular viewpoint (e.g., a camera that captured the picture), while the selection map represents the picture partitioned into a plurality of areas. The sending device packs the depth image and the selection map into first and second channels of a first picture, respectively, and then encodes the first picture to send to a receiving device. Upon receiving the encoded first picture, the receiving device decodes the first picture and extracts both the depth image and the selection map from the first and second channels before generating volumetric video based on the depth image and the selection map.
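Purely as a hedged sketch of this packing idea (the channel assignment, bit depths, subsampling choice, and function name below are assumptions made for illustration and are not the only packing contemplated by this disclosure), a 16-bit depth image could be rescaled into the luma channel and a binary selection map subsampled into one chroma channel of a single YUV 4:2:0 picture before the picture is handed to a conventional video encoder:

```python
# Minimal sketch (not a normative implementation): pack a 16-bit depth image
# and a binary selection map into the channels of one YUV 4:2:0 picture.
import numpy as np

def pack_depth_and_selection(depth16, selection, bit_depth=10):
    """depth16: HxW uint16 depth image; selection: HxW map with values {0, 1}."""
    h, w = depth16.shape
    max_val = (1 << bit_depth) - 1

    # Depth -> luma: rescale the 16-bit depth range to the coded bit depth.
    y = (depth16.astype(np.uint32) * max_val // 65535).astype(np.uint16)

    # Selection map -> Cb: 2x2 subsample (4:2:0) and map {0, 1} to {0, max_val}.
    sel = selection[:h - h % 2, :w - w % 2]
    sel_420 = sel.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
    cb = (sel_420 * max_val).astype(np.uint16)

    # Leave the unused chroma channel at mid-gray so it compresses cheaply.
    cr = np.full_like(cb, (max_val + 1) // 2)
    return y, cb, cr
```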
Components and color formats
A video sequence consists of a series of pictures with each picture consisting of one or more components. As is known in the art, components are sometimes referred to as “color components,” and other times as “channels.” Additionally, a picture in a video sequence is sometimes denoted “image” or “frame.” Thus, in the context of this disclosure, and unless stated otherwise, the terms picture, frame, and image are used interchangeably. Similarly, the terms “encoding” and “compressing” are used interchangeably, as are the terms “decoding” and “decompressing,” and the terms “pixel” and “sample.”
Each component in a picture can be described as a two-dimensional rectangular array of sample values. Uncompressed pictures that come from a camera and uncompressed pictures that are rendered in a display are often in red, green, blue (RGB) format. This is because the camera sensors and the sub-pixels of the pixels in displays are historically compatible with RGB format.
Video coding, on the other hand, is different. Particularly, the human eye is more sensitive to luminance than to chrominance. Therefore, a picture in a video sequence more commonly consists of three components; one luma component “Y,” where the sample values are luma values, and two chroma components “Cb” and “Cr,” where the sample values are chroma values. This is sometimes referred to as “YCbCr” color format or “YUV” color format where the chroma components are "U" and “V”. To optimize perceived quality, more bits can be spent on the luminance component than on the chrominance components. For the same reason, the dimensions of the chroma components are typically smaller than the luma components by a factor of two in each dimension. This is often referred to as the chroma components having been subsampled compared to the luma component. For example, the size of the luma component Y of an HD picture would be 1920x1080 and the chroma components Cb and Cr would each have the dimension of 960x540. A YUV format in which the chroma components have been subsampled in both the vertical and horizontal directions is often referred to as the “YUV420” format and understood to have 4:2:0 chroma subsampling.
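For concreteness, the small helper below (illustrative only; it simply restates the arithmetic of the HD example above) computes the plane dimensions and raw buffer size of a YUV 4:2:0 picture:

```python
# Illustrative helper: plane dimensions and raw buffer size for YUV 4:2:0.
def yuv420_plane_sizes(width, height, bytes_per_sample=1):
    luma = (width, height)
    chroma = (width // 2, height // 2)  # subsampled by two in each dimension
    total_samples = width * height + 2 * (width // 2) * (height // 2)
    return luma, chroma, total_samples * bytes_per_sample

# For an 8-bit HD picture: ((1920, 1080), (960, 540), 3110400)
print(yuv420_plane_sizes(1920, 1080))
```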
As stated above, components are sometimes referred to by those of ordinary skill in the art as “channels.” As such, unless stated otherwise, these two terms are used interchangeably throughout this disclosure. However, the present disclosure distinguishes “color channels” from other non-color channels. Specifically, color channels carry chroma information, such as Cb/U, Cr/V, R, G, and B. Luma channels, in contrast, do not, and thus, are not considered “color” channels in the context of this disclosure.
Blocks and Units
In many video coding standards, such as HEVC and VVC, each component is split into blocks, with each block being a two-dimensional array of samples. A coded video bitstream, therefore, consists of a series of coded blocks. It is common in video coding that the picture is split into coding units that cover a specific area of the picture. Each coding unit consists of all blocks from all components that make up that specific area and each block belongs fully to one coding unit. In HEVC and VVC the coding units are referred to as Coding Units (CUs). Other video codecs, however, refer to such coding units by another name. For example, the H.264 coding standard refers to such coding units as “macroblocks.”
Residuals, Transforms and Quantization
A “residual block” consists of samples that represent sample value differences between the sample values of the original source blocks and the sample values of the prediction blocks. The residual block is typically processed using a spatial transform. Particularly, the encoder quantizes transform coefficients according to a quantization parameter (QP), which controls the precision of the quantized coefficients. The quantized coefficients can be referred to as “residual coefficients.” High QP values result in low precision of the coefficients, and therefore, a low fidelity/quality of the residual block. Upon receiving the residual coefficients, the decoder applies inverse quantization and inverse transform to derive the residual block.
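The effect of the QP can be illustrated with a simplified uniform quantizer whose step size roughly doubles every six QP steps; this is only a model for building intuition, not the normative scaling process of HEVC, VVC, or any other codec:

```python
# Toy model of quantization/inverse quantization of transform coefficients.
import numpy as np

def quantize(coeffs, qp):
    step = 2.0 ** ((qp - 4) / 6.0)  # illustrative step-size model
    return np.round(coeffs / step).astype(np.int32)

def dequantize(levels, qp):
    step = 2.0 ** ((qp - 4) / 6.0)
    return levels * step

coeffs = np.array([100.0, -35.0, 7.0, 0.5])
levels = quantize(coeffs, qp=32)           # higher QP -> coarser levels
reconstructed = dequantize(levels, qp=32)  # lower fidelity than the original
```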
Subpictures
VVC supports subpictures. As is known in the art, a subpicture is defined as a rectangular region of one or more slices within a picture. A slice comprises multiple coding units (CUs). This means a subpicture comprises one or more slices that collectively cover a rectangular region of a picture. A subpicture can be encoded/decoded independently of other subpictures in a picture. That is, prediction from other spatially located subpictures is not used when encoding/decoding subpictures.
VVC also supports bitstream extraction and merge operations. For example, consider a situation with multiple bitstreams. With the extraction operations, one or more subpictures may be extracted from a first bitstream and one or more subpictures from a second bitstream. With merging operations, the extracted subpictures can be merged into a new third bitstream.
SEI messages
Supplementary Enhancement Information (SEI) messages are codepoints in a coded bitstream that do not influence the decoding process of coded pictures from Video Coding Layer (VCL) Network Abstraction Layer (NAL) units. SEI messages usually address issues associated with the representation/rendering of the decoded bitstream.
The VVC specifications have inherited the overall concept of SEI messages, and many of the SEI messages themselves, from the H.264 and HEVC specifications. In VVC, an SEI Raw Byte Sequence Payload (RBSP) contains one or more SEI messages. Additionally, VVC references SEI messages in the Versatile Supplemental Enhancement Information (VSEI) specification. The VVC specification comprises more generalized SEI messages, which may be referenced by future video codec specifications.
In general, SEI messages assist in processes related to decoding, display, and other purposes. However, SEI messages are not required for constructing the luma or chroma samples by the decoding process. For example, some SEI messages are required for checking bitstream conformance and for output timing decoder conformance, while other SEI messages are not required for checking bitstream conformance. Generally, a decoder is not required to support all SEI messages. When encountering an unsupported SEI message, a decoder will generally discard the message.
Video Codec Profiles
A profile in HEVC and VVC is defined as a specified subset of the syntax of the specification. More specifically, a profile defines the particular codec tools, bit depth, color format, and color subsampling that a decoder conforming to the profile should support, as well as what information is to be included in a bitstream that conforms to the profile. Practically, a codec specification may include a number of different profiles targeting different use cases, devices, and platforms, which can range from mainstream decoding in mobile devices to professional capturing and editing in high-end devices.
By way of example, HEVC version 1 comprises the Main and Main 10 profiles with support for 8- and 10-bit bit depths with 4:2:0 chroma subsampling. Later versions of HEVC include additional profiles to support higher bit depths, 4:4:4 chroma subsampling, and additional coding tools. Similarly, VVC version 1 comprises the Main 10 profile, a Main 10 Still Picture profile, a Main 4:4:4 10 profile, a Main 4:4:4 10 Still Picture profile, a Multilayer Main 10 profile, and a Multilayer Main 10 4:4:4 profile, with support for higher bit depths in later versions.
Related to profiles is the concept of “levels.” Both HEVC and VVC define a level as a set of constraints on the values assigned to the syntax elements and variables that are identified in the specification document. For example, a level specifies the maximum throughput a decoder must be able to handle (e.g., the combination of resolution and framerate that the decoder is required to decode). Generally, the same set of levels is defined for all profiles, and most aspects defined for each level are common across different profiles. However, in some cases, individual implementations may, within specified constraints, support a different level for each supported profile.
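A level-style throughput constraint can be pictured as a simple comparison of the luma sample rate against a maximum; the limit value used below is a placeholder chosen for illustration and is not quoted from the HEVC or VVC level tables:

```python
# Illustrative only: does a resolution/framerate combination fit within a
# given maximum luma sample rate? The limit is a placeholder parameter.
def fits_level(width, height, fps, max_luma_sample_rate):
    return width * height * fps <= max_luma_sample_rate

# 1920x1080 at 60 fps requires 124,416,000 luma samples per second.
print(fits_level(1920, 1080, 60, max_luma_sample_rate=130_000_000))
```

With the placeholder limit above the check passes; a lower limit would force a reduced resolution or framerate.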
It should be noted here that the terms “mainstream” video codecs and profiles are used throughout the disclosure. In the context of the present disclosure, these terms simply mean video codecs and profiles that are widely adopted in the market, or will be widely adopted in the market, and are supported (e.g., in hardware) by relevant devices, such as mobile devices, Head Mounted Devices (HMDs), computers, and TV sets, for example. According to this definition, such profiles may include, but are not limited to, the following video codec profiles:
• AVC constrained baseline profile;
• AVC baseline profile;
• AVC Extended profile;
• AVC Main profile;
• AVC High profile;
• HEVC Main profile;
• HEVC Main 10 profile;
• HEVC Main 12 profile; and
• VVC Main 10 profile.
Depth image
A depth image is an image that contains information relating to the distance of the surfaces of the objects in a scene from a viewpoint of the camera (i.e., “depth”). Depth images are also sometimes referred to as “depth maps,” “depth pictures,” “z-buffer,” and “z-depth.” Each sample value of a depth image corresponds to the depth of that sample. Normally, only one component is used to represent the depth. A higher value (lighter luminance) in the depth image corresponds to a point that is close to the camera, and a lower value (darker luminance) corresponds to a point that is farther away from the camera. In some applications, however, the opposite notation is used. That is, a low value in a depth image means that a given point in the image is close to the camera, while a high value in the depth image means that the point is farther away from the camera.
Depth images may be used for many different applications including, but not limited to, 3D object reconstruction and video conferencing (holographic communication). Depth images can correspond to certain parts of the head (e.g., a person’s face) and/or body (e.g., a person’s upper body/torso) that can represent a participant in a video conferencing session, and further, may be computer generated or captured by a depth image camera. Such depth image cameras may, for example, be a stand-alone camera or a camera that is configured to produce both texture and depth images as output, such as a Red Green Blue Depth (RGBD) camera, for example. Depth images can also be acquired, for example, by Lidar sensors (e.g., such as those used with Microsoft Kinect® and Intel Real-Sense®), true-depth sensors (e.g., iPhones®), or other devices.
Point clouds and V-PCC
A point cloud is a discrete set of data points in space, in which the points may represent a 3D shape or object. A point cloud is represented by vertices and a set of attributes (e.g., color).
MPEG has developed and standardized video-based point cloud compression (V-PCC) in MPEG-I part 5 (i.e., ISO/IEC 23090-5:2021), which is incorporated herein by reference in its entirety. In V-PCC, 3D patches of point clouds are projected onto 2D patches, which are then packed into frames for three types of images: texture images, geometry maps, and occupancy maps. The geometry map stores the missing coordinate of each projected point, i.e., the distance between the point's 3D position and the projection surface of the 3D bounding box used in the 3D-to-2D projection. The occupancy map is used to specify which samples in the geometry map are used for the 3D reconstruction and which are not.
The texture images, geometry maps, and occupancy maps are coded as sub-bitstreams using HEVC. These sub-bitstreams are then multiplexed into a V-PCC bitstream together with an atlas metadata sub-bitstream that comprises information detailing how to reconstruct the point cloud from the 2D patches.
Meshes
A 3D object or person may also be represented using polygon meshes. Such meshes describe a 3D object or person in terms of connected polygons, which are typically triangles. Objects created with polygon meshes store different types of elements, including vertices, edges, faces, and surfaces. A vertex is a position in space and includes other information such as color and a normal vector. An edge is a connection between two vertices, and a face is a closed set of vertices (e.g., a triangle face has three vertices). Surfaces are not required for meshes but may be used to group smooth regions of the mesh. An example of a mesh 10 created using triangular faces is seen in Figure 1.
Meshes may, for example, be generated from point clouds, depth images, and/or texture images. However, when generating a mesh from camera captured content, knowledge of intrinsic and extrinsic camera parameters used in capturing an image may be necessary to properly place the mesh in 3D space. Intrinsic camera parameters specify the camera image format including focal length, image resolution, and camera principal point, while extrinsic camera parameters define the camera pose with position and orientation.
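By way of a purely illustrative, non-limiting sketch (not part of the original disclosure), the following Python code shows how a depth image could be back-projected to 3D points using intrinsic camera parameters (focal lengths and principal point) and then placed in world space using extrinsic parameters (rotation and translation). The pinhole camera model and all names are assumptions made for illustration only.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy, R=np.eye(3), t=np.zeros(3)):
    """depth: HxW array of metric depth values; fx, fy, cx, cy: intrinsics;
    R, t: extrinsic rotation/translation placing the camera in world space."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx          # unproject using the pinhole camera model
    y = (v - cy) * z / fy
    pts_cam = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts_cam @ R.T + t       # move the points from camera to world space
```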
Holographic communication
Work is currently being performed on holographic communication solutions to enable 3D calling between two or more participants. Current solutions define an end-to-end pipeline, which includes capturing RGB and depth images (e.g., using mobile devices or standalone cameras), converting the resulting stream into a point cloud or mesh, and then transmitting the point cloud or mesh to one or more end-user devices, such as AR glasses, for example. In some cases, depth compression is realized by the device that captures the images (e.g., a camera). The compressed RGB and depth streams are then transmitted to a server (e.g., a cloud or edge server) where the point cloud or mesh is created (e.g., the server creates a hologram that can be consumed by an end-user device). In some cases, the depth images are compressed at a low bitrate, which can be essential in reducing uplink bitrate requirements.
Some work has previously been done concerning the use of mainstream 2D video codecs to efficiently compress volumetric video, including depth images and occupancy maps, for holographic communication. Like for V-PCC, such work signals depth images, occupancy maps, and texture images in separate streams or sub-bitstreams of a picture. An example of this work is illustrated in Figure 2, which illustrates an example picture 20 having a YUV format (also known as “YCbCr” format). Particularly, picture 20 has a plurality of channels. These are a luma or “Y” channel 22, and two chroma or “Cb” and “Cr” channels 24, 26, respectively. In Figure 2, the selection map 28 and the depth image 30 use only the luma channel 22 of the YCbCr/YUV format of picture 20.
V-PCC and other work signal depth images and occupancy maps in separate streams or sub-streams. However, such approaches can be problematic. For example, such signaling requires the use of multiple encoding/decoding instances and/or the use of multilayer codecs. Multilayer codecs, however, have not been widely adopted for use in the market and, therefore, cannot be considered mainstream codecs. Further, many devices are only configured to support a single hardware-accelerated encoding/decoding instance at a time. Therefore, other video streams are left to be encoded/decoded using best-effort software, which undesirably increases battery consumption. Moreover, synchronization becomes problematic when multiple streams are involved, and the compression process does not exploit the redundancy between the depth image and the occupancy map.
Accordingly, the present embodiments exploit the fact that the depth image and selection map (e.g., the occupancy map) are correlated by packing the depth image and the selection map in different channels of the same picture prior to compression. This may be done, for example, for a picture using a YCbCr/YUV or RGB color format, which may advantageously be compressed using a conventional mainstream video codec and video coding profile. In one embodiment of the present disclosure, only the depth image and the selection map are packed into the same picture. In other embodiments, however, a texture image is also packed spatially with both the depth image and the selection map in the same picture.
Embodiments of the present disclosure provide benefits and advantages that conventional methods and techniques cannot or do not provide. For example, by using a deployed and supported mainstream codec, and by packing the depth image and selection map (and possibly the texture image) in a color format supported by a mainstream profile of the codec, existing hardware-acceleration implementations of the mainstream codec can be used for each of the depth image, the selection map, and, when included, the texture image. Not only does this speed up encoding/decoding, but it also conserves energy and battery life, and it accelerates scheduling of the encoding and decoding processes for time-synchronized throughput and output while maintaining high resolution.
Additionally, slim devices can typically only support hardware acceleration of one encoder/decoder instance at a time. Therefore, instead of using multiple streams, including the depth image, the selection map, and the texture image in the same picture allows for full hardware acceleration of encoding and/or decoding the video data. Synchronization between the depth image, the selection map, and the texture image is also made much easier since all of the data is in the same picture.
In another advantage, packing the depth image and the selection map in different color channels of the picture allows for the use of a higher resolution in the depth image, the selection map, and/or the texture image than is otherwise possible with a given codec profile. Packing the depth image and selection map in different color channels of the picture also allows for utilizing prediction between the selection map and the depth image. For instance, VVC provides tools for cross-component prediction, which, in turn, reduces the bitrate.
Additionally, embodiments of the present disclosure use a deployed and supported mainstream video codec for the described implementation, thereby accelerating the adoption of holographic communication use-cases on the market. Other advantages include, but are not limited to, the reduction of transport overhead as the texture images, depth images, and selection maps can be packed and compressed in a single bitstream.
Accordingly, embodiments of the present disclosure utilize a conventional mainstream codec for compressing a depth image and a selection map in the same picture while maintaining the high resolution for both. To accomplish this, the present embodiments pack the depth image and the selection map into different channels of the same picture prior to compression. In one embodiment, the picture is, for instance, in the YCbCr/YUV or RGB color format.
Figure 3 is a functional block diagram illustrating a system 40 for implementing the present disclosure according to a first embodiment. As seen in Figure 3, system 40 comprises a communication network 42 communicatively interconnecting a sending device 50 and a receiving device 70. In different embodiments, the communication network 42 may comprise any number of wired or wireless networks, network nodes, base stations, controllers, wireless devices, relay stations, and/or any other components or systems that may facilitate or participate in the communication of video data and/or signals between the sending device 50 and the receiving device 70, whether via wired or wireless connections.
According to some embodiments of the present disclosure, sending device 50 pre- processes and encodes/compresses input video data for volumetric video into one or more bitstreams 44 and sends the bitstreams 44 to receiving device 70 via communication network 42. Upon receipt, receiving device 70 decompresses the one or more bitstreams 44 into decoded video data, which is then post-processed. According to various embodiments, the decoded video data may be used for a variety of purposes including, but not limited to, generating a volumetric video stream in which the volumetric video may comprise a mesh or a point cloud.
In one embodiment of the present disclosure, the input video data received/obtained by sending device 50 comprises a video stream of one or more texture images and a video stream of one or more depth images. The decoded video data received by receiving device 70 comprises one or more decoded texture images, one or more decoded depth images, and one or more decoded occupancy maps. Those of ordinary skill in the art should appreciate, however, that the present embodiments are not limited to the receipt of a single video stream. In another embodiment, the input video data comprises multiple video streams of texture and/or multiple video streams of depth images. Similarly, the decoded video data may, according to the present disclosure, comprise multiple video streams of texture, and/or multiple video streams of depth images, and/or multiple video streams of occupancy maps.
Sending device 50 comprises various components described in more detail below. These components work together to pre-process the video data and convey that video data to the receiving device 70. As seen in the embodiment of Figure 3, these components comprise a video source 52, a depth source 54, pre-processing circuitry 56, one or more encoders 58, and a transmitter 60. The sending device 50 may, for instance, be a mobile device, a computer, a head mounted display (HMD), such as Virtual Reality (VR), Mixed Reality (MR), or Augmented Reality (AR) glasses, another type of wearable device that comprises camera sensors, or the like. Additionally, or alternatively, sending device 50 may be implemented in another form including, but not limited to, a native operating system (e.g., an iOS or Android device), device middleware, and a web browser. Furthermore, in different embodiments of the present disclosure, components of the sending device 50 and/or the receiving device 70 may be implemented in separate physical devices linked by wired or wireless data links. For example, the video source 52, the depth source 54, and the pre-processor 56 may be implemented in a first device having a wired or wireless data link to one or more separate devices implementing the one or more encoders 58 and/or the transmitter 60. In another embodiment, the video source 52, the depth source 54, the pre-processor 56, and the one or more encoders 58 may be implemented in the same physical device, and an encoded bitstream may be output by the one or more encoders 58 for storage or sending by the transmitter 60 implemented in a separate physical device. Other implementations and distributions of functionality between physical devices, as would be apparent to a skilled person in the relevant art, may be selected as appropriate to a particular application.
Video source 52 and depth source 54, respectively, provide the stream of texture images and the stream of depth images. Although not limiting, the video source 52, in one embodiment, is a video camera configured to provide a stream of texture images in a particular format (e.g., RGB format). In another embodiment, video source 52 is a file that comprises the stream of texture images and is stored in memory (e.g., memory that is part of, or is at least accessible to) sending device 50. Similarly, in one embodiment, depth source 54 is a depth camera configured to provide a stream of depth images to the sending device 50. In another embodiment, depth source 54 is a file comprising the stream(s) of depth images. In this latter embodiment, the file comprising the stream of texture images is stored in memory that is part of, or accessible to, sending device 50. It should be noted that the present embodiments do not require the video source 52 and the depth source 54 to be separate entities. Rather, in at least one embodiment, video source 52 and depth source 54 comprise a single entity, such as a camera that provides both the texture and depth video in RGBD format, for example, or a single file comprising both the texture and depth video data.
The pre-processing circuitry 56 is configured to perform a variety of pre-processing functions on the video data; however, according to the present disclosure, the pre-processing functions comprise one or more of:
• determining foreground and background sections of a scene from the depth image;
• creating an occupancy map based on the depth image;
• performing hole filling functions, if necessary;
• removing the background section in the texture image and/or the depth image;
• padding the foreground of the texture image and/or the depth image;
• converting the depth image to a lower bit depth;
• converting the depth image to a lower resolution; and
• converting the color format for the texture image, the depth image, and/or the occupancy map.
The one or more encoder(s) 58 are encoder instances configured to compress the texture images, the depth images, and the occupancy maps into one or more bitstreams. According to the present disclosure, encoders 58 advantageously comprise what those of ordinary skill in the art would consider to be “mainstream” video codecs that are widely adopted for use in the market and are supported by relevant devices. Examples of mainstream encoders 58 include, but are not limited to, those that operate according to well-known standards such as VVC, AVC, and HEVC.
Transmitter 60 comprises circuitry for transmitting the one or more bitstreams generated by the sending device 50 to receiving device 70. In at least one embodiment, transmitter 60 is further configured to convey metadata associated with the texture and depth images to receiving device 70. As described in more detail later, receiving device 70 uses the metadata along with the video data received from sending device 50 to generate the volumetric video (e.g., a mesh or point cloud).
The receiving device 70, in this embodiment, may be a mobile device, a computer, an HMD such as VR, MR, or AR glasses, or a network entity (e.g., a cloud server or edge computing server) and comprises a receiver 72, one or more decoders 74, post-processing circuitry 76, and, optionally, a display device 78. The receiver 72 comprises circuitry configured to receive the one or more bitstreams from sending device 50 via network 42. In some embodiments, receiver 72 is also configured to receive the metadata from sending device 50 in cases where the metadata is not conveyed by other means. Decoder(s) 74 comprise one or more decoder instances configured to decode the one or more bitstreams received from sending device 50 into decoded texture images, decoded depth images, and decoded occupancy maps. Similar to encoders 58, decoders 74 comprise mainstream video codecs, such as those that operate according to well-known standards such as VVC, AVC, and HEVC.
The post-processor 76 is configured to perform various post-processing functions on the decoded texture images, decoded depth images, and decoded occupancy maps. As above, such post-processing functions may be any functions needed or desired; however, according to the present disclosure, the post-processing functions in one or more embodiments may comprise one or more of:
• filtering the depth image based on the occupancy map;
• converting the depth image to a higher bit depth;
• converting a color format for the texture image, the depth image, and/or the occupancy map; and
• generating volumetric video from the post-processed texture images and depth images. Such volumetric video may comprise, for instance, a mesh or point cloud.
Figure 4A is a flowchart illustrating a method 80 for processing video data for sending. The processing of video data comprises, in a first part, obtaining at least a depth image and a selection map and packing the obtained depth image and selection map into a first picture. The first picture has a color format with a plurality of channels such that the depth image and selection map are packed into separate channels (e.g., first and second channels of the plurality of channels) of the same picture. In a second part, the first picture is encoded for sending in a bitstream to a receiving device 70 via network 42. According to one embodiment of the present disclosure, the first and second parts of the processing of the video data are performed by a sending device 50 comprising a single physical device or one or more linked components, as discussed above. The sending device 50 may, for instance, be a mobile device, a computer, a head mounted display (HMD) such as Virtual Reality (VR) glasses, Mixed Reality (MR) glasses, Augmented Reality (AR) glasses, or some other type of wearable device that comprises, for example, camera sensors or the like, with all components integrated into the same physical device or distributed among two or more separate but linked or linkable devices.
As seen in Figure 4A, the sending device first obtains a depth image (box 82), a selection map (box 84), and optionally, a texture image (box 86). Once the sending device has obtained this data, the depth image and the selection map are both packed into a first picture having a color format comprising a plurality of channels (box 88). In embodiments where the sending device obtained the texture image, the sending device spatially packs the texture image, the depth image, and the selection map into the same first picture (box 90). Then, using a mainstream encoder, the sending device encodes the first picture into the bitstream (box 92). As will be described later in more detail, the sending device may, in some embodiments, optionally encode information about the packing into the bitstream (box 94) before the bitstream is sent to the receiving device (box 96).
Figure 4B is a flowchart illustrating a method 100, implemented at a receiving device 70, for extracting and decoding the packed video data received from sending device 50 according to one embodiment of the present disclosure. The receiving device 70 may be any device equipped with a “mainstream” decoder. However, in this embodiment, receiving device 70 may be a mobile device, a computer, a head mounted display (HMD) such as Virtual Reality (VR) glasses, Mixed Reality (MR) glasses, Augmented Reality (AR) glasses, some other type of wearable device that comprises, for example, camera sensors or the like, or a network node, such as a cloud server or edge computing server, for example.
In Figure 4B, the receiving device first receives a bitstream from the sending device (box 102) and decodes a picture that was encoded into the bitstream (box 104). In embodiments where the received bitstream includes information about the packing of the depth image, the selection map, and the optional texture image into the picture, the receiving device 70 decodes that information (box 106). Once decoded, receiving device 70 extracts the depth image and the selection map from respective channels in the picture (box 108), and, if included, the texture image (box 110). Receiving device 70 then generates a volumetric video using the extracted depth image and selection map (and, if included, the texture image) (box 112) and renders the volumetric video to a display device for a user (box 114).
Obtaining texture, depth image and selection map
The texture image and depth image may be obtained from a video source and a depth source, respectively, which, as described above, may be an RGB camera and a depth camera located on a mobile device, for example. Alternatively, one or both of the texture image and the depth image may be obtained from one or more files in a storage device. In at least one embodiment of the present disclosure, the texture image and the depth image are obtained from a single camera that is configured to provide both the texture image and the depth image as video in RGBD format, for example, or from a file that comprises both the texture image and the depth image as video data.
In general, the texture image, the depth image, and the selection map represent the same content. However, their resolutions may be the same or different. By way of example only, the resolution of the depth image, in one embodiment, is smaller than the resolution of the texture image. In another embodiment, however, the resolutions of the depth image and the texture image are the same. Regardless, as those of ordinary skill in the art will readily appreciate, there are existing devices that are configured to obtain both the texture and depth images at their respective resolutions. Examples of such devices include, but are not limited to, smartphones, tablet computers, and/or standalone camera devices.
The selection map, in the context of the present disclosure, is a representation of a picture divided into two or more areas. A selection map may, for instance, be an occupancy map that indicates or describes what part(s) of an image is/are selected for output and/or further processing by a decoder. In one embodiment, the selection map is a binary map in which each sample in the map can have or be assigned only one of two possible values. For example, a first value (e.g., 1) may be used to indicate that a particular sample is within a selected area, while a second value (e.g., 0) may be used to indicate that a sample is not within the selected area. Additionally, the binary map may be stored with only a single bit per sample, or with more than one bit (e.g., 8 bits) per sample. Even in this latter case, however, a binary selection map according to the present disclosure would still only allow each sample to have one of two different possible values.
The selection map may be obtained from various places. For example, in one embodiment, the selection map is obtained from a specific sensor. In another embodiment, the selection map is obtained from a file in a storage device. In yet another embodiment, the selection map is derived from the depth image. To derive the selection map from the depth image, the sending device may first divide the depth image into at least two areas based on the values of the depth image. The sending device may then assign a value to each sample in the selection map. The assigned value, which would be based on the values in the depth image, would correspond to the area in the depth image to which it belongs.
For example, consider a depth image divided into two areas - a foreground area and a background area. Objects in the foreground area (e.g., a subject’s head and/or torso) would be closest in distance to the camera that captured the image, while the objects in the background area would be farther away from the camera. In one embodiment of the present disclosure, the samples in the foreground area could be assigned a first value (e.g., ‘1’), while the samples in the background area of the depth image could be assigned a second value (e.g., ‘0’), thereby differentiating the samples in the foreground area from those in the background area.
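As a hedged illustration of the foreground/background case described above (not taken from the original disclosure), the following Python sketch derives a binary selection map from a depth image using a single threshold. The convention that higher depth values are closer to the camera, the threshold value, and the image size are all assumptions made for the example.

```python
import numpy as np

def derive_selection_map(depth, threshold):
    """Return a binary selection (occupancy) map derived from a depth image.
    Samples whose depth value exceeds the threshold are treated as
    foreground (value 1); all other samples are background (value 0)."""
    return (depth > threshold).astype(np.uint8)

# Example: 8-bit depth image where lighter (higher) values are closer
depth = np.random.randint(0, 256, size=(480, 640), dtype=np.uint8)
occupancy = derive_selection_map(depth, threshold=64)
```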
Alternatively, a selection map may comprise a multi-threshold map configured to allow for more than one selected area. For instance, consider a multi-threshold selection map divided into four or more areas. A map sample in the multi-threshold selection map having a value of 1 might indicate that the sample belongs to a first selection. Similarly, values of 2 and 3 could indicate that the corresponding sample belongs to a second or third selection, respectively. A value of 0, however, could be used to indicate that the corresponding map sample does not belong to any of the selected areas of the depth map.
In at least one embodiment, the values of the samples in the selection map are spread throughout a range of possible depth values. By way of example only, consider the depth image above divided into three selectable sections or areas (i.e., a first section, a second section, and a third section). In situations where 8 bits are used for each sample in the selection map, the values used to indicate the different selection states (i.e., no selection area, 1st selection area, 2nd selection area, and 3rd selection area) could be evenly distributed throughout a range of 2^8 = 256 values (i.e., [0..255]). Therefore, the values assigned to the samples in the selection map could be one of 0, 85, 170, and 255, depending on the area to which the samples belong. In some embodiments, each sample value in the selection map can be rounded to the nearest of these representation values after compression. For example, consider a decoded sample value of 94; the value could be rounded to 85 and determined to belong to the first selection area.
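The value spreading and rounding described in this paragraph can be sketched in Python as follows. The representation values 0, 85, 170, and 255 come from the text; the function names and the rounding-by-nearest-value approach are illustrative assumptions.

```python
import numpy as np

# Representation values for "no selection" plus three selection areas,
# spread evenly over the 8-bit range as described above.
LEVELS = np.array([0, 85, 170, 255], dtype=np.int32)

def encode_selection(area_index):
    """Map an area index (0..3) to its spread-out 8-bit representation value."""
    return int(LEVELS[area_index])

def decode_selection(sample_value):
    """Round a (possibly distorted) decoded sample to the nearest
    representation value and return the corresponding area index."""
    return int(np.argmin(np.abs(LEVELS - int(sample_value))))

assert decode_selection(94) == 1   # 94 rounds to 85 -> first selection area
```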
Packing/Extracting the Depth Image and the Selection Map
According to the present disclosure, the color format has at least two channels. In one embodiment, however, the first picture has exactly three channels. Regardless of the number of channels, though, one embodiment of the present disclosure packs the depth image and the selection map into first and second channels, respectively. In such embodiments, the first and second channels are different channels.
Additionally, according to some embodiments of the present disclosure, the color format can be either an RGB color format or a YUV/YCbCr color format. In such cases, at least one of the first and second channels is a color channel. For example, at least one of the first and second channels may be a chroma channel (e.g., a U/Cb or V/Cr channel) of a YUV/YCbCr color format. In another example, at least one of the first and second channels may be an R, G, or B channel of an RGB color format. Regardless, though, the channels may or may not have the same resolution. For example, consider an embodiment in which color subsampling is used (e.g., 4:2:0 subsampling in a YUV/YCbCr color format). In such embodiments, each color channel may have half of the total resolution in both the vertical and horizontal directions. In other embodiments, however, the channels have the same resolution (e.g., as for 4:4:4 subsampling or RGB888).
Regardless of the specific channels, however, the channel into which the selection map is packed (e.g., the second channel) can precede or follow the channel into which the depth image is packed (e.g., the first channel) in stream and/or processing order. For example, the selection map may be packed into the luma channel of a YCbCr/YUV color format while the depth image is packed into the Cb/U channel of the YCbCr/YUV color format. Additionally, the spatial positions of the depth image and selection map may be collocated such that a spatial position in the depth image corresponds to the same spatial position in the selection map.
Packing the selection map and/or the depth image into a color channel, rather than spatially packing them side-by-side, is advantageous. One advantage is that it enables the present embodiments to achieve a higher overall resolution given a certain resolution of the picture. The maximum possible picture resolution is determined by the allowed codec level.
Additionally, the present embodiments utilize cross-component coding tools to compress the depth image and/or the selection map. In these cases, the present embodiments take advantage of the fact that the selection map and the depth image are collocated in different channels of a picture. Thus, the cross-component coding tools of the codec can exploit the redundancies between the channels and be used to predict between the selection map and the depth image to improve compression efficiency. For instance, VVC includes the cross-component adaptive loop filter (CC-ALF) and the cross-component linear model (CCLM) coding tools.
Figure 5 illustrates an example of packing the selection map 28 and the depth image 30 into different channels of picture 20 according to one embodiment. In this example, a YCbCr/YUV color format with 4:2:0 color subsampling has been used. Thus, the Cb and Cr channels 24, 26 have half the vertical and horizontal resolution compared to the luma channel 22. As seen in Figure 5, the selection map 28 is packed into the luma channel 22, while the depth image 30 is packed into the Cb channel 24.
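A minimal sketch of the packing shown in Figure 5 might look as follows, assuming 8-bit planar 4:2:0 storage represented with numpy arrays. The helper name, the spreading of a binary map to {0, 255}, and the choice of a neutral Cr fill value are assumptions made for illustration and are not part of the disclosure.

```python
import numpy as np

def pack_yuv420(selection_map, depth_image, neutral=128):
    """Pack a binary selection map into the luma plane and a half-resolution
    depth image into the Cb plane of an 8-bit 4:2:0 picture, as in Figure 5.
    The binary map is spread to {0, 255} for robustness to lossy coding and
    the unused Cr plane is filled with a neutral value."""
    h, w = selection_map.shape
    assert depth_image.shape == (h // 2, w // 2), "depth must match the chroma resolution"
    y_plane = np.where(selection_map > 0, 255, 0).astype(np.uint8)
    cb_plane = depth_image.astype(np.uint8)
    cr_plane = np.full((h // 2, w // 2), neutral, dtype=np.uint8)
    return y_plane, cb_plane, cr_plane
```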
As stated above, the color format may have three channels. In these cases, some parts of the selection map and/or the depth image may be packed into the third channel. This would be in addition to packing other parts of the selection map and the depth image into the first and second channels. By way of example, one embodiment of the present disclosure configures a sending device to pack an additional attribute of the content into the third channel (e.g., a second color channel in the YCbCr/YUV color format). The additional attribute may be, for instance, a different type of selection map, an occlusion map indicating objects that are occluded in picture 20, an alpha channel indicating a transparency of a scene, a map associated with ambient light, and/or reflection or other material properties of a 3D object that is in the picture or scene.
At the receiving device, the first picture is decoded from a bitstream. The depth image and the selection map are then extracted from the first and second channels of the first picture. As stated above, the first and second channels are different channels in at least one embodiment. Additionally, in one or more embodiments, at least one of the first and second channels is a color channel, and the two channels may or may not have the same resolution. According to the present disclosure, the depth image and selection map may correspond to the same content. Further, the spatial positions of both the depth image and the selection map may be collocated such that a spatial position in the depth image would correspond to the same spatial position of the selection map.
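On the receiving side, a corresponding extraction step could be sketched as below, again as an assumption-laden illustration: the selection map is re-binarized after lossy decoding and then used to filter the depth image, in line with the post-processing functions listed earlier. The threshold and downsampling method are illustrative choices.

```python
import numpy as np

def extract_and_filter(y_plane, cb_plane, threshold=128):
    """Recover the selection map from the luma plane and the depth image from
    the Cb plane of a decoded 4:2:0 picture, then keep depth samples only in
    the selected (occupied) area."""
    selection = (y_plane >= threshold).astype(np.uint8)  # re-binarize after lossy coding
    selection_chroma = selection[::2, ::2]               # nearest-neighbour downsample to chroma size
    depth = cb_plane * selection_chroma                  # zero out unoccupied depth samples
    return selection, depth
```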
Spatially packing the texture image with the depth image and the selection map
Obtaining a texture image, as stated above, is optional. However, when the sending device does obtain a texture image, the present embodiments spatially pack that texture image into the first picture along with the depth image and selection map. Figure 6, for example, illustrates an embodiment in which a texture image 32 and selection map 28 have the same resolution (e.g., half of the full resolution), while the depth image 30 has a quarter resolution.
It should be noted here that the present embodiments are not limited to any particular resolution ratio. Rather, according to the present disclosure, other ratios for the resolutions may be used. Figure 7, for example, illustrates an embodiment in which the resolution of the texture image 32 is higher than the resolutions of both the selection map 28 and the depth image 30. In some embodiments of the present disclosure, the original aspect ratios of the texture image, the selection map, and/or the depth image may not necessarily be maintained once they are packed into the first picture. By way of example only, the vertical resolution of one of the images or the selection map may be decreased by a factor of 2, while the horizontal resolution is decreased by a factor of 4. Further, in cases where the texture image, the depth image, and/or the selection map do not have the same bit depth, the texture image, the depth image, and the selection map would have to be aligned prior to being packed into the same picture. This may be accomplished, for example, by converting the texture image, the depth image, and the selection map to a predetermined number of bits (e.g., 10 bits).
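The bit-depth alignment mentioned above could, for instance, be sketched as follows. This is an assumption-based example: the 8-bit texture, 16-bit depth, 10-bit target, and shift-based conversion are illustrative choices, not requirements of the disclosure.

```python
import numpy as np

def align_bit_depth(img, src_bits, dst_bits=10):
    """Convert an image from src_bits to dst_bits per sample by shifting,
    so that all packed components share the same bit depth."""
    img = img.astype(np.uint16)
    if dst_bits >= src_bits:
        return img << (dst_bits - src_bits)
    return img >> (src_bits - dst_bits)

# e.g., an 8-bit texture image and a 16-bit depth image aligned to 10 bits
texture8 = np.random.randint(0, 2**8, size=(720, 1280), dtype=np.uint16)
depth16 = np.random.randint(0, 2**16, size=(360, 640), dtype=np.uint16)
texture10 = align_bit_depth(texture8, src_bits=8)    # values now in [0, 1020]
depth10 = align_bit_depth(depth16, src_bits=16)      # values now in [0, 1023]
```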
One advantage of this approach is that each of the texture image, the depth image, and the selection map can be packed into the same picture. This picture can then be compressed using a mainstream video codec with hardware acceleration, even on devices that only support hardware acceleration of a single encoding/decoding instance at a time. This provides a benefit over conventional arrangements, since many devices typically only support hardware encoding/decoding of a single video stream at a time; additional parallel video streams are then left to what is commonly known as “best-effort” software encoding/decoding, which greatly decreases battery life and makes scheduling more difficult and complex.
Another advantage is that the present embodiments are configured to compress all three of the texture image, the depth image, and the selection map into one stream. This beneficially avoids synchronization issues due to jitter on the receiving device. Additionally, in one embodiment of the present disclosure, the selection map and the depth image are packed to the “left” of, or “above,” the texture image (i.e., opposite to the packing illustrated in Figure 7). This enables the selection map and depth image to be decoded and post-processed earlier than they would be in the embodiment of Figure 6. One advantage of this earlier decoding and post-processing is that the overall latency may be reduced.
As described later in more detail, decoding and extracting the parts of a picture may be realized using subpictures in VVC, for example. Subpictures are part of the VVC Main 10 profile. In this case, the selection map and the depth image could be packed into separate components of a first subpicture, while the texture image could be packed into a second subpicture. In some cases, the first subpicture is positioned to the left of or on top of the second subpicture such that the first subpicture precedes the second subpicture in the bitstream.
As previously described, the sending device may, in some embodiments, obtain a texture image to send to the receiving device in addition to the depth image and the selection map. Particularly, the texture image would be spatially packed with the depth image and selection map into the first picture. In such cases, the receiving device would decode the first picture before extracting the texture image from the first picture. As stated above, the texture image, the depth image, and selection map may correspond to the same content. Further, the spatial positions of the depth image, the selection map, and/or the texture image may be collocated such that a spatial position in the depth image would correspond to the same spatial position of the selection map and/or the texture image.
Encoding/decoding the packed picture
According to the present disclosure, the sending device is beneficially configured to encode the first picture to a bitstream using an encoder that conforms to a mainstream profile of a mainstream codec. The receiving device, on the other hand, is configured to decode the first picture from the bitstream using a decoder that conforms to a mainstream profile of a mainstream codec.
In more detail, one embodiment of the present disclosure configures the sending device to encode the first picture into the bitstream in conformance with a mainstream profile of a mainstream codec. By way of example only, the first picture may be part of a video stream. In these cases, the mainstream codec may be a video codec. In another embodiment, the first picture may be a still image. In such cases, the mainstream codec could be a still image codec. Regardless of the particular codec, however, packing the selection map and the depth image in different channels (e.g., the first and second channels previously described) advantageously enables cross-component prediction. That is, the present embodiments can beneficially use the cross-component coding tools of the codec (e.g., CC-ALF and CCLM of VVC) to improve compression efficiency.
Those of ordinary skill in the art should note that, according to the present disclosure, the depth image and the selection map need not be encoded with the same quality. In one embodiment, for example, the quality settings of the encoding are set differently for the selection map and the depth image. For instance, HEVC- and VVC-compatible codecs support individual delta QP signaling for the luma and chroma channels. In one embodiment, these quality settings are adjustable, thereby beneficially allowing for simple and efficient tuning of the quality between the luma channel and the chroma channels. For example, in one embodiment, the selection map is encoded with a higher quality (e.g., a lower QP and/or higher resolution) than the depth image. Additionally, in some embodiments, the selection map is losslessly encoded, while the depth image is lossy encoded.
There are advantages to encoding the selection map with a quality that is higher than that of the depth image for a limited bitrate. One advantage, for example, is that the receiving device can filter the depth image using the higher-quality selection map, thereby better preserving the edges of a foreground object of the depth image. Another advantage is that the lower quality of the compressed depth image matters less in subsequent processing; for example, the smoothing effects associated with low-bitrate compression of a foreground object in the depth image are typically less noticeable after rendering.
Signaling the packed structure
According to embodiments of the present disclosure, the sending device is configured to convey information about the structure of the packed depth image, the selection map, and/or the texture image to the receiving device. For example, in at least one embodiment of the present disclosure, which is described in more detail below, the sending device is configured to convey this information to the receiving device in a message signaled in the bitstream. In other embodiments, however, the information is conveyed to the receiving device in some other predetermined manner.
Regardless of the manner of conveyance, however, the information conveyed to the receiving device may be any information needed or desired. In one embodiment, that information includes, but is not limited to one or more of:
• the number of packed images and/or maps in the bitstream;
• the type of images and/or maps in the bitstream;
• the channel(s) that an image and/or map occupies in the picture;
• the vertical and horizontal positions of an image or map;
• the width and height of an image or map; and/or
• the orientation of an image or map.
As stated above, the sending device may be configured in one embodiment to signal this information to the receiving device in a message over the bitstream. The message may be, for example, a Supplemental Enhancement Information (SEI) message. The following tables illustrate example structure, syntax, and semantics for an SEI message according to at least one embodiment of the present disclosure. As can be seen in the following tables, the information carried in the SEI message in at least one embodiment includes, but is not limited to, information about the component(s) (i.e., channel(s)), positions, heights, widths, and orientations of the images and maps packed into the first picture.
It should be noted here that the structures in the following tables refer to both the images (i.e., the depth image and the texture image) and the selection map as “maps.” For example, “num_maps” indicates the number of depth images, texture images, and selection maps present in the packed structure. Thus, the term “map” in the following tables is not associated solely with the selection map.
Table 1 - SEI message syntax for the packing information (reproduced as an image in the original document)
• num_maps indicates the number of maps and/or images in the packed structure.
• map_interpretation [i] indicates the interpretation of the map and/or image [i]. The value of map_interpretation [i] shall be in the range of 0 to N-1, inclusive, and has the following interpretation:
Table 2 - interpretation of map_interpretation [i] values (reproduced as an image in the original document)
• map_component [i] indicates the component into which map and/or image [i] is packed. The value of map_component [i] shall be in the range of 0 to 3, inclusive, and has the following interpretation:
Table 3 - interpretation of map_component [i] values (reproduced as an image in the original document)
• map_pos_x [i] indicates the horizontal position of the top left corner of the map and/or image [i] in terms of luma samples.
• map_pos_y [i] indicates the vertical position of the top left corner of the map and/or image [i] in terms of luma samples.
• map_width [i] indicates the width of the map and/or image [i] in terms of luma samples.
• map_height [i] indicates the height of the map and/or image [i] in terms of luma samples.
• map_orientation [i] indicates the orientation of the map and/or image [i]. The value of map_orientation [i] shall be in the range of 0 to 3, inclusive, and has the following interpretation:
Table 4 - interpretation of map_orientation [i] values (reproduced as an image in the original document)
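Since the exact SEI syntax table is only reproduced as an image in the original document, the following Python sketch merely illustrates, under assumed fixed-width fields, how the semantics listed above (num_maps and, per map, interpretation, component, position, size, and orientation) could be serialized and parsed. It is not the actual SEI syntax; the byte layout is a hypothetical choice made for illustration.

```python
import struct

FMT = ">BBHHHHB"  # interpretation, component, pos_x, pos_y, width, height, orientation
KEYS = ("interpretation", "component", "pos_x", "pos_y", "width", "height", "orientation")

def serialize_packing_info(maps):
    """Serialize num_maps followed by the per-map fields described above."""
    payload = struct.pack(">B", len(maps))                     # num_maps
    for m in maps:
        payload += struct.pack(FMT, *(m[k] for k in KEYS))
    return payload

def parse_packing_info(payload):
    """Parse the payload back into a list of per-map dictionaries."""
    num_maps = payload[0]
    maps, offset = [], 1
    for _ in range(num_maps):
        fields = struct.unpack_from(FMT, payload, offset)
        offset += struct.calcsize(FMT)
        maps.append(dict(zip(KEYS, fields)))
    return maps
```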
Those of ordinary skill in the art should appreciate that the present disclosure is not limited to conveying this information using an SEI message, as in the tables above. Rather, in some embodiments, the sending device can be configured to signal this information using systems-level signaling, such as MPEG-DASH, the ISO base media file format (or other standards based on the ISO base media file format), or by using WebSockets or HTTP signaling such as in WebRTC.
Rendering a volumetric video
As previously described, the receiving device is, in some embodiments, configured to use the depth image, the selection map, and the texture image to generate and render a volumetric video (e.g., a mesh or a point cloud). Additionally, however, the receiving device may also use metadata to generate and render the volumetric video. Such metadata includes, for example, parameters that are intrinsic to the camera that captured the images and may be provided to the receiving device by the sending device.
Sending/receiving the bitstream
In general, the sending device sends the bitstream comprising the encoded picture to the receiving device via a network, such as network 42. Upon receipt of the bitstream, the receiving device performs further processing, including, as described above, decoding the picture from the bitstream and extracting the selection map, depth image and the texture image from the decoded picture.
In the previous embodiments, the receiving device is a user device, such as a mobile device, a tablet computer, or an HMD, for example. However, those of ordinary skill in the art should appreciate that the present disclosure is not so limited. In some embodiments, such as will be described later in more detail, the receiving device is a node in the network (e.g., a cloud or edge computing server). In such embodiments, a network node functioning as the receiving device may, for example, be used to relay, transcode, or perform split rendering of a bitstream sent by the sending device. In this context, the network may, for example, comprise a wired or wireless network (e.g., a 5G network).
In more detail, the video data sent by the sending device (e.g., the texture image, the depth image, and the selection map) may be transcoded by the receiving device (i.e., functioning as a network node) as part of a cloud service, for example. Once transcoded, the network node is configured to recompress (i.e., reencode) a representation of the volumetric video comprised in the video data and send the reencoded representation to another receiving device (e.g., a user’s mobile device or HMD). In one embodiment, the representation of the volumetric video comprises a full volumetric video. In another embodiment, the representation of the volumetric video is a reduced representation (e.g., a 2D representation of a viewport of the volumetric video). In these latter embodiments, the network node may be configured to encode the reduced representation to a 2D video stream and transmit that video stream to an end-user with a device capable of decoding a 2D video stream. This process is often referred to as “split rendering.”
Accordingly, the present disclosure configures a sending device to perform all or a subset of the following functions:
• Obtain a depth image;
• Obtain a selection map (e.g., an occupancy map);
• Obtain a texture image (the texture, depth image, and selection map may correspond to the same content. That is, together they may represent a volumetric image or video object);
• Pack the depth image and the selection map into a first picture with a color format having at least two channels. In some embodiments, the depth image is packed in a first channel and the selection map is packed in a second channel that is different from the first channel. Additionally, in at least some embodiments, at least one of the first and second channels is a color channel. The channels may or may not have the same resolution, and the spatial positions of the depth image and selection map may be collocated such that a spatial position in the depth image corresponds to the same spatial position of the selection map;
• Spatially pack the texture image into the first picture;
• Encode the first picture into a bitstream. For example, the sending device may encode the first picture into a bitstream in conformance with a mainstream profile of a mainstream codec. The sending device may encode the first picture such that the selection map is encoded with a higher quality, a lower QP, and/or a higher resolution than the depth image;
• Encode information about the packing of the depth image, selection map, and in some embodiments, the texture image, into the first picture to the bitstream; and
• Send the bitstream to the receiving device. As previously described, the receiving device may be a user device (e.g., a mobile device or HMD) or a computer node in the network.
Additionally, the present disclosure configures a receiving device to perform all or a subset of the following functions:
• Receive a bitstream. As stated above, the bitstream, which may be a video bitstream, for example, may conform to a mainstream profile of a mainstream codec;
• Decode a first picture from the bitstream. Decoding may be performed by a decoder that conforms to a mainstream profile of a mainstream codec;
• Decode information about the packing of depth image, the selection map, and when included, the texture image, into the picture from the bitstream;
• Obtain the first picture. According to at least one embodiment, the first picture has a color format having two or more channels (e.g., a YCbCr/YUV or RGB color format);
• Extract a depth image from the first channel of the first picture;
• Extract a selection map from the second channel of the first picture. The second channel, in at least one embodiment, is different from the first channel. Further, the selection map may have been encoded with a higher quality, a lower QP, and/or a higher resolution than the depth image. At least one of the first and second channels may be a color channel, and the channels may or may not have the same resolution. Additionally, the spatial positions of the depth image and the selection map may be collocated such that a spatial position in the depth image would correspond to the same spatial position of the selection map.
• Extract a texture image from the first picture. In these cases, the sending device spatially packed the texture image with the depth image and selection map. The texture image, the depth image, and the selection map may correspond to the same content.
• Use the depth image, the selection map, and the texture to generate a volumetric video, e.g., a mesh or a point cloud. Metadata, such as intrinsic camera parameters, may also be provided for this.
• Render the volumetric video to a display device, such as a user’s mobile device, the display of a computer device, a HMD, and the like.
Transcoding in the network
As stated above, one embodiment of the present disclosure, illustrated in Figure 8, configures a network node operating in network 42 to transcode the bitstream received from sending device 50 before sending a representation of a volumetric video to receiving device 70. In these embodiments, the network node is considered to be a first receiving device 120, while receiving device 70 is considered to be a second receiving device.
According to at least one embodiment, the first receiving device 120 (e.g., a network node such as a cloud server or edge server) transcodes the video data received from sending device 50 in bitstream 44 as part of a cloud service, for example. A representation of the volumetric video comprised in the video data is then reencoded/recompressed by the first receiving device 120 and sent to the receiving device 70. In one embodiment, the representation comprises the full volumetric video. In another aspect, the representation is a reduced representation (e.g., a 2D representation of a viewport of the volumetric video). As stated above, the reduced representation may be encoded to a 2D video stream and transmitted to an end-user device capable of decoding the 2D video stream. In this embodiment, first receiving device 120 comprises a network node, such as a cloud server or edge server, located in network 42. As seen in Figure 8, first receiving device 120 comprises a receiver 122, one or more decoders 124, post-processing circuitry 126, one or more encoders 128, and a transmitter 130. Each of these components is similar to those described previously with respect to sending device 50 and receiving device 70, and thus, they are not described in detail here.
In operation, first receiving device 120 receives one or more bitstreams 44 from sending device 50, decodes the one or more bitstreams, and generates the volumetric video as previously described. However, in this embodiment, first receiving device 120 does not render the volumetric video to a display. Instead, the volumetric video is re-encoded by the one or more encoders 128 and sent to receiving device 70 in one or more bitstreams 132. In these embodiments, the re-encoded volumetric video may comprise the full volumetric video or a reduced representation of the volumetric video. In one embodiment, where the reduced representation of the volumetric video is generated and sent to receiving device 70, a viewport is first selected for the reduced representation of the volumetric video (e.g., a 2D video picture). The reduced representation of the volumetric video is then encoded using a conventional mainstream video codec, for example, and sent to receiving device 70. For example, the receiving device 70 may be capable of decoding and displaying a 2D video stream but may not be capable of decoding multiple streams and rendering a volumetric video.
Accordingly, a receiving device 120 may perform all or a subset of the following functions according to one or more embodiments of the present disclosure, as previously described.
• Receive one or more bitstreams from a sending device;
• Receive metadata, e.g., intrinsic camera parameters, used for generating a mesh;
• Decode the one or more bitstreams into a decoded texture image, a decoded depth image, and a decoded selection map;
• Generate a volumetric video (e.g., a mesh or a point cloud) from the decoded texture image, the decoded depth image, the decoded selection map, and the metadata when sent by the sending device;
• Select and encode/transcode a representation of the volumetric video to an outgoing bitstream. The representation may comprise the full volumetric video or a reduced representation of the volumetric video (e.g., a viewport of the volumetric video encoded as a 2D video); and
• Send the outgoing bitstream to a second receiving device.
Figure 9 is a flow diagram illustrating a method 140 for processing video data, and more specifically, video data for use in generating volumetric video, for example. Method 140 is implemented by sending device 50 and comprises, in one embodiment, sending device 50 obtaining a depth image from a depth image source (box 142), and obtaining a selection map (box 144). In this embodiment, the depth image comprises information associated with a distance of an object in a picture from a viewpoint, and the selection map represents the picture partitioned into a plurality of areas. In some embodiments, method 140 also comprises sending device 50 obtaining a texture image from a texture image source (box 146). Although a texture image is not required for this embodiment of the present disclosure, it can be utilized at a receiving device to generate volumetric video, as will be described later in more detail.
Method 140 then determines whether the depth image, the selection map, and, when obtained, the texture image all have the same bit depth (box 148). If so, sending device 50 packs the depth image and the selection map into a first picture having a color format with a plurality of channels (box 152). In this embodiment, the depth image and the selection map are packed into first and second channels, respectively. At least one of the first and second channels is a color channel. If the bit depths are not the same, however, method 140 calls for sending device 50 to first align the depth image, the selection map, and, when obtained, the texture image (box 150) and then perform the packing (box 152). In embodiments where sending device 50 has obtained a texture image, method 140 then calls for sending device 50 to additionally spatially pack the texture image into the first picture (box 154).
In addition to the packed video data, method 140 also calls for sending device 50 to send information associated with the packing of one or more of the depth image, the selection map, and the texture image to the receiving device over the network (box 156), as well as metadata (box 158). The metadata, in one embodiment, comprises parameters intrinsic to a camera that captured one or both of the depth image and the texture image.
Regardless of whether sending device 50 does or does not send the information about the packing and/or the metadata to the receiving device, method 140 calls for sending device 50 to encode the first picture (box 160) and to send the encoded first picture in a bitstream to the receiving device over a network (box 162).
In one embodiment, the depth image is obtained from a depth image source, such as a camera, for example. However, according to the present disclosure, there are various ways to obtain the selection map. By way of example, Figure 10 is a flow diagram illustrating a method 170 for obtaining the selection map. In one embodiment, for instance, the selection map is obtained from a sensor (box 172). The sensor may be, for example, the camera that captured the depth image. In the same or another embodiment, the selection map is obtained from a file in a storage device (box 174). Additionally, in the same or a different embodiment, the selection map may be derived from the depth image obtained by sending device 50 (box 176).
Figure 11 is a flow diagram illustrating a method 180 for deriving the selection map from the depth image. In particular, in one embodiment, sending device 50 partitions the depth image into a plurality of areas based on depth values of the depth image (box 182). So partitioned, sending device 50 then generates the selection map to comprise a plurality of unique sample values (box 184), each of which corresponds to one of the plurality of areas of the partitioned depth image.
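A minimal sketch of method 180 follows, assuming the partitioning of box 182 is performed with simple depth thresholds; the threshold values and function names are illustrative assumptions only.

```python
import numpy as np

def derive_selection_map(depth: np.ndarray, thresholds=(256, 512, 768)) -> np.ndarray:
    """Partition the depth image into areas based on its depth values (box 182) and
    generate a selection map holding one unique sample value per area (box 184)."""
    # np.digitize maps each depth sample to the index of the depth band it falls in,
    # so every band (area) receives its own unique selection-map value (0, 1, 2, ...).
    return np.digitize(depth, bins=list(thresholds)).astype(np.uint8)

depth = np.random.randint(0, 1024, (480, 640), dtype=np.uint16)
selection_map = derive_selection_map(depth)   # sample values 0..3, one per area
```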
In one embodiment of the present disclosure, the second channel precedes the first channel in one or both of a bitstream and processing order. Alternatively, the second channel follows the first channel in one or both of a bitstream and processing order.
In one embodiment, the plurality of channels comprises a third channel. In this embodiment, the third channel comprises a second color channel, and one or more content attributes are packed into the third channel.
In one embodiment, the one or more content attributes comprise information associated with one or more of:
• a different type of selection map;
• an occlusion map describing one or more occluded objects in a picture;
• an alpha channel describing a transparency of the picture;
• an ambient lighting map comprising information associated with ambient light; and
• material properties of a 3D object or the picture.
In one embodiment, the depth image and the selection map are packed into the first picture such that the depth image and the selection map are decoded and post-processed prior to the texture image.
In one embodiment, the first picture comprises a plurality of subpictures. In such an embodiment, the depth image and the selection map are packed into separate components of a first subpicture and the texture image is packed into a second subpicture.
In one embodiment, the information associated with the packing signaled to the receiving device comprises any information needed or desired for the receiving device to unpack the first picture. For example, in one embodiment, such information identifies one or more of the following (an illustrative signaling structure is sketched after the list):
• a number and/or type of one or more of the depth image, the texture image, and the selection map in the first picture;
• which channels the one or more of the depth image, the texture image, and the selection map occupy in the first picture;
• vertical and horizontal positions of the one or more of the depth image, the texture image, and the selection map in the first picture;
• dimensions of the one or more of the depth image, the texture image, and the selection map in the first picture; and
• an orientation of the one or more of the depth image, the texture image, and the selection map in the first picture.
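The syntax used to carry this information is not mandated by the present disclosure; the hypothetical Python structure below merely illustrates the kind of fields such packing information could contain (all field and type names are assumptions).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PackedComponentInfo:
    """Hypothetical description of one packed component of the first picture."""
    component_type: str        # e.g., "depth", "texture", or "selection_map"
    channels: List[int]        # which channel(s) of the first picture the component occupies
    x: int                     # horizontal position within the first picture
    y: int                     # vertical position within the first picture
    width: int                 # dimensions within the first picture
    height: int
    orientation_degrees: int = 0   # e.g., 0, 90, 180, or 270

@dataclass
class PackingInfo:
    """Hypothetical container for the packing information sent to the receiving device."""
    components: List[PackedComponentInfo] = field(default_factory=list)

packing_info = PackingInfo(components=[
    PackedComponentInfo("depth",         channels=[0], x=0,   y=0, width=640, height=480),
    PackedComponentInfo("selection_map", channels=[1], x=0,   y=0, width=640, height=480),
    PackedComponentInfo("texture",       channels=[0], x=640, y=0, width=640, height=480),
])
```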
Figure 12 is a flow diagram illustrating a method 190, implemented by a receiving device such as receiving device 70 or receiving device 120, for processing the video data received from sending device 50. In one embodiment, method 190 calls for the receiving device to obtain a first picture having a color format with a plurality of channels (box 192). In this embodiment, a depth image and a selection map are packed into first and second channels of the first picture, respectively, and at least one of the first and second channels is a color channel. In some embodiments, the receiving device also receives information associated with the packing of the depth image, the selection map, and, in some embodiments, the texture image (box 194). Regardless of whether that information is received, method 190 calls for the receiving device to decode the first picture from a first bitstream (box 196) and, when provided, the information associated with the packing from the first bitstream (box 198). As stated above, the receiving device may also receive metadata (e.g., the intrinsic camera parameters) from the sending device (box 200), although this is not required.
Next, method 190 calls for the receiving device to extract both the depth image (box 202) and the selection map (box 204) from the first and second channels of the first picture, respectively. In embodiments where the receiving device is a user device (e.g., when receiving device 70 is a mobile device or a HMD), method 190 calls for receiving device 70 to generate volumetric video based on the depth image and the selection map (box 208). In some embodiments, the generated volumetric video is rendered to the user’s associated display device (box 210).
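The sketch below illustrates boxes 202, 204, and 208 for the point-cloud case, assuming a pinhole camera model described by the intrinsic parameters fx, fy, cx, and cy carried in the metadata of box 200, and assuming the same channel layout as the packing example above; these assumptions are illustrative only.

```python
import numpy as np

def extract_components(first_picture: np.ndarray, width: int):
    """Extract the depth image (box 202) and the selection map (box 204) from the
    first and second channels of the decoded first picture."""
    depth = first_picture[0, :, :width]
    selection = first_picture[1, :, :width]
    return depth, selection

def generate_point_cloud(depth, selection, fx, fy, cx, cy, depth_scale=0.001):
    """Back-project the selected depth samples into 3D points (box 208); only samples
    whose selection-map value is non-zero contribute to the point cloud."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]                       # pixel row and column indices
    keep = selection > 0
    z = depth[keep].astype(np.float64) * depth_scale
    x = (u[keep] - cx) * z / fx
    y = (v[keep] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)             # (N, 3) array of 3D points

# Illustrative decoded picture and intrinsic parameters.
decoded_picture = np.random.randint(0, 1024, (3, 480, 1280), dtype=np.uint16)
depth, selection = extract_components(decoded_picture, width=640)
points = generate_point_cloud(depth, selection, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```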
In one embodiment, a texture image is spatially packed into the first picture with the depth image and the selection map. In such embodiments, the receiving device extracts the texture image from the first picture, and uses the texture image to generate the volumetric video.
In one embodiment, the first picture is received in a first bitstream from the sending device over a network. In such embodiments, method 190 calls for the receiving device to decode the first picture from the first bitstream.
In at least one embodiment, the receiving device further receives information from the sending device over the network. In this embodiment, the information is associated with the packing of one or more of the depth image, the selection map, and the texture image and identifies one or more of:
• a number and/or type of one or more of the depth image, the texture image, and the selection map in the first picture;
• which channels the one or more of the depth image, the texture image, and the selection map occupy in the first picture;
• vertical and horizontal positions of the one or more of the depth image, the texture image, and the selection map in the first picture;
• dimensions of the one or more of the depth image, the texture image, and the selection map in the first picture; and
• an orientation of the one or more of the depth image, the texture image, and the selection map in the first picture.
In one embodiment, the selection map is encoded with a lower quantization parameter (QP) and/or a higher resolution than the depth image.
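One way this could be achieved in codecs that support per-chroma QP offsets is sketched below as a hypothetical encoder configuration; the parameter names are assumptions and do not correspond to any particular encoder's API.

```python
# Hypothetical encoder settings (names are illustrative only): the chroma channel
# carrying the selection map is given a lower effective QP than the luma channel
# carrying the depth image, and a 4:4:4 format keeps it at full resolution.
encoder_settings = {
    "color_format": "yuv444",
    "luma_qp": 32,            # QP for the channel carrying the depth image
    "chroma_qp_offset": -6,   # lower effective QP for the channel carrying the selection map
}
```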
In one embodiment, the plurality of channels comprises exactly three channels.
In one embodiment, the color format of the first picture is one of a Red Green Blue (RGB) color format and a YUV color format.
In one embodiment, the depth image and the selection map are packed into the first picture for encoding using a multi-component color format comprising a YUV format. In such embodiments:
• the depth image is signaled in the luminance component of the YUV format, and the selection map is signaled in one of the color components of the YUV format; or
• the selection map is signaled in the luminance component of the YUV format, and the depth image is signaled in one of the color components of the YUV format; or
• the selection map is signaled in a first color component of the YUV format, and the depth image is signaled in a second color component of the YUV format.
In one embodiment, at least two channels of the plurality of channels have different resolutions.
In one embodiment, spatial positions of the depth image and the selection map are collocated in the first picture such that a spatial position of the depth image corresponds to a spatial position of the selection map.
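Where a chroma channel has a lower resolution than the luma channel (e.g., a 4:2:0 format), the selection map may be resampled so that its samples remain collocated with the depth samples; a nearest-neighbour sketch under these assumptions follows.

```python
import numpy as np

def downsample_selection_map(selection: np.ndarray) -> np.ndarray:
    """Nearest-neighbour 2x downsampling so that a selection map carried in a
    half-resolution chroma channel stays collocated with the depth samples."""
    return selection[::2, ::2]

def upsample_selection_map(selection_chroma: np.ndarray) -> np.ndarray:
    """Inverse operation at the receiving device: each chroma sample is repeated so
    that every depth sample again has a collocated selection-map value."""
    return np.repeat(np.repeat(selection_chroma, 2, axis=0), 2, axis=1)

selection = np.random.randint(0, 2, (480, 640), dtype=np.uint8)
restored = upsample_selection_map(downsample_selection_map(selection))
assert restored.shape == selection.shape
```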
In one embodiment, the selection map comprises a plurality of sample map values and is one of the following (one possible handling of these variants at the receiving device is sketched after the list):
• an occupancy map indicating which of a plurality of areas of the depth image are selected for output and/or processing at the receiving device;
• a binary map wherein each sample map value comprises one of two possible sample map values;
• a non-binary map wherein each sample map value comprises one of three or more possible sample map values; and
• a multi-threshold map.
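By way of example, the variants listed above could be handled at the receiving device along the lines of the following sketch (the mode names and selected values are assumptions):

```python
import numpy as np

def select_samples(depth: np.ndarray, selection: np.ndarray, mode: str = "occupancy",
                   selected_values=(1,)) -> np.ndarray:
    """Return a boolean mask of depth samples selected for output and/or processing.

    "occupancy"/"binary": any non-zero selection-map value marks a selected area.
    "non_binary"/"multi_threshold": only areas whose selection-map value appears in
    selected_values are kept."""
    if mode in ("occupancy", "binary"):
        return selection > 0
    return np.isin(selection, selected_values)

depth = np.random.randint(0, 1024, (480, 640), dtype=np.uint16)
selection = np.random.randint(0, 4, (480, 640), dtype=np.uint8)
mask = select_samples(depth, selection, mode="multi_threshold", selected_values=(1, 2))
selected_depth = depth[mask]
```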
In one embodiment, the volumetric video is one of a mesh and a point cloud.
In one embodiment, the sending device comprises one of:
• a mobile device;
• a computing device;
• a Head Mounted Display (HMD);
• a Virtual Reality (VR) device;
• an augmented reality (AR) device;
• a mixed reality device; and
• a wearable device comprising one or more sensors.
Additionally, in one embodiment, the receiving device comprises one of:
• a mobile device;
• a computing device;
• a Head Mounted Display (HMD);
• a Virtual Reality (VR) device;
• an augmented reality (AR) device;
• a mixed reality device;
• a network node; and
• a wearable device comprising one or more sensors.
In one embodiment, information about the packing of the depth image and the selection map into the first picture is signaled to the receiving device.
In one embodiment, as stated above, a first receiving device 120 in the communications network 42 implements method 220. In such embodiments, the first receiving device 120 is configured to perform the same functions described with respect to method 190 of Figure 12, as well as some additional functions, which are illustrated in Figure 13. It should be noted here that method 220 of Figure 13 is largely the same as method 190 previously described with respect to Figure 12, except for the generating and rendering functions (boxes 208, 210). Therefore, functions that are the same in Figures 12 and 13 are indicated using the same reference numbers and are not detailed here.
As seen in Figure 13, once the first receiving device 120 has extracted the depth image, the selection map, and, when provided, the texture image (boxes 202, 204, 206), the first receiving device 120 generates a second bitstream comprising a representation of either the mesh or the point cloud (box 222). So generated, the first receiving device 120 sends the generated second bitstream to a second receiving device, which in one embodiment is receiving device 70 (box 224).
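Box 222 may be realized in many ways; purely as an illustration, and assuming the chosen representation is an ASCII point-cloud payload (a PLY-style layout), the second bitstream could be generated as follows.

```python
import numpy as np

def point_cloud_to_bitstream(points: np.ndarray) -> bytes:
    """Serialize an (N, 3) array of 3D points into a simple ASCII PLY payload that
    can be carried in the second bitstream (box 222)."""
    header = (
        "ply\nformat ascii 1.0\n"
        f"element vertex {len(points)}\n"
        "property float x\nproperty float y\nproperty float z\n"
        "end_header\n"
    )
    body = "".join(f"{x:.4f} {y:.4f} {z:.4f}\n" for x, y, z in points)
    return (header + body).encode("ascii")

# The resulting bytes would then be sent to the second receiving device (box 224).
second_bitstream = point_cloud_to_bitstream(np.array([[0.1, 0.2, 1.5], [0.3, -0.1, 2.0]]))
```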
An apparatus can perform any of the methods herein described by implementing any functional means, modules, units, or circuitry. In one embodiment, for example, the apparatuses comprise respective circuits or circuitry configured to perform the steps shown in the method figures. The circuits or circuitry in this regard may comprise circuits dedicated to performing certain functional processing and/or one or more microprocessors in conjunction with memory. For instance, the circuitry may include one or more microprocessors or microcontrollers, as well as other digital hardware, which may include Digital Signal Processors (DSPs), special-purpose digital logic, and the like. The processing circuitry may be configured to execute program code stored in memory, which may include one or several types of memory such as read-only memory (ROM), random-access memory, cache memory, flash memory devices, optical storage devices, etc. Program code stored in memory may include program instructions for executing one or more telecommunications and/or data communications protocols as well as instructions for carrying out one or more of the techniques described herein, in several embodiments. In embodiments that employ memory, the memory stores program code that, when executed by the one or more processors, carries out the techniques described herein.
Figure 14 illustrates an example sending device 50 configured for processing video data, and more specifically, for packing video data to send to a receiving device. As seen in Figure 14, sending device 50 comprises processing circuitry 400, a memory 402, and communication circuitry 406.
The communication circuitry 406 comprises the hardware required for communicating with a receiving device 70, 120 via network 42. In some embodiments, sending device 50 is capable of wireless communications, and thus comprises the radio frequency (RF) circuitry needed for transmitting and receiving signals over a wireless communication channel. In these cases, sending device 50 may also be coupled to one or more antennas (not shown). In other embodiments, however, sending device 50 is configured to communicate via a wired interface. In these embodiments, communication circuitry 406 may comprise an ETHERNET or similar interface.
The processing circuitry 400 controls the overall operation of the sending device 50 and processes the signals and data that are transmitted to or received by the sending device 50. The processing circuitry 400 may comprise one or more microprocessors, hardware, firmware, or a combination thereof. The processing circuitry 400 in one embodiment is configured to implement any of the methods of any of Figures 9-11.
Memory 402 comprises both volatile and non-volatile memory for storing computer program code and data needed by the processing circuit 400 for operation. Memory 402 may comprise any tangible, non-transitory computer-readable storage medium for storing data including electronic, magnetic, optical, electromagnetic, or semiconductor data storage. Memory 402 stores a computer program 404 comprising executable instructions that configure the processing circuitry 400 to implement any of the methods of any of Figures 9-11. A computer program 404 in this regard may comprise one or more code modules corresponding to the means or units described above. In general, computer program instructions and configuration information are stored in a non-volatile memory, such as a ROM, erasable programmable read only memory (EPROM) or flash memory. Temporary data generated during operation may be stored in a volatile memory, such as a random access memory (RAM). In some embodiments, computer program 404 for configuring the processing circuit 400 as herein described may be stored in a removable memory, such as a portable compact disc, portable digital video disc, or other removable media. The computer program 404 may also be embodied in a carrier such as an electronic signal, optical signal, radio signal, or computer readable storage medium.
Figure 15 illustrates an example receiving device 70 configured for processing video data received from sending device 50. As seen in Figure 15, receiving device 70 comprises processing circuitry 500, a memory 502, and communication circuitry 506.
The communication circuitry 506 comprises the hardware required for communicating with sending device 50 and/or the first receiving device 120 via network 42. In some embodiments, receiving device 70 is capable of wireless communications, and thus comprises the radio frequency (RF) circuitry needed for transmitting and receiving signals over a wireless communication channel. In these cases, receiving device 70 may also be coupled to one or more antennas (not shown). In other embodiments, however, receiving device 70 is configured to communicate via a wired interface. In these embodiments, communication circuitry 506 may comprise an ETHERNET or similar interface.
The processing circuitry 500 controls the overall operation of the receiving device 70 and processes the signals and data that are transmitted to or received by the receiving device 70. The processing circuitry 500 may comprise one or more microprocessors, hardware, firmware, or a combination thereof. The processing circuitry 500 in one embodiment is configured to implement any of the methods of any of Figures 12-13.
Memory 502 comprises both volatile and non-volatile memory for storing computer program code and data needed by the processing circuit 500 for operation. Memory 502 may comprise any tangible, non-transitory computer-readable storage medium for storing data including electronic, magnetic, optical, electromagnetic, or semiconductor data storage. Memory 502 stores a computer program 504 comprising executable instructions that configure the processing circuitry 500 to implement any of the methods of any of Figures 12-13. A computer program 504 in this regard may comprise one or more code modules corresponding to the means or units described above. In general, computer program instructions and configuration information are stored in a non-volatile memory, such as a ROM, erasable programmable read only memory (EPROM) or flash memory. Temporary data generated during operation may be stored in a volatile memory, such as a random-access memory (RAM). In some embodiments, computer program 504 for configuring the processing circuit 500 as herein described may be stored in a removable memory, such as a portable compact disc, portable digital video disc, or other removable media. The computer program 504 may also be embodied in a carrier such as an electronic signal, optical signal, radio signal, or computer readable storage medium.
Figure 16 illustrates an example first receiving device 120 configured for processing video data received from sending device 50, and further, for transcoding the video data to send to a second receiving device, such as receiving device 70, for example. As seen in Figure 16, receiving device 120 comprises processing circuitry 600, a memory 602, and communication circuitry 606.
The communication circuitry 606 comprises a network interface having the hardware required for communicating with sending device 50 and/or the second receiving device 70 via network 42. In some embodiments, receiving device 120 is capable of wireless communications, and thus comprises the radio frequency (RF) circuitry needed for transmitting and receiving signals over a wireless communication channel. In these cases, receiving device 120 may also be coupled to one or more antennas (not shown). In other embodiments, however, receiving device 120 is configured to communicate via a wired interface. In these embodiments, communication circuitry 606 may comprise an ETHERNET or similar interface.
The processing circuitry 600 controls the overall operation of the receiving device 120 and processes the signals and data that are transmitted to or received by the receiving device 120. The processing circuitry 600 may comprise one or more microprocessors, hardware, firmware, or a combination thereof. The processing circuitry 600 in one embodiment is configured to implement method 220 of Figure 13.
Memory 602 comprises both volatile and non-volatile memory for storing computer program code and data needed by the processing circuit 600 for operation. Memory 602 may comprise any tangible, non-transitory computer-readable storage medium for storing data including electronic, magnetic, optical, electromagnetic, or semiconductor data storage. Memory 602 stores a computer program 604 comprising executable instructions that configure the processing circuitry 600 to implement method 220 of Figure 13. A computer program 604 in this regard may comprise one or more code modules corresponding to the means or units described above. In general, computer program instructions and configuration information are stored in a non-volatile memory, such as a ROM, erasable programmable read only memory (EPROM) or flash memory. Temporary data generated during operation may be stored in a volatile memory, such as a random-access memory (RAM). In some embodiments, computer program 604 for configuring the processing circuit 600 as herein described may be stored in a removable memory, such as a portable compact disc, portable digital video disc, or other removable media. The computer program 604 may also be embodied in a carrier such as an electronic signal, optical signal, radio signal, or computer readable storage medium.
Those skilled in the art will also appreciate that embodiments herein further include corresponding computer programs. A computer program comprises instructions which, when executed on at least one processor of an apparatus, cause the apparatus to carry out any of the respective processes described above. A computer program in this regard may comprise one or more code modules corresponding to the means or units described above.
Embodiments further include a carrier containing such a computer program. This carrier may comprise one of an electronic signal, optical signal, radio signal, or computer readable storage medium.
In this regard, embodiments herein also include a computer program product stored on a non-transitory computer readable (storage or recording) medium and comprising instructions that, when executed by a processor of an apparatus, cause the apparatus to perform as described above.
Embodiments further include a computer program product comprising program code portions for performing the steps of any of the embodiments herein when the computer program product is executed by a computing device. This computer program product may be stored on a computer readable recording medium.
Additional embodiments will now be described. At least some of these embodiments may be described as applicable in certain contexts and/or wireless network types for illustrative purposes, but the embodiments are similarly applicable in other contexts and/or wireless network types not explicitly described.
Figure 17 shows an example of a communication system 1100 in accordance with some embodiments.
In the example, the communication system 1100 includes a telecommunication network 1102 that includes an access network 1104, such as a radio access network (RAN), and a core network 1106, which includes one or more core network nodes 1108. The access network 1104 includes one or more access network nodes, such as network nodes 1110a and 1110b (one or more of which may be generally referred to as network nodes 1110), or any other similar 3rd Generation Partnership Project (3GPP) access nodes or non-3GPP access points. Moreover, as will be appreciated by those of skill in the art, a network node is not necessarily limited to an implementation in which a radio portion and a baseband portion are supplied and integrated by a single vendor. Thus, it will be understood that network nodes include disaggregated implementations or portions thereof. For example, in some embodiments, the telecommunication network 1102 includes one or more Open-RAN (ORAN) network nodes. An ORAN network node is a node in the telecommunication network 1102 that supports an ORAN specification (e.g., a specification published by the O-RAN Alliance, or any similar organization) and may operate alone or together with other nodes to implement one or more functionalities of any node in the telecommunication network 1102, including one or more network nodes 1110 and/or core network nodes 1108.
Examples of an ORAN network node include an open radio unit (O-RU), an open distributed unit (O-DU), an open central unit (O-CU), including an O-CU control plane (O-CU-CP) or an O-CU user plane (O-CU-UP), a RAN intelligent controller (near-real time or non-real time) hosting software or software plug-ins, such as a near-real time control application (e.g., xApp) or a non-real time control application (e.g., rApp), or any combination thereof (the adjective “open” designating support of an ORAN specification). The network node may support a specification by, for example, supporting an interface defined by the ORAN specification, such as an A1, F1, W1, E1, E2, X2, Xn interface, an open fronthaul user plane interface, or an open fronthaul management plane interface. Moreover, an ORAN access node may be a logical node in a physical node. Furthermore, an ORAN network node may be implemented in a virtualization environment (described further below) in which one or more network functions are virtualized. For example, the virtualization environment may include an O-Cloud computing platform orchestrated by a Service Management and Orchestration Framework via an O2 interface defined by the O-RAN Alliance or comparable technologies.
The network nodes 1110 facilitate direct or indirect connection of user equipment (UE), such as by connecting UEs 1112A, 1112B, 1112C, and 1112D (one or more of which may be generally referred to as UEs 1112) to the core network 1106 over one or more wireless connections.
Example wireless communications over a wireless connection include transmitting and/or receiving wireless signals using electromagnetic waves, radio waves, infrared waves, and/or other types of signals suitable for conveying information without the use of wires, cables, or other material conductors. Moreover, in different embodiments, the communication system 1100 may include any number of wired or wireless networks, network nodes, UEs, and/or any other components or systems that may facilitate or participate in the communication of data and/or signals whether via wired or wireless connections. The communication system 1100 may include and/or interface with any type of communication, telecommunication, data, cellular, radio network, and/or other similar type of system.
The UEs 1112 may be any of a wide variety of communication devices, including wireless devices arranged, configured, and/or operable to communicate wirelessly with the network nodes 1110 and other communication devices. Similarly, the network nodes 1110 are arranged, capable, configured, and/or operable to communicate directly or indirectly with the UEs 1112 and/or with other network nodes or equipment in the telecommunication network 1102 to enable and/or provide network access, such as wireless network access, and/or to perform other functions, such as administration in the telecommunication network 1102.
In the depicted example, the core network 1106 connects the network nodes 1110 to one or more hosts, such as host 1116. These connections may be direct or indirect via one or more intermediary networks or devices. In other examples, network nodes may be directly coupled to hosts. The core network 1106 includes one or more core network nodes (e.g., core network node 1108) that are structured with hardware and software components. Features of these components may be substantially similar to those described with respect to the UEs, network nodes, and/or hosts, such that the descriptions thereof are generally applicable to the corresponding components of the core network node 1108. Example core network nodes include functions of one or more of a Mobile Switching Center (MSC), Mobility Management Entity (MME), Home Subscriber Server (HSS), Access and Mobility Management Function (AMF), Session Management Function (SMF), Authentication Server Function (AUSF), Subscription Identifier De-concealing function (SIDF), Unified Data Management (UDM), Security Edge Protection Proxy (SEPP), Network Exposure Function (NEF), and/or a User Plane Function (UPF).
The host 1116 may be under the ownership or control of a service provider other than an operator or provider of the access network 1104 and/or the telecommunication network 1102, and may be operated by the service provider or on behalf of the service provider. The host 1116 may host a variety of applications to provide one or more services. Examples of such applications include live and pre-recorded audio/video content, data collection services such as retrieving and compiling data on various ambient conditions detected by a plurality of UEs, analytics functionality, social media, functions for controlling or otherwise interacting with remote devices, functions for an alarm and surveillance center, or any other such function performed by a server.
As a whole, the communication system 1100 of Figure 17 enables connectivity between the UEs, network nodes, and hosts. In that sense, the communication system may be configured to operate according to predefined rules or procedures, such as specific standards that include, but are not limited to: Global System for Mobile Communications (GSM); Universal Mobile Telecommunications System (UMTS); Long Term Evolution (LTE), and/or other suitable 2G, 3G, 4G, 5G standards, or any applicable future generation standard (e.g., 6G); wireless local area network (WLAN) standards, such as the Institute of Electrical and Electronics Engineers (IEEE) 802.11 standards (WiFi); and/or any other appropriate wireless communication standard, such as the Worldwide Interoperability for Microwave Access (WiMax), Bluetooth, Z-Wave, Near Field Communication (NFC), ZigBee, LiFi, and/or any low-power wide-area network (LPWAN) standards such as LoRa and Sigfox.
In some examples, the telecommunication network 1102 is a cellular network that implements 3GPP standardized features. Accordingly, the telecommunications network 1102 may support network slicing to provide different logical networks to different devices that are connected to the telecommunication network 1102. For example, the telecommunications network 1102 may provide Ultra Reliable Low Latency Communication (URLLC) services to some UEs, while providing Enhanced Mobile Broadband (eMBB) services to other UEs, and/or Massive Machine Type Communication (mMTC)/Massive IoT services to yet further UEs.
In some examples, the UEs 1112 are configured to transmit and/or receive information without direct human interaction. For instance, a UE may be designed to transmit information to the access network 1104 on a predetermined schedule, when triggered by an internal or external event, or in response to requests from the access network 1104. Additionally, a UE may be configured for operating in single- or multi-RAT or multi-standard mode. For example, a UE may operate with any one or combination of Wi-Fi, NR (New Radio) and LTE, i.e. being configured for multi-radio dual connectivity (MR-DC), such as E-UTRAN (Evolved-UMTS Terrestrial Radio Access Network) New Radio - Dual Connectivity (EN-DC).
In the example, the hub 1114 communicates with the access network 1104 to facilitate indirect communication between one or more UEs (e.g., UE 1112C and/or 1112D) and network nodes (e.g., network node 1110B). In some examples, the hub 1114 may be a controller, router, content source and analytics, or any of the other communication devices described herein regarding UEs. For example, the hub 1114 may be a broadband router enabling access to the core network 1106 for the UEs. As another example, the hub 1114 may be a controller that sends commands or instructions to one or more actuators in the UEs. Commands or instructions may be received from the UEs, network nodes 1110, or by executable code, script, process, or other instructions in the hub 1114. As another example, the hub 1114 may be a data collector that acts as temporary storage for UE data and, in some embodiments, may perform analysis or other processing of the data. As another example, the hub 1114 may be a content source. For example, for a UE that is a VR headset, display, loudspeaker or other media delivery device, the hub 1114 may retrieve VR assets, video, audio, or other media or data related to sensory information via a network node, which the hub 1114 then provides to the UE either directly, after performing local processing, and/or after adding additional local content. In still another example, the hub 1114 acts as a proxy server or orchestrator for the UEs, in particular if one or more of the UEs are low energy IoT devices.
The hub 1114 may have a constant/persistent or intermittent connection to the network node 1110B. The hub 1114 may also allow for a different communication scheme and/or schedule between the hub 1114 and UEs (e.g., UE 1112C and/or 1112D), and between the hub 1114 and the core network 1106. In other examples, the hub 1114 is connected to the core network 1106 and/or one or more UEs via a wired connection. Moreover, the hub 1114 may be configured to connect to an M2M service provider over the access network 1104 and/or to another UE over a direct connection. In some scenarios, UEs may establish a wireless connection with the network nodes 1110 while still connected via the hub 1114 via a wired or wireless connection. In some embodiments, the hub 1114 may be a dedicated hub - that is, a hub whose primary function is to route communications to/from the UEs from/to the network node 1110B. In other embodiments, the hub 1114 may be a non-dedicated hub - that is, a device which is capable of operating to route communications between the UEs and network node 1110B, but which is additionally capable of operating as a communication start and/or end point for certain data channels.
Figure 18 shows a UE 1200 in accordance with some embodiments. As used herein, a UE refers to a device capable, configured, arranged and/or operable to communicate wirelessly with network nodes and/or other UEs. Examples of a UE include, but are not limited to, a smart phone, mobile phone, cell phone, voice over IP (VoIP) phone, wireless local loop phone, desktop computer, personal digital assistant (PDA), wireless cameras, gaming console or device, music storage device, playback appliance, wearable terminal device, wireless endpoint, mobile station, tablet, laptop, laptop-embedded equipment (LEE), laptop-mounted equipment (LME), smart device, wireless customer-premise equipment (CPE), vehicle, vehicle-mounted or vehicle embedded/integrated wireless device, etc. Other examples include any UE identified by the 3rd Generation Partnership Project (3GPP), including a narrow band internet of things (NB-IoT) UE, a machine type communication (MTC) UE, and/or an enhanced MTC (eMTC) UE.
A UE may support device-to-device (D2D) communication, for example by implementing a 3GPP standard for sidelink communication, Dedicated Short-Range Communication (DSRC), vehicle-to-vehicle (V2V), vehicle-to-infrastructure (V2I), or vehicle-to-everything (V2X). In other examples, a UE may not necessarily have a user in the sense of a human user who owns and/or operates the relevant device. Instead, a UE may represent a device that is intended for sale to, or operation by, a human user but which may not, or which may not initially, be associated with a specific human user (e.g., a smart sprinkler controller). Alternatively, a UE may represent a device that is not intended for sale to, or operation by, an end user but which may be associated with or operated for the benefit of a user (e.g., a smart power meter).
The UE 1200 includes processing circuitry 1202 that is operatively coupled via a bus 1204 to an input/output interface 1206, a power source 1208, a memory 1210, a communication interface 1212, and/or any other component, or any combination thereof. Certain UEs may utilize all or a subset of the components shown in Figure 18. The level of integration between the components may vary from one UE to another UE. Further, certain UEs may contain multiple instances of a component, such as multiple processors, memories, transceivers, transmitters, receivers, etc.
The processing circuitry 1202 is configured to process instructions and data and may be configured to implement any sequential state machine operative to execute instructions stored as machine-readable computer programs in the memory 1210. The processing circuitry 1202 may be implemented as one or more hardware-implemented state machines (e.g., in discrete logic, field-programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), etc.); programmable logic together with appropriate firmware; one or more stored computer programs, general-purpose processors, such as a microprocessor or digital signal processor (DSP), together with appropriate software; or any combination of the above. For example, the processing circuitry 1202 may include multiple central processing units (CPUs).
In the example, the input/output interface 1206 may be configured to provide an interface or interfaces to an input device, output device, or one or more input and/or output devices. Examples of an output device include a speaker, a sound card, a video card, a display, a monitor, a printer, an actuator, an emitter, a smartcard, another output device, or any combination thereof. An input device may allow a user to capture information into the UE 1200. Examples of an input device include a touch-sensitive or presence-sensitive display, a camera (e.g., a digital camera, a digital video camera, a web camera, etc.), a microphone, a sensor, a mouse, a trackball, a directional pad, a trackpad, a scroll wheel, a smartcard, and the like. The presence-sensitive display may include a capacitive or resistive touch sensor to sense input from a user. A sensor may be, for instance, an accelerometer, a gyroscope, a tilt sensor, a force sensor, a magnetometer, an optical sensor, a proximity sensor, a biometric sensor, etc., or any combination thereof. An output device may use the same type of interface port as an input device. For example, a Universal Serial Bus (USB) port may be used to provide an input device and an output device.
In some embodiments, the power source 1208 is structured as a battery or battery pack. Other types of power sources, such as an external power source (e.g., an electricity outlet), photovoltaic device, or power cell, may be used. The power source 1208 may further include power circuitry for delivering power from the power source 1208 itself, and/or an external power source, to the various parts of the UE 1200 via input circuitry or an interface such as an electrical power cable. Delivering power may be, for example, for charging of the power source 1208. Power circuitry may perform any formatting, converting, or other modification to the power from the power source 1208 to make the power suitable for the respective components of the UE 1200 to which power is supplied.
The memory 1210 may be or be configured to include memory such as random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic disks, optical disks, hard disks, removable cartridges, flash drives, and so forth. In one example, the memory 1210 includes one or more application programs 1214, such as an operating system, web browser application, a widget, gadget engine, or other application, and corresponding data 1216. The memory 1210 may store, for use by the UE 1200, any of a variety of various operating systems or combinations of operating systems.
The memory 1210 may be configured to include a number of physical drive units, such as redundant array of independent disks (RAID), flash memory, USB flash drive, external hard disk drive, thumb drive, pen drive, key drive, high-density digital versatile disc (HD-DVD) optical disc drive, internal hard disk drive, Blu-Ray optical disc drive, holographic digital data storage (HDDS) optical disc drive, external mini-dual in-line memory module (DIMM), synchronous dynamic random access memory (SDRAM), external micro-DIMM SDRAM, smartcard memory such as tamper resistant module in the form of a universal integrated circuit card (UICC) including one or more subscriber identity modules (SIMs), such as a USIM and/or ISIM, other memory, or any combination thereof. The UICC may for example be an embedded UICC (eUICC), integrated UICC (iUICC) or a removable UICC commonly known as ‘SIM card.’ The memory 1210 may allow the UE 1200 to access instructions, application programs and the like, stored on transitory or non-transitory memory media, to off-load data, or to upload data. An article of manufacture, such as one utilizing a communication system may be tangibly embodied as or in the memory 1210, which may be or comprise a device-readable storage medium.
The processing circuitry 1202 may be configured to communicate with an access network or other network using the communication interface 1212. The communication interface 1212 may comprise one or more communication subsystems and may include or be communicatively coupled to an antenna 1222. The communication interface 1212 may include one or more transceivers used to communicate, such as by communicating with one or more remote transceivers of another device capable of wireless communication (e.g., another UE or a network node in an access network). Each transceiver may include a transmitter 1218 and/or a receiver 1220 appropriate to provide network communications (e.g., optical, electrical, frequency allocations, and so forth). Moreover, the transmitter 1218 and receiver 1220 may be coupled to one or more antennas (e.g., antenna 1222) and may share circuit components, software or firmware, or alternatively be implemented separately.
In the illustrated embodiment, communication functions of the communication interface 1212 may include cellular communication, Wi-Fi communication, LPWAN communication, data communication, voice communication, multimedia communication, short-range communications such as Bluetooth, near-field communication, location-based communication such as the use of the global positioning system (GPS) to determine a location, another like communication function, or any combination thereof. Communications may be implemented according to one or more communication protocols and/or standards, such as IEEE 802.11, Code Division Multiplexing Access (CDMA), Wideband Code Division Multiple Access (WCDMA), GSM, LTE, New Radio (NR), UMTS, WiMax, Ethernet, transmission control protocol/internet protocol (TCP/IP), synchronous optical networking (SONET), Asynchronous Transfer Mode (ATM), QUIC, Hypertext Transfer Protocol (HTTP), and so forth.
Regardless of the type of sensor, a UE may provide an output of data captured by its sensors, through its communication interface 1212, via a wireless connection to a network node. Data captured by sensors of a UE can be communicated through a wireless connection to a network node via another UE. The output may be periodic (e.g., once every 15 minutes if it reports the sensed temperature), random (e.g., to even out the load from reporting from several sensors), in response to a triggering event (e.g., when moisture is detected an alert is sent), in response to a request (e.g., a user initiated request), or a continuous stream (e.g., a live video feed of a patient).
As another example, a UE comprises a mobile device, such as phone or tablet having a depth sensor, Augmented Reality (AR) glasses, a Mixed Reality (MR) device, a Virtual Reality (VR) device, an actuator, a motor, or a switch, related to a communication interface configured to receive wireless input from a network node via a wireless connection. In response to the received wireless input the states of the actuator, the motor, or the switch may change. For example, the UE may comprise a motor that adjusts the control surfaces or rotors of a drone in flight according to the received input or to a robotic arm performing a medical procedure according to the received input.
A UE, when in the form of an Internet of Things (IoT) device, may be a device for use in one or more application domains, these domains comprising, but not limited to, city wearable technology, extended industrial application and healthcare. Non-limiting examples of such an IoT device are a device which is or which is embedded in: a connected refrigerator or freezer, a TV, a connected lighting device, an electricity meter, a robot vacuum cleaner, a voice controlled smart speaker, a home security camera, a motion detector, a thermostat, a smoke detector, a door/window sensor, a flood/moisture sensor, an electrical door lock, a connected doorbell, an air conditioning system like a heat pump, an autonomous vehicle, a surveillance system, a weather monitoring device, a vehicle parking monitoring device, an electric vehicle charging station, a smart watch, a fitness tracker, a head-mounted display for Augmented Reality (AR) or Virtual Reality (VR), a wearable for tactile augmentation or sensory enhancement, a water sprinkler, an animal- or item-tracking device, a sensor for monitoring a plant or animal, an industrial robot, an Unmanned Aerial Vehicle (UAV), and any kind of medical device, like a heart rate monitor or a remote controlled surgical robot. A UE in the form of an IoT device comprises circuitry and/or software in dependence of the intended application of the IoT device in addition to other components as described in relation to the UE 1200 shown in Figure 18.
As yet another specific example, in an IoT scenario, a UE may represent a machine or other device that performs monitoring and/or measurements, and transmits the results of such monitoring and/or measurements to another UE and/or a network node. The UE may in this case be an M2M device, which may in a 3GPP context be referred to as an MTC device. As one particular example, the UE may implement the 3GPP NB-IoT standard. In other scenarios, a UE may represent a vehicle, such as a car, a bus, a truck, a ship and an airplane, or other equipment that is capable of monitoring and/or reporting on its operational status or other functions associated with its operation.
In practice, any number of UEs may be used together with respect to a single use case. For example, a first UE might be or be integrated in a drone and provide the drone’s speed information (obtained through a speed sensor) to a second UE that is a remote controller operating the drone. When the user makes changes from the remote controller, the first UE may adjust the throttle on the drone (e.g. by controlling an actuator) to increase or decrease the drone’s speed. The first and/or the second UE can also include more than one of the functionalities described above. For example, a UE might comprise the sensor and the actuator, and handle communication of data for both the speed sensor and the actuators.
Figure 19 shows a network node 1300 in accordance with some embodiments. As used herein, network node refers to equipment capable, configured, arranged and/or operable to communicate directly or indirectly with a UE and/or with other network nodes or equipment, in a telecommunication network. Examples of network nodes include, but are not limited to, access points (APs) (e.g., radio access points), base stations (BSs) (e.g., radio base stations, Node Bs, evolved Node Bs (eNBs) and NR NodeBs (gNBs)), O-RAN nodes or components of an O-RAN node (e.g., O-RU, O-DU, O-CU).
Base stations may be categorized based on the amount of coverage they provide (or, stated differently, their transmit power level) and so, depending on the provided amount of coverage, may be referred to as femto base stations, pico base stations, micro base stations, or macro base stations. A base station may be a relay node or a relay donor node controlling a relay. A network node may also include one or more (or all) parts of a distributed radio base station such as centralized digital units, distributed units (e.g., in an O-RAN access node) and/or remote radio units (RRUs), sometimes referred to as Remote Radio Heads (RRHs). Such remote radio units may or may not be integrated with an antenna as an antenna integrated radio. Parts of a distributed radio base station may also be referred to as nodes in a distributed antenna system (DAS).
Other examples of network nodes include multiple transmission point (multi-TRP) 5G access nodes, multi-standard radio (MSR) equipment such as MSR BSs, network controllers such as radio network controllers (RNCs) or base station controllers (BSCs), base transceiver stations (BTSs), transmission points, transmission nodes, multi-cell/multicast coordination entities (MCEs), Operation and Maintenance (O&M) nodes, Operations Support System (OSS) nodes, Self-Organizing Network (SON) nodes, positioning nodes (e.g., Evolved Serving Mobile Location Centers (E-SMLCs)), and/or Minimization of Drive Tests (MDTs).
The network node 1300 includes a processing circuitry 1302, a memory 1304, a communication interface 1306, and a power source 1308. The network node 1300 may be composed of multiple physically separate components (e.g., a NodeB component and a RNC component, or a BTS component and a BSC component, etc.), which may each have their own respective components. In certain scenarios in which the network node 1300 comprises multiple separate components (e.g., BTS and BSC components), one or more of the separate components may be shared among several network nodes. For example, a single RNC may control multiple NodeBs. In such a scenario, each unique NodeB and RNC pair, may in some instances be considered a single separate network node. In some embodiments, the network node 1300 may be configured to support multiple radio access technologies (RATs). In such embodiments, some components may be duplicated (e.g., separate memory 1304 for different RATs) and some components may be reused (e.g., a same antenna 1310 may be shared by different RATs). The network node 1300 may also include multiple sets of the various illustrated components for different wireless technologies integrated into network node 1300, for example GSM, WCDMA, LTE, NR, WiFi, Zigbee, Z-wave, LoRaWAN, Radio Frequency Identification (RFID) or Bluetooth wireless technologies. These wireless technologies may be integrated into the same or different chip or set of chips and other components within network node 1300.
The processing circuitry 1302 may comprise a combination of one or more of a microprocessor, controller, microcontroller, central processing unit, digital signal processor, application-specific integrated circuit, field programmable gate array, or any other suitable computing device, resource, or combination of hardware, software and/or encoded logic operable to provide, either alone or in conjunction with other network node 1300 components, such as the memory 1304, network node 1300 functionality.
In some embodiments, the processing circuitry 1302 includes a system on a chip (SOC). In some embodiments, the processing circuitry 1302 includes one or more of radio frequency (RF) transceiver circuitry 1312 and baseband processing circuitry 1314. In some embodiments, the radio frequency (RF) transceiver circuitry 1312 and the baseband processing circuitry 1314 may be on separate chips (or sets of chips), boards, or units, such as radio units and digital units. In alternative embodiments, part or all of RF transceiver circuitry 1312 and baseband processing circuitry 1314 may be on the same chip or set of chips, boards, or units.
The memory 1304 may comprise any form of volatile or non-volatile computer-readable memory including, without limitation, persistent storage, solid-state memory, remotely mounted memory, magnetic media, optical media, random access memory (RAM), read-only memory (ROM), mass storage media (for example, a hard disk), removable storage media (for example, a flash drive, a Compact Disk (CD) or a Digital Video Disk (DVD)), and/or any other volatile or non-volatile, non-transitory device-readable and/or computer-executable memory devices that store information, data, and/or instructions that may be used by the processing circuitry 1302. The memory 1304 may store any suitable instructions, data, or information, including a computer program, software, an application including one or more of logic, rules, code, tables, and/or other instructions capable of being executed by the processing circuitry 1302 and utilized by the network node 1300. The memory 1304 may be used to store any calculations made by the processing circuitry 1302 and/or any data received via the communication interface 1306. In some embodiments, the processing circuitry 1302 and memory 1304 are integrated.
The communication interface 1306 is used in wired or wireless communication of signaling and/or data between a network node, access network, and/or UE. As illustrated, the communication interface 1306 comprises port(s)/terminal(s) 1316 to send and receive data, for example to and from a network over a wired connection. The communication interface 1306 also includes radio front-end circuitry 1318 that may be coupled to, or in certain embodiments a part of, the antenna 1310. Radio front-end circuitry 1318 comprises filters 1320 and amplifiers 1322. The radio front-end circuitry 1318 may be connected to an antenna 1310 and processing circuitry 1302. The radio front-end circuitry may be configured to condition signals communicated between antenna 1310 and processing circuitry 1302. The radio front-end circuitry 1318 may receive digital data that is to be sent out to other network nodes or UEs via a wireless connection. The radio front-end circuitry 1318 may convert the digital data into a radio signal having the appropriate channel and bandwidth parameters using a combination of filters 1320 and/or amplifiers 1322. The radio signal may then be transmitted via the antenna 1310. Similarly, when receiving data, the antenna 1310 may collect radio signals which are then converted into digital data by the radio front-end circuitry 1318. The digital data may be passed to the processing circuitry 1302. In other embodiments, the communication interface may comprise different components and/or different combinations of components.
In certain alternative embodiments, the network node 1300 does not include separate radio front-end circuitry 1318; instead, the processing circuitry 1302 includes radio front-end circuitry and is connected to the antenna 1310. Similarly, in some embodiments, all or some of the RF transceiver circuitry 1312 is part of the communication interface 1306. In still other embodiments, the communication interface 1306 includes one or more ports or terminals 1316, the radio front-end circuitry 1318, and the RF transceiver circuitry 1312, as part of a radio unit (not shown), and the communication interface 1306 communicates with the baseband processing circuitry 1314, which is part of a digital unit (not shown).
The antenna 1310 may include one or more antennas, or antenna arrays, configured to send and/or receive wireless signals. The antenna 1310 may be coupled to the radio front-end circuitry 1318 and may be any type of antenna capable of transmitting and receiving data and/or signals wirelessly. In certain embodiments, the antenna 1310 is separate from the network node 1300 and connectable to the network node 1300 through an interface or port.
The antenna 1310, communication interface 1306, and/or the processing circuitry 1302 may be configured to perform any receiving operations and/or certain obtaining operations described herein as being performed by the network node. Any information, data and/or signals may be received from a UE, another network node and/or any other network equipment. Similarly, the antenna 1310, the communication interface 1306, and/or the processing circuitry 1302 may be configured to perform any transmitting operations described herein as being performed by the network node. Any information, data and/or signals may be transmitted to a UE, another network node and/or any other network equipment.
The power source 1308 provides power to the various components of network node 1300 in a form suitable for the respective components (e.g., at a voltage and current level needed for each respective component). The power source 1308 may further comprise, or be coupled to, power management circuitry to supply the components of the network node 1300 with power for performing the functionality described herein. For example, the network node 1300 may be connectable to an external power source (e.g., the power grid, an electricity outlet) via an input circuitry or interface such as an electrical cable, whereby the external power source supplies power to power circuitry of the power source 1308. As a further example, the power source 1308 may comprise a source of power in the form of a battery or battery pack which is connected to, or integrated in, power circuitry. The battery may provide backup power should the external power source fail.
Embodiments of the network node 1300 may include additional components beyond those shown in Figure 19 for providing certain aspects of the network node’s functionality, including any of the functionality described herein and/or any functionality necessary to support the subject matter described herein. For example, the network node 1300 may include user interface equipment to allow input of information into the network node 1300 and to allow output of information from the network node 1300. This may allow a user to perform diagnostic, maintenance, repair, and other administrative functions for the network node 1300.
Figure 20 is a block diagram of a host 1400, which may be an embodiment of the host 1116 of Figure 17, in accordance with various aspects described herein. As used herein, the host 1400 may be or comprise various combinations of hardware and/or software, including a standalone server, a blade server, a cloud-implemented server, a distributed server, a virtual machine, a container, or processing resources in a server farm. The host 1400 may provide one or more services to one or more UEs.
The host 1400 includes processing circuitry 1402 that is operatively coupled via a bus 1404 to an input/output interface 1406, a network interface 1408, a power source 1410, and a memory 1412. Other components may be included in other embodiments. Features of these components may be substantially similar to those described with respect to the devices of previous figures, such as Figures 18 and 19, such that the descriptions thereof are generally applicable to the corresponding components of host 1400.
The memory 1412 may include one or more computer programs including one or more host application programs 1414 and data 1416, which may include user data, e.g., data generated by a UE for the host 1400 or data generated by the host 1400 for a UE. Embodiments of the host 1400 may utilize only a subset or all of the components shown. The host application programs 1414 may be implemented in a container-based architecture and may provide support for video codecs (e.g., Versatile Video Coding (VVC), High Efficiency Video Coding (HEVC), Advanced Video Coding (AVC), MPEG, VP9) and audio codecs (e.g., FLAC, Advanced Audio Coding (AAC), MPEG, G.711), including transcoding for multiple different classes, types, or implementations of UEs (e.g., handsets, desktop computers, wearable display systems, heads-up display systems). The host application programs 1414 may also provide for user authentication and licensing checks and may periodically report health, routes, and content availability to a central node, such as a device in or on the edge of a core network. Accordingly, the host 1400 may select and/or indicate a different host for over-the-top services for a UE. The host application programs 1414 may support various protocols, such as the HTTP Live Streaming (HLS) protocol, Real-Time Messaging Protocol (RTMP), Real-Time Streaming Protocol (RTSP), Dynamic Adaptive Streaming over HTTP (MPEG-DASH), etc.
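For illustration only, a minimal sketch of how a host application might map UE classes to transcoding and packaging settings when serving different device types. The class names, codec/protocol labels, and the select_profile helper are assumptions made for this sketch and are not part of the disclosed host application.

```python
# Hypothetical mapping from UE class to transcoding and packaging settings.
# All class names, codec/protocol labels, and select_profile() are
# illustrative assumptions, not part of the disclosed host application.

TRANSCODE_PROFILES = {
    "handset":      {"video": "HEVC", "audio": "AAC",  "packaging": "MPEG-DASH"},
    "desktop":      {"video": "VVC",  "audio": "FLAC", "packaging": "HLS"},
    "wearable_hmd": {"video": "AVC",  "audio": "AAC",  "packaging": "RTSP"},
}

def select_profile(ue_class: str) -> dict:
    """Return the transcoding/packaging profile for a UE class,
    falling back to a conservative default for unknown classes."""
    return TRANSCODE_PROFILES.get(
        ue_class, {"video": "AVC", "audio": "AAC", "packaging": "HLS"}
    )

if __name__ == "__main__":
    print(select_profile("handset"))      # known class
    print(select_profile("set_top_box"))  # falls back to the default profile
```

A real host application would typically derive such a profile from device capability signaling rather than from a static table; the table above only illustrates the dispatch idea.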
Figure 21 is a block diagram illustrating a virtualization environment 1500 in which functions implemented by some embodiments may be virtualized. In the present context, virtualizing means creating virtual versions of apparatuses or devices which may include virtualizing hardware platforms, storage devices and networking resources. As used herein, virtualization can be applied to any device described herein, or components thereof, and relates to an implementation in which at least a portion of the functionality is implemented as one or more virtual components. Some or all of the functions described herein may be implemented as virtual components executed by one or more virtual machines (VMs) implemented in one or more virtual environments 1500 hosted by one or more of hardware nodes, such as a hardware computing device that operates as a network node, UE, core network node, or host. Further, in embodiments in which the virtual node does not require radio connectivity (e.g., a core network node or host), then the node may be entirely virtualized. In some embodiments, the virtualization environment 1500 includes components defined by the O-RAN Alliance, such as an O-Cloud environment orchestrated by a Service Management and Orchestration Framework via an O2 interface.
Applications 1502 (which may alternatively be called software instances, virtual appliances, network functions, virtual nodes, virtual network functions, etc.) are run in the virtualization environment 1500 to implement some of the features, functions, and/or benefits of some of the embodiments disclosed herein.
Hardware 1504 includes processing circuitry, memory that stores software and/or instructions executable by hardware processing circuitry, and/or other hardware devices as described herein, such as a network interface, input/output interface, and so forth. Software may be executed by the processing circuitry to instantiate one or more virtualization layers 1506 (also referred to as hypervisors or virtual machine monitors (VMMs)), provide VMs 1508A and 1508B (one or more of which may be generally referred to as VMs 1508), and/or perform any of the functions, features and/or benefits described in relation with some embodiments described herein. The virtualization layer 1506 may present a virtual operating platform that appears like networking hardware to the VMs 1508.
The VMs 1508 comprise virtual processing, virtual memory, virtual networking or interface and virtual storage, and may be run by a corresponding virtualization layer 1506. Different embodiments of the instance of a virtual appliance 1502 may be implemented on one or more of VMs 1508, and the implementations may be made in different ways. Virtualization of the hardware is in some contexts referred to as network function virtualization (NFV). NFV may be used to consolidate many network equipment types onto industry-standard high-volume server hardware, physical switches, and physical storage, which can be located in data centers and customer premises equipment.
In the context of NFV, a VM 1508 may be a software implementation of a physical machine that runs programs as if they were executing on a physical, non-virtualized machine. Each of the VMs 1508, and that part of hardware 1504 that executes that VM, be it hardware dedicated to that VM and/or hardware shared by that VM with others of the VMs, forms a separate virtual network element. Still in the context of NFV, a virtual network function is responsible for handling specific network functions that run in one or more VMs 1508 on top of the hardware 1504 and corresponds to the application 1502.
Hardware 1504 may be implemented in a standalone network node with generic or specific components. Hardware 1504 may implement some functions via virtualization. Alternatively, hardware 1504 may be part of a larger cluster of hardware (e.g., in a data center or customer premises equipment (CPE)) where many hardware nodes work together and are managed via management and orchestration 1510, which, among other things, oversees lifecycle management of applications 1502. In some embodiments, hardware 1504 is coupled to one or more radio units that each include one or more transmitters and one or more receivers that may be coupled to one or more antennas. Radio units may communicate directly with other hardware nodes via one or more appropriate network interfaces and may be used in combination with the virtual components to provide a virtual node with radio capabilities, such as a radio access node or a base station. In some embodiments, some signaling can be provided with the use of a control system 1512 which may alternatively be used for communication between hardware nodes and radio units.
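Purely as a conceptual sketch, the layering described above (hardware hosting a virtualization layer, which instantiates VMs, each running a virtual network function) can be modeled as below. The class names and methods are illustrative assumptions and do not correspond to any particular hypervisor or orchestration interface.

```python
# Toy model of the NFV layering: a hardware node hosts a virtualization
# layer, which instantiates VMs, each running a virtual network function.
# All names here are illustrative assumptions only.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class VirtualMachine:
    name: str
    vnf: Callable[[], str]  # the network function this VM runs

    def run(self) -> str:
        return f"{self.name}: {self.vnf()}"

@dataclass
class VirtualizationLayer:
    vms: List[VirtualMachine] = field(default_factory=list)

    def instantiate(self, name: str, vnf: Callable[[], str]) -> VirtualMachine:
        vm = VirtualMachine(name, vnf)
        self.vms.append(vm)
        return vm

@dataclass
class HardwareNode:
    layer: VirtualizationLayer = field(default_factory=VirtualizationLayer)

node = HardwareNode()
node.layer.instantiate("vm-a", lambda: "packet gateway function")
node.layer.instantiate("vm-b", lambda: "video packing function")
print([vm.run() for vm in node.layer.vms])
```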
Figure 22 shows a communication diagram of a host 1602 communicating via a network node 1604 with a UE 1606 over a partially wireless connection in accordance with some embodiments. Example implementations, in accordance with various embodiments, of the UE (such as a UE 1112A of Figure 17 and/or UE 1200 of Figure 18), network node (such as network node 1110A of Figure 17 and/or network node 1300 of Figure 19), and host (such as host 1116 of Figure 17 and/or host 1400 of Figure 20) discussed in the preceding paragraphs will now be described with reference to Figure 22.
Like host 1400, embodiments of host 1602 include hardware, such as a communication interface, processing circuitry, and memory. The host 1602 also includes software, which is stored in or accessible by the host 1602 and executable by the processing circuitry. The software includes a host application that may be operable to provide a service to a remote user, such as the UE 1606 connecting via an over-the-top (OTT) connection 1650 extending between the UE 1606 and host 1602. In providing the service to the remote user, a host application may provide user data which is transmitted using the OTT connection 1650.
The network node 1604 includes hardware enabling it to communicate with the host 1602 and UE 1606. The connection 1660 may be direct or pass through a core network (like core network 1106 of Figure 17) and/or one or more other intermediate networks, such as one or more public, private, or hosted networks. For example, an intermediate network may be a backbone network or the Internet.
The UE 1606 includes hardware and software, which is stored in or accessible by UE 1606 and executable by the UE’s processing circuitry. The software includes a client application, such as a web browser or operator-specific “app” that may be operable to provide a service to a human or non-human user via UE 1606 with the support of the host 1602. In the host 1602, an executing host application may communicate with the executing client application via the OTT connection 1650 terminating at the UE 1606 and host 1602. In providing the service to the user, the UE's client application may receive request data from the host's host application and provide user data in response to the request data. The OTT connection 1650 may transfer both the request data and the user data. The UE's client application may interact with the user to generate the user data that it provides to the host application through the OTT connection 1650.
The OTT connection 1650 may extend via a connection 1660 between the host 1602 and the network node 1604 and via a wireless connection 1670 between the network node 1604 and the UE 1606 to provide the connection between the host 1602 and the UE 1606. The connection 1660 and wireless connection 1670, over which the OTT connection 1650 may be provided, have been drawn abstractly to illustrate the communication between the host 1602 and the UE 1606 via the network node 1604, without explicit reference to any intermediary devices and the precise routing of messages via these devices.
As an example of transmitting data via the OTT connection 1650, in step 1608, the host 1602 provides user data, which may be performed by executing a host application. In some embodiments, the user data is associated with a particular human user interacting with the UE 1606. In other embodiments, the user data is associated with a UE 1606 that shares data with the host 1602 without explicit human interaction. In step 1610, the host 1602 initiates a transmission carrying the user data towards the UE 1606. The host 1602 may initiate the transmission responsive to a request transmitted by the UE 1606. The request may be caused by human interaction with the UE 1606 or by operation of the client application executing on the UE 1606. The transmission may pass via the network node 1604, in accordance with the teachings of the embodiments described throughout this disclosure. Accordingly, in step 1612, the network node 1604 transmits to the UE 1606 the user data that was carried in the transmission that the host 1602 initiated, in accordance with the teachings of the embodiments described throughout this disclosure. In step 1614, the UE 1606 receives the user data carried in the transmission, which may be performed by a client application executed on the UE 1606 associated with the host application executed by the host 1602.
In some examples, the UE 1606 executes a client application which provides user data to the host 1602. The user data may be provided in reaction or response to the data received from the host 1602. Accordingly, in step 1616, the UE 1606 may provide user data, which may be performed by executing the client application. In providing the user data, the client application may further consider user input received from the user via an input/output interface of the UE 1606. Regardless of the specific manner in which the user data was provided, the UE 1606 initiates, in step 1618, transmission of the user data towards the host 1602 via the network node 1604. In step 1620, in accordance with the teachings of the embodiments described throughout this disclosure, the network node 1604 receives user data from the UE 1606 and initiates transmission of the received user data towards the host 1602. In step 1622, the host 1602 receives the user data carried in the transmission initiated by the UE 1606.
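As an illustrative sketch only, the exchange of steps 1608-1622 can be outlined as a sequence of message hand-offs: the host provides user data, the network node relays it to the UE, and the UE returns user data along the reverse path. The function names and message contents below are assumptions made for the sketch, not an implementation of any specific protocol.

```python
# Illustrative outline of the OTT exchange in steps 1608-1622.
# Function names and message contents are assumptions for this sketch.

def host_provide_user_data() -> dict:          # steps 1608-1610
    return {"direction": "downlink", "payload": "encoded first picture"}

def network_node_relay(message: dict) -> dict:  # steps 1612 and 1620
    # In this sketch the network node forwards the message unchanged.
    return message

def ue_handle(message: dict) -> dict:           # steps 1614-1618
    # The client application consumes the data and produces a response.
    return {"direction": "uplink", "payload": "client feedback"}

downlink = network_node_relay(host_provide_user_data())
uplink = network_node_relay(ue_handle(downlink))
print(downlink)
print(uplink)
```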
One or more of the various embodiments improve the performance of OTT services provided to the UE 1606 using the OTT connection 1650, in which the wireless connection 1670 forms the last segment. More precisely, the teachings of these embodiments may improve low-latency processing on sender and receiver devices, and thereby provide benefits such as low-bitrate uplink transmission of depth and texture bitstreams.
In an example scenario, factory status information may be collected and analyzed by the host 1602. As another example, the host 1602 may process audio and video data which may have been retrieved from a UE for use in creating maps. As another example, the host 1602 may collect and analyze real-time data to assist in controlling vehicle congestion (e.g., controlling traffic lights). As another example, the host 1602 may store surveillance video uploaded by a UE. As another example, the host 1602 may store or control access to media content such as video, audio, VR or AR which it can broadcast, multicast or unicast to UEs. As other examples, the host 1602 may be used for energy pricing, remote control of non-time critical electrical load to balance power generation needs, location services, presentation services (such as compiling diagrams etc. from data collected from remote devices), or any other function of collecting, retrieving, storing, analyzing and/or transmitting data.
In some examples, a measurement procedure may be provided for the purpose of monitoring data rate, latency and other factors on which the one or more embodiments improve. There may further be an optional network functionality for reconfiguring the OTT connection 1650 between the host 1602 and UE 1606, in response to variations in the measurement results. The measurement procedure and/or the network functionality for reconfiguring the OTT connection may be implemented in software and hardware of the host 1602 and/or UE 1606. In some embodiments, sensors (not shown) may be deployed in or in association with other devices through which the OTT connection 1650 passes; the sensors may participate in the measurement procedure by supplying values of the monitored quantities exemplified above, or supplying values of other physical quantities from which software may compute or estimate the monitored quantities. The reconfiguring of the OTT connection 1650 may include changes to message format, retransmission settings, preferred routing, etc.; the reconfiguring need not directly alter the operation of the network node 1604. Such procedures and functionalities may be known and practiced in the art. In certain embodiments, measurements may involve proprietary UE signaling that facilitates measurements of throughput, propagation times, latency and the like by the host 1602. The measurements may be implemented by software that causes messages to be transmitted, in particular empty or ‘dummy’ messages, using the OTT connection 1650 while monitoring propagation times, errors, etc.
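A minimal sketch, assuming a simple local UDP echo path, of how propagation time could be estimated by timestamping small 'dummy' probe messages. The echo server, port number, and probe count are assumptions for the sketch; a real measurement procedure over the OTT connection may be implemented quite differently.

```python
# Minimal latency probe: send small 'dummy' datagrams to an echo endpoint
# and measure the round-trip time. The local echo server, port, and probe
# count are illustrative assumptions only.

import socket
import threading
import time

HOST, PORT = "127.0.0.1", 50007   # hypothetical echo endpoint
PROBES = 5

def echo_server(ready: threading.Event) -> None:
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as srv:
        srv.bind((HOST, PORT))
        ready.set()
        for _ in range(PROBES):
            data, addr = srv.recvfrom(64)
            srv.sendto(data, addr)

ready = threading.Event()
threading.Thread(target=echo_server, args=(ready,), daemon=True).start()
ready.wait()

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as cli:
    cli.settimeout(1.0)
    rtts = []
    for _ in range(PROBES):
        t0 = time.monotonic()
        cli.sendto(b"dummy", (HOST, PORT))
        cli.recvfrom(64)
        rtts.append((time.monotonic() - t0) * 1000.0)
    print(f"mean RTT over {PROBES} probes: {sum(rtts) / len(rtts):.3f} ms")
```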
Although the computing devices described herein (e.g., UEs, network nodes, hosts) may include the illustrated combination of hardware components, other embodiments may comprise computing devices with different combinations of components. It is to be understood that these computing devices may comprise any suitable combination of hardware and/or software needed to perform the tasks, features, functions and methods disclosed herein. Determining, calculating, obtaining or similar operations described herein may be performed by processing circuitry, which may process information by, for example, converting the obtained information into other information, comparing the obtained information or converted information to information stored in the network node, and/or performing one or more operations based on the obtained information or converted information, and as a result of said processing making a determination. Moreover, while components are depicted as single boxes located within a larger box, or nested within multiple boxes, in practice, computing devices may comprise multiple different physical components that make up a single illustrated component, and functionality may be partitioned between separate components. For example, a communication interface may be configured to include any of the components described herein, and/or the functionality of the components may be partitioned between the processing circuitry and the communication interface. In another example, non-computationally intensive functions of any of such components may be implemented in software or firmware and computationally intensive functions may be implemented in hardware.
In certain embodiments, some or all of the functionality described herein may be provided by processing circuitry executing instructions stored in memory, which in certain embodiments may be a computer program product in the form of a non-transitory computer-readable storage medium. In alternative embodiments, some or all of the functionality may be provided by the processing circuitry without executing instructions stored on a separate or discrete device-readable storage medium, such as in a hard-wired manner. In any of those particular embodiments, whether executing instructions stored on a non-transitory computer-readable storage medium or not, the processing circuitry can be configured to perform the described functionality. The benefits provided by such functionality are not limited to the processing circuitry alone or to other components of the computing device, but are enjoyed by the computing device as a whole, and/or by end users and a wireless network generally.
The present embodiments may, of course, be carried out in other ways than those specifically set forth herein without departing from characteristics described herein. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, and all changes coming within the meaning and equivalency range of the appended claims are intended to be embraced therein.

Claims

1. A method (140) for processing video data, the method comprising: obtaining (142) a depth image (30) from a depth image source, wherein the depth image comprises information associated with a distance of an object in a picture from a viewpoint; obtaining (144) a selection map (28), wherein the selection map represents the picture partitioned into a plurality of areas; and packing (152) the depth image and the selection map into a first picture (20) having a color format with a plurality of channels (22, 24, 26), wherein the depth image and the selection map are packed in first and second channels, respectively, with at least one of the first and second channels being a color channel.
2. The method of claim 1, further comprising: encoding (160) the first picture for sending to a receiving device (70, 120); and sending (162) the encoded first picture in a bitstream (44) to the receiving device over a network (42).
3. The method of any of claims 1-2, further comprising: obtaining (146) a texture image (32) from a texture image source; and spatially packing (154) the texture image into the first picture prior to the encoding.
4. The method of any of claims 1-3, wherein obtaining the selection map comprises one of: obtaining (172) the selection map from a sensor; obtaining (174) the selection map from a file in a storage device; and deriving (176) the selection map from the depth image, wherein deriving the selection map from the depth image comprises: partitioning (182) the depth image into a plurality of areas based on depth values of the depth image; and generating (184) the selection map to comprise a plurality of unique sample values, with each unique sample value corresponding to one of the plurality of areas of the depth image.
5. The method of any of claims 1-4: wherein the second channel precedes the first channel in one or both of a bitstream and processing order; or wherein the second channel follows the first channel in one or both of a bitstream and processing order.
6. The method of any of claims 1-5, wherein the plurality of channels comprises a third channel, and wherein the third channel comprises a second color channel, and wherein one or more content attributes are packed into the third channel.
7. The method of claim 6, wherein the one or more content attributes comprise information associated with one or more of: a different type of selection map; an occlusion map describing one or more occluded objects in a picture; an alpha channel describing a transparency of the picture; an ambient lighting map comprising information associated with ambient light; and material properties of a 3D object or the picture.
8. The method of any of claims 1-7, wherein the depth image and the selection map are packed into the first picture such that the depth image and the selection map are decoded and postprocessed prior to the texture image.
9. The method of any of claims 1-8, wherein the first picture comprises a plurality of subpictures, and wherein the depth image and the selection map are packed into separate components of a first subpicture and the texture image is packed into a second subpicture.
10. The method of any of claims 1-9, further comprising sending (156) information associated with the packing of one or more of the depth image, the selection map, and the texture image to the receiving device over the network, wherein the information identifies one or more of: a number and/or type of one or more of the depth image, the texture image, and the selection map in the first picture; which channels the one or more of the depth image, the texture image, and the selection map occupy in the first picture; vertical and horizontal positions of the one or more of the depth image, the texture image, and the selection map in the first picture; dimensions of the one or more of the depth image, the texture image, and the selection map in the first picture; and an orientation of the one or more of the depth image, the texture image, and the selection map in the first picture.
11. A method (190, 220), implemented by a receiving device (70, 120), for processing video data, the method comprising: obtaining (192) a first picture (20) having a color format with a plurality of channels (22, 24, 26), wherein a depth image (30) and a selection map (28) are packed into first and second channels, respectively, with at least one of the first and second channels being a color channel; extracting (202) the depth image from the first channel; and extracting (204) the selection map from the second channel.
12. The method of claim 11, wherein a texture image (32) is spatially packed into the first picture with the depth image and the selection map, and wherein the method further comprises extracting (206) the texture image from the first picture.
13. The method of any of claims 11-12, further comprising generating (208) volumetric video based on the depth image and the selection map.
14. The method of claim 13, further comprising generating the volumetric video based further on the texture image.
15. The method of any of claims 11-14, wherein the first picture is received in a first bitstream from a sending device over a network, and wherein the method further comprises decoding (196) the first picture from the first bitstream.
16. The method of any of claims 11-15, further comprising receiving (194) information associated with the packing of one or more of the depth image, the selection map, and the texture image from the sending device over the network, wherein the information identifies one or more of: a number and/or type of one or more of the depth image, the texture image, and the selection map in the first picture; which channels the one or more of the depth image, the texture image, and the selection map occupy in the first picture; vertical and horizontal positions of the one or more of the depth image, the texture image, and the selection map in the first picture; dimensions of the one or more of the depth image, the texture image, and the selection map in the first picture; and an orientation of the one or more of the depth image, the texture image, and the selection map in the first picture.
17. The method of any of the preceding claims, wherein the selection map is encoded with a lower quantization parameter (QP) and/or a higher resolution than the depth image.
18. The method of any of the preceding claims, wherein the plurality of channels comprises exactly three channels.
19. The method of any of the preceding claims, wherein the color format of the first picture is one of a Red Green Blue (RGB) color format and a YUV color format.
20. The method of any of the preceding claims, wherein the depth image and the selection map are packed into the first picture for encoding using a multi-component color format comprising a YUV format, and wherein: the depth image is signaled in the luminance component of the YUV format, and the selection map is signaled in one of the color components of the YUV format; or the selection map is signaled in the luminance component of the YUV format, and the depth image is signaled in one of the color components of the YUV format; or the selection map is signaled in a first color component of the YUV format, and the depth image is signaled in a second color component of the YUV format.
21. The method of any of the preceding claims, wherein at least two channels of the plurality of channels have different resolutions.
22. The method of any of the preceding claims, wherein spatial positions of the depth image and the selection map are collocated in the first picture such that a spatial position of the depth image corresponds to a spatial position of the selection map.
23. The method of any of the preceding claims, wherein the selection map comprises a plurality of sample map values and is one of: an occupancy map indicating which of a plurality of areas of the depth image are selected for output and/or processing at the receiving device; a binary map wherein each sample map value comprises one of two possible sample map values; a non-binary map wherein each sample map value comprises one of three or more possible sample map values; and a multi-threshold map.
24. The method of any of the preceding claims, wherein the volumetric video is one of a mesh and a point cloud.
25. The method of any of the preceding claims, wherein information about the packing of the depth image and the selection map into the first picture is signaled to the receiving device.
26. A sending device (50) for processing video data, the sending device being configured to: obtain (142) a depth image (30) from a depth image source, wherein the depth image comprises information associated with a distance of an object in a picture from a viewpoint; obtain (144) a selection map (28), wherein the selection map represents the picture partitioned into a plurality of areas; and pack (152) the depth image and the selection map into a first picture having a color format with a plurality of channels (22, 24, 26), wherein the depth image and the selection map are packed in first and second channels, respectively, with at least one of the first and second channels being a color channel.
27. The sending device of claim 26, further configured to perform the method of any one of claims 2-10 and 17-25.
28. A sending device (50) for processing video data, the sending device comprising: communication circuitry (406) for communicating with a receiving device; and processing circuitry (400) configured to: obtain (142) a depth image (30) from a depth image source, wherein the depth image comprises information associated with a distance of an object in a picture from a viewpoint; obtain (144) a selection map (28), wherein the selection map represents the picture partitioned into a plurality of areas; and pack (152) the depth image and the selection map into a first picture having a color format with a plurality of channels (22, 24, 26), wherein the depth image and the selection map are packed in first and second channels, respectively, with at least one of the first and second channels being a color channel.
29. The sending device of claim 28, wherein the processing circuitry is further configured to perform the method of any one of claims 2-10 and 17-25.
30. A computer program (404) comprising executable instructions that, when executed by a processing circuit (400) in a sending device (50), causes the sending device to perform any one of the methods of claims 1-10 and 17-25.
31. A carrier containing a computer program of claim 30, wherein the carrier is one of an electronic signal, optical signal, radio signal, or computer readable storage medium.
32. A non-transitory computer-readable storage medium (402) containing a computer program (404) comprising executable instructions that, when executed by a processing circuit (400) in a sending device (50) causes the sending device to perform any one of the methods of claims 1- 10 and 17-25.
33. A receiving device (70, 120) for processing video data, the receiving device configured to: obtain (192) a first picture (20) having a color format with a plurality of channels (22, 24, 26), wherein a depth image (30) and a selection map (28) are packed into first and second channels, respectively, with at least one of the first and second channels being a color channel; extract (202) the depth image from the first channel; and extract (204) the selection map from the second channel.
34. The receiving device of claim 33, further configured to perform the method of any one of claims 11-25.
35. A receiving device (70, 120) for processing video data, the receiving device comprising: communication circuitry (506) for communicating with one or more devices via a network (42); and processing circuitry (500) configured to: obtain (192) a first picture (20) having a color format with a plurality of channels (22, 24, 26), wherein a depth image (30) and a selection map (28) are packed into first and second channels, respectively, with at least one of the first and second channels being a color channel; extract (202) the depth image from the first channel; and extract (204) the selection map from the second channel.
36. The receiving device of claim 35, wherein the processing circuitry is further configured to perform the method of any one of claims 12-25.
37. A computer program (504) comprising executable instructions that, when executed by a processing circuit (500) in a receiving device (70, 120), causes the receiving device to perform any one of the methods of claims 11-25.
38. A carrier containing a computer program of claim 37, wherein the carrier is one of an electronic signal, optical signal, radio signal, or computer readable storage medium.
39. A non-transitory computer-readable storage medium (502) containing a computer program (504) comprising executable instructions that, when executed by a processing circuit (500) in a receiving device (70, 120) causes the receiving device to perform any one of the methods of claims 11-25.
40. A communication system (40) for processing video data, the communication system comprising: a sending device (50) and a receiving device (70, 120); wherein the sending device is configured to: obtain (142) a depth image (30) from a depth image source, wherein the depth image comprises information associated with a distance of an object in a picture from a viewpoint; obtain (144) a selection map (28), wherein the selection map represents the picture partitioned into a plurality of areas; and pack (152) the depth image and the selection map into a first picture (20) having a color format with a plurality of channels (22, 24, 26), wherein the depth image and the selection map are packed in first and second channels, respectively, with at least one of the first and second channels being a color channel; and wherein the receiving device (70, 120) is configured to: obtain (192) the first picture having the color format with the plurality of channels, wherein the depth image and the selection map are packed into the first and second channels, respectively, with the at least one of the first and second channels being the color channel; extract (202) the depth image from the first channel; and extract (204) the selection map from the second channel.
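For illustration only, and without limiting the claims, the packing recited in claims 1, 4, and 20 could be sketched as follows for a 4:4:4 YUV picture: the depth image is placed in the luminance channel and a selection map, here derived from the depth image by simple thresholding, is placed in one chroma channel. The threshold values, array shapes, and numpy representation are assumptions made solely for this sketch.

```python
# Illustrative sketch of claims 1, 4, and 20: derive a selection map from a
# depth image by thresholding, then pack depth (Y) and selection map (one
# chroma channel) into a single 3-channel 4:4:4 YUV picture. Thresholds,
# shapes, and the numpy layout are assumptions for this sketch only.

import numpy as np

def derive_selection_map(depth: np.ndarray, thresholds=(64, 160)) -> np.ndarray:
    """Partition the depth image into areas by depth value; each area gets a
    unique sample value (claim 4). Thresholds here are arbitrary examples."""
    selection = np.zeros_like(depth, dtype=np.uint8)
    for area_id, thr in enumerate(sorted(thresholds), start=1):
        selection[depth >= thr] = area_id
    return selection

def pack_first_picture(depth: np.ndarray, selection: np.ndarray) -> np.ndarray:
    """Pack depth into the luminance channel and the selection map into the
    first chroma channel of a 4:4:4 YUV picture (claim 20, first variant)."""
    h, w = depth.shape
    picture = np.zeros((3, h, w), dtype=np.uint8)
    picture[0] = depth          # Y  channel: depth image
    picture[1] = selection      # Cb channel: selection map
    # picture[2] is left free, e.g. for an occlusion or alpha map (claims 6-7).
    return picture

depth = np.random.randint(0, 256, size=(8, 8), dtype=np.uint8)  # stand-in depth image
selection = derive_selection_map(depth)
first_picture = pack_first_picture(depth, selection)
print(first_picture.shape)  # (3, 8, 8)
```

A receiving device would reverse the operation by reading the channels back out of the decoded picture (claims 11 and 33), after which the selection map indicates which areas of the depth image to use when reconstructing the volumetric video.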