AU2024200305A1 - Method, apparatus and system for encoding and decoding tensors and video - Google Patents
Method, apparatus and system for encoding and decoding tensors and video
- Publication number
- AU2024200305A1
- Authority
- AU
- Australia
- Prior art keywords
- video
- tensors
- bitstream
- codec
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/103—Selection of coding mode or of prediction mode
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/46—Embedding additional information in the video signal during the compression process
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
- H04N19/176—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/70—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/80—Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
- H04N19/82—Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation involving filtering within a prediction loop
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/85—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
METHOD, APPARATUS AND SYSTEM FOR ENCODING AND DECODING TENSORS AND VIDEO
A system and method of decoding tensors and video from a bitstream. The network second portion is a second portion of a neural network. The decoding method comprises determining a first codec for the video based on first information decoded from the bitstream and determining a second codec for the tensors based on second information decoded from the bitstream. The video is decoded according to the first codec and the tensors are decoded according to the second codec. The first information and the second information are independent of each other and the video and the tensors are associated with each other.
[Fig. 1, sheet 1/21: distributed machine task system 100, comprising a source device 110 (video source 112, NN part 1 114, video encoder 150, tensor encoder 116, box encapsulator 154, transmitter/mux 122), storage 132 (130), and a destination device 140 (receiver/demux 142, box extractor 144, video decoder 170, tensor decoder 146, NN part 2 166, task result renderer 168).]
Description
METHOD, APPARATUS AND SYSTEM FOR ENCODING AND DECODING TENSORS AND VIDEO
[0001] The present invention relates generally to digital video signal processing and, in particular, to a method, apparatus and system for encoding and decoding tensors and video from a bitstream. The tensors may be generated using a convolutional neural network. The present invention also relates to a computer program product including a computer readable medium having recorded thereon a computer program for encoding and decoding tensors and video from a bitstream.
[0002] Convolutional neural networks (CNNs) are an emerging technology addressing, among other things, use cases involving machine vision such as object detection, instance segmentation, object tracking, human pose estimation, and action recognition. Applications for CNNs can involve use of 'edge devices', with sensors and some processing capability, coupled to application servers as part of a 'cloud'. CNNs can require relatively high computational complexity, more than can typically be afforded either in computing capacity or power consumption by an edge device. Executing a CNN in a distributed manner has emerged as one solution to running leading-edge networks using limited capability edge devices without requiring all computational complexity to be incurred within cloud servers, whilst edge devices have potentially under-utilised inferencing resources. In other words, distributed processing allows legacy edge devices to still provide the capability of leading-edge CNNs by distributing processing between the edge device and other processing means, such as cloud servers. Such a distributed network architecture may be referred to as 'collaborative intelligence' (CI) and offers benefits such as re-using a partial result from a first portion of the network with several different second portions, perhaps with each portion being optimised for a different task. CI architectures introduce a need for efficient compression of tensor data for transmission over a network such as a WAN.
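The following is a minimal sketch of the kind of network split just described; the model, layer sizes and split point are illustrative assumptions, not taken from the patent.

```python
# Hypothetical split of a CNN into an edge-side first portion ('backbone')
# and a server-side second portion ('head'); the layers are placeholders.
import torch
import torch.nn as nn

backbone = nn.Sequential(                 # runs on the edge device
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU(),
)
head = nn.Sequential(                     # runs on the cloud server
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(256, 80),                   # e.g. 80 detection classes
)

frame = torch.randn(1, 3, 608, 1088)      # one video frame, batch size 1
tensor = backbone(frame)                  # intermediate tensor to be compressed
result = head(tensor)                     # task result at the destination
print(tensor.shape)                       # torch.Size([1, 256, 152, 272])
```

The intermediate `tensor` is the data a CI system would compress and transmit, allowing one backbone result to be re-used by several different heads.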
[0003] CNNs typically include many layers, such as convolution layers and fully connected layers, with data passing from one layer to the next in the form of 'tensors'. Splitting a network across different devices introduces a need to compress the intermediate multi-dimensional tensor data that passes from one layer to the next within a CNN in order to facilitate transmission over a network having bandwidth limitations or costs. Compression of such tensors may be referred to as 'feature compression' and the intermediate tensor data is often referred to as 'features' or 'feature maps'. Features or feature maps are generally a collection of two-dimensional (2D) arrays of values which, when combined into a 3D (or 4D) data structure, form a tensor, with each feature map corresponding to one 'channel' of the tensor. Intermediate tensor data represents a partially processed form of input, such as an image frame or video frame, encountered within a neural network. International Organisation for Standardisation / International Electrotechnical Commission Joint Technical Committee 1 / Subcommittee 29 / Working Group 2 (ISO/IEC JTC1/SC29/WG2), also known as the "Moving Picture Experts Group" (MPEG) Technical Requirements, is tasked with studying requirements for compression technology in various contexts and often in relation to video. WG2 'MPEG Technical Requirements' has established a 'Feature Compression for Video Coding for Machines' (FCVCM) ad-hoc group, mandated to study feature compression. The FCVCM AHG has issued a 'Call for Proposals' soliciting responses to form the basis for a standardisation project on feature compression. Previously, responses to a 'Call for Evidence' (CfE) demonstrated technology that can significantly outperform feature compression results achieved using state-of-the-art standardised technology directly applied to the tensors.
[0004] CNNs typically require weights for each of the layers to be predetermined in a training stage, where a very large amount of training data is passed through the CNN and a result determined by the network undergoing training is compared to ground truth associated with the training data. Discrepancy between the obtained and desired result is expressed as a 'loss' and measured with a 'loss function'. Using the determined loss, a process for updating network weights, such as stochastic gradient descent (SGD), is performed. Network weight update typically involves a process of back-propagation of 'gradients' that begins at the output layer of the network and proceeds backward to terminate when the input layer to the network is updated, covering intermediate, or 'hidden', layers of the network. Gradients are indicative of deltas to be applied to network weights and are themselves updated as part of the back-propagation process. The rate of weight update is set by a 'learning rate' hyperparameter, typically set to facilitate the training process in finding a global minimum in terms of loss (i.e., highest possible task performance for the network architecture and training data) while avoiding the training process becoming 'stuck' in a local minimum. Becoming stuck in a local minimum corresponds to obtaining sub-optimal task performance for the network architecture and being incapable of finding new weight values that could lead to higher task performance. Network weights are repeatedly updated by supplying input data and ground truth data organised into 'batches' to iteratively refine the network performance until further improvement in accuracy is no longer achievable. An iteration through the entire training dataset forms one 'epoch' of training, and training typically requires performing multiple epochs to achieve a high level of performance for the task. Weights for a trained network are then available for deployment, and the network operates in a mode where weights are fixed and gradients for weight update are omitted. The process of executing a pretrained CNN with an input and progressively transforming the input into an output according to a topology of the CNN is commonly referred to as 'inferencing'.
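A minimal sketch of this training loop follows, assuming PyTorch; the model, data, batch size and learning rate are illustrative placeholders.

```python
# Hypothetical SGD training loop: loss function, back-propagation of
# gradients, weight update, epochs, then fixed-weight inferencing.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.Flatten(),
                      nn.Linear(16 * 32 * 32, 10))
loss_fn = nn.CrossEntropyLoss()                           # the 'loss function'
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)  # the 'learning rate'

for epoch in range(5):                         # one pass over the data = one 'epoch'
    for images, labels in [(torch.randn(8, 3, 32, 32),    # a 'batch' of 8
                            torch.randint(0, 10, (8,)))]:
        loss = loss_fn(model(images), labels)  # compare against ground truth
        optimiser.zero_grad()
        loss.backward()                        # back-propagate gradients
        optimiser.step()                       # apply deltas to the weights

model.eval()                                   # deployment: weights fixed
with torch.no_grad():                          # gradients omitted ('inferencing')
    prediction = model(torch.randn(1, 3, 32, 32))
```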
[0005] Generally, a tensor has four dimensions, namely: batch, channels, height and width. The first dimension, 'batch', is typically of size one when inferencing on video data and indicates that one frame is passed through a CNN as one batch. When training a network, the value of the batch dimension may be increased so that multiple frames are passed through the network in each batch before the network weights are updated, according to a predetermined 'batch size'. A multi-frame video may be passed through as a single tensor with the batch dimension increased in size according to the number of frames of a given video. However, for practical considerations relating to memory consumption and access, inferencing on video data is typically performed on a frame-wise basis. The 'channels' dimension indicates the number of concurrent 'feature maps' for a given tensor and the height and width dimensions indicate the size of the feature maps at the particular stage of the CNN. Channel count varies through the layers of a CNN according to the network architecture. Feature map size also varies, depending on subsampling or upsampling occurring in specific network layers.
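For illustration, the four dimensions can be seen in the [1, 256, 76, 136] tensor example used later in this description (PyTorch layout assumed):

```python
import torch

tensor = torch.zeros(1, 256, 76, 136)   # [batch, channels, height, width]
batch, channels, height, width = tensor.shape
# batch = 1: one frame passed through the CNN as one batch (inferencing)
# channels = 256: number of concurrent feature maps at this CNN stage
# height, width = 76, 136: size of each feature map at this stage
training_batch = torch.zeros(8, 256, 76, 136)   # batch size 8 during training
```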
[0006] The overall complexity of the CNN tends to be relatively high, with relatively large numbers of multiply-accumulate (MAC) operations being performed and numerous intermediate tensors being written to and read from memory, along with reading weights for performance of each layer of the CNN. As such, dividing a neural network into portions allows implementation of more complex networks even in systems containing less capable edge devices, without requiring cloud servers to bear the full burden of performing the network.
[0007] Feature compression may benefit from existing video compression standards, such as Versatile Video Coding (VVC), developed by the Joint Video Experts Team (JVET). VVC is anticipated to address ongoing demand for ever-higher compression performance, especially as video formats increase in capability (for example, with higher resolution and higher frame rate), and to address increasing market demand for service delivery over WANs, where bandwidth costs are relatively high. VVC is implementable in contemporary silicon processes and offers an acceptable trade-off between achieved performance and implementation cost. The implementation cost may be considered, for example, in terms of one or more of silicon area, CPU processor load, memory utilisation and bandwidth. Other video compression standards, such as High Efficiency Video Coding (HEVC) or AV-1, may also be used for feature compression applications.
[0008] Video data includes a sequence of frames of image data, each frame including one or more colour channels. Where feature map data is to be represented in a packed frame, generally a monochrome frame having luminance only and no chroma channels is adequate. In the context of block-based coding and the frame format description, a 'sample' is a single value, as would be obtained from one cell in an imaging sensor. When only luma samples are present, the resulting monochrome frames are said to use a "4:0:0 chroma format".
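One possible packing is sketched below, assuming NumPy and a simple raster tiling order; the patent does not mandate this particular arrangement.

```python
# Hypothetical sketch: tile a tensor's feature maps into one monochrome
# (4:0:0) frame, one feature map per tile in raster order.
import numpy as np

def pack_tensor(tensor, cols):
    """Tile [channels, h, w] feature maps into a single 2-D luma plane."""
    channels, h, w = tensor.shape
    rows = (channels + cols - 1) // cols
    frame = np.zeros((rows * h, cols * w), dtype=tensor.dtype)
    for c in range(channels):
        r, k = divmod(c, cols)
        frame[r * h:(r + 1) * h, k * w:(k + 1) * w] = tensor[c]
    return frame

tensor = np.random.rand(256, 76, 136).astype(np.float32)
frame = pack_tensor(tensor, cols=16)      # a 16 x 16 grid of feature maps
print(frame.shape)                        # (1216, 2176)
```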
[0009] The VVC standard specifies a 'block based' architecture, in which frames are firstly divided into an array of square regions known as 'coding tree units' (CTUs). In VVC, CTUs generally occupy 128×128 luma samples. Other possible CTU sizes when using the VVC standard are 32×32 and 64×64. However, CTUs at the right and bottom edge of each frame may be smaller in area, with implicit splitting occurring to ensure coding blocks remain in the frame. Associated with each CTU is a 'coding tree' defining a decomposition of the area of the CTU into a set of blocks, also referred to as 'coding units' (CUs). Blocks applicable to only the luma channel or only the chroma channels are referred to as 'coding blocks' (CBs). A prediction of the contents of a coding block is held in a 'prediction block' (PB) or 'prediction unit' (PU) and a residual block defining an array of sample values to be additively combined with the PB or PU is referred to as a 'transform block' (TB) or 'transform unit' (TU), owing to the typical use of a transformation process in the generation of the TB or TU.
[00010] Notwithstanding the above distinction between 'units' and 'blocks', the term 'block' may be used as a general term to refer to areas or regions of a frame for which operations are applied to all colour channels.
[00011] For each CU, a prediction of the contents (sample values) of the corresponding area of frame data is generated (a 'prediction unit' or PU). Further, a representation of the difference (or 'spatial domain' residual) between the prediction and the contents of the area as seen at input to the encoder is formed. The difference in each colour channel may be transformed and coded as a sequence of residual coefficients, forming one or more TUs for a given CU. The applied transform may be a Discrete Cosine Transform (DCT) or other transform, applied to each block of residual values. The transform is applied separably (i.e., the two-dimensional transform is performed in two passes, one horizontally and one vertically). The block is firstly transformed by applying a one-dimensional transform to each row of samples in the block. Then, the partial result is transformed by applying a one-dimensional transform to each column of the partial result to produce a final block of transform coefficients that substantially decorrelates the residual samples. Transforms of various sizes are supported by the VVC standard, including transforms of rectangular-shaped blocks, with each side dimension being a power of two. Transform coefficients are quantised for entropy encoding into a bitstream.
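A sketch of the separable two-pass transform follows, using SciPy's floating-point DCT as a stand-in for a codec's integer transform and an illustrative quantisation divisor.

```python
# Separable 2-D transform: a 1-D DCT over each row, then over each column
# of the partial result. The divisor of 4.0 is an arbitrary placeholder.
import numpy as np
from scipy.fft import dct

def separable_dct2(block):
    """Apply a 1-D DCT horizontally, then vertically."""
    rows_done = dct(block, axis=1, norm='ortho')   # horizontal pass
    return dct(rows_done, axis=0, norm='ortho')    # vertical pass

residual = np.random.randn(8, 16)      # rectangular block, power-of-two sides
coeffs = separable_dct2(residual)      # decorrelated transform coefficients
quantised = np.round(coeffs / 4.0)     # quantised for entropy encoding
```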
[00012] PBs or PUs in VVC may be generated using either an intra-frame prediction or an inter-frame prediction process. Intra-frame prediction involves the use of previously processed samples in a frame being used to generate a prediction of a current block of data samples in the frame. Inter-frame prediction involves generating a prediction of a current block of samples in a frame using a block of samples obtained from one or two previously decoded frames. The block of samples obtained from a previously decoded frame is offset from the spatial location of the current block according to a motion vector, which often has filtering applied. Intra-frame prediction blocks can be (i) a uniform sample value ("DC intra prediction"), (ii) a plane having an offset and horizontal and vertical gradient ("planar intra prediction"), (iii) a population of the block with neighbouring samples applied in a particular direction ("angular intra prediction") or (iv) the result of a matrix multiplication using neighbouring samples and selected matrix coefficients.
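As an illustration of mode (i), a simplified sketch of DC intra prediction follows; it fills the block with the average of its neighbours and is not the exact VVC derivation.

```python
# Simplified DC intra prediction: the block becomes a uniform value equal
# to the mean of the previously processed neighbouring samples.
import numpy as np

def dc_intra_predict(above, left, height, width):
    """Predict a block as the uniform average of reconstructed neighbours."""
    dc = int(round(np.concatenate([above, left]).mean()))
    return np.full((height, width), dc, dtype=np.int32)

above = np.array([100, 102, 104, 106])     # row of samples above the block
left = np.array([98, 99, 101, 103])        # column of samples to the left
prediction = dc_intra_predict(above, left, 4, 4)   # uniform 4x4 PB
```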
[00013] VVC may be used to compress intermediate feature maps from a first portion (a 'backbone') of a neural network separated into two portions. In compression, the feature maps from the backbone are arranged into a frame and quantised from a floating-point domain to a sample domain suitable for compression as video data. CNNs represent a quickly evolving technology, and the CNNs used for machine vision tasks can become updated with different versions or replaced with new CNN architectures as technology develops. A bitstream containing tensors from a backbone (a CNN first portion) may be decoded and supplied to one or more network heads (CNN second portions) out of many available network heads.
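The quantisation from the floating-point domain to the sample domain can be illustrated with a simple linear mapping; the 10-bit depth and min/max scaling below are assumptions, not a method defined by the patent.

```python
# Hypothetical linear quantisation of floating-point feature maps to the
# integer sample domain used by a video codec.
import numpy as np

def quantise_to_samples(feature_maps, bit_depth=10):
    """Map floats to [0, 2**bit_depth - 1] integer samples; return scale info."""
    lo, hi = feature_maps.min(), feature_maps.max()
    scale = (2 ** bit_depth - 1) / (hi - lo)
    samples = np.round((feature_maps - lo) * scale).astype(np.uint16)
    return samples, lo, scale          # lo and scale are needed to dequantise

maps = np.random.randn(256, 76, 136).astype(np.float32)
samples, offset, scale = quantise_to_samples(maps)
reconstructed = samples / scale + offset   # inverse mapping at the decoder
```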
[00014] VVC bitstreams contain a number of Network Abstraction Layer (NAL) units. NAL units defining the properties of the video data, such as resolution, chroma format, and quantisation information, are stored in non-Video Coding Layer (VCL) NAL units such as a 'Sequence Parameter Set' (SPS) and a 'Picture Parameter Set' (PPS). Coded slices, that is, runs of coding tree units, are stored in VCL NAL units, such as intra slices, P slices, or B slices.
[00015] Machine vision tasks performed by CNNs can provide outputs based on the particular head network used. Results can relate to matters such as bounding boxes, segmentation markers and the like. In some instances, a human may wish to review or interpret a CNN result. A need exists to assist human interpretation of CNNs in some use cases. Where a bitstream is to contain both compressed video data and compressed feature data, a need exists to define a storage format capable of holding the video and feature data. Further, flexibility is required to allow usability as encoding trends change.
[00016] It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.
[00017] One aspect of the present disclosure provides a method of decoding tensors and video from a bitstream, the tensors being related to the video, the method comprising: determining a first codec for the video based on first information decoded from the bitstream; determining a second codec for the tensors based on second information decoded from the bitstream; decoding the video according to the first codec; decoding the tensors according to the second codec; and wherein the first information and the second information are independent of each other and the video and the tensors are associated with each other.
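A minimal sketch of this decoding method follows. The codec identifiers, payloads and decoder stubs are hypothetical placeholders; the patent does not define specific values for the first and second information.

```python
# Hypothetical dispatch: the first and second information independently
# select decoders for the video and the tensors of one presentation.
def decode_vvc(payload): return f"video decoded with VVC ({len(payload)} bytes)"
def decode_hevc(payload): return f"video decoded with HEVC ({len(payload)} bytes)"
def decode_features(payload): return f"tensors decoded ({len(payload)} bytes)"

VIDEO_CODECS = {"vvc1": decode_vvc, "hev1": decode_hevc}   # first information
TENSOR_CODECS = {"fcvm": decode_features}                  # second information

def decode_presentation(first_info, second_info, video_payload, tensor_payload):
    # The two pieces of information are independent: either codec may change
    # without affecting the other, while the decoded video and tensors
    # remain associated within the same presentation.
    video_decoder = VIDEO_CODECS[first_info]      # determine the first codec
    tensor_decoder = TENSOR_CODECS[second_info]   # determine the second codec
    return video_decoder(video_payload), tensor_decoder(tensor_payload)

video, tensors = decode_presentation("vvc1", "fcvm", b"\x00" * 100, b"\x00" * 50)
```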
[00018] Another aspect of the present disclosure provides a method of encoding tensors and video into a bitstream, the tensors being related to the video, the method comprising: encoding first information into the bitstream, the first information being used to determine a first codec for the video; encoding second information into the bitstream, the second information being used to determine a second codec for the tensors; encoding the video into the bitstream according to the first codec; encoding the tensors into the bitstream according to the second codec; and wherein the first information and the second information are independent of each other and the video and the tensors are associated with each other.
[00019] Another aspect of the present disclosure provides a decoder for decoding tensors and video from a bitstream, the tensors being related to the video, the decoder configured for: determining a first codec for the video based on first information decoded from the bitstream; determining a second codec for the tensors based on second information decoded from the bitstream; decoding the video according to the first codec; decoding the tensors according to the second codec; and wherein the first information and the second information are independent of each other and the video and the tensors are associated with each other.
[00020] Another aspect of the present disclosure provides a non-transitory computer-readable storage medium which stores a program for executing a method of decoding tensors and video from a bitstream, the tensors being related to the video, the method comprising: determining a first codec for the video based on first information decoded from the bitstream; determining a second codec for the tensors based on second information decoded from the bitstream; decoding the video according to the first codec; decoding the tensors according to the second codec; and wherein the first information and the second information are independent of each other and the video and the tensors are associated with each other.
[00021] Another aspect of the present disclosure provides an encoder for encoding tensors and video into a bitstream, the tensors being related to the video, the encoder configured for: encoding first information into the bitstream, the first information being used to determine a first codec for the video; encoding second information into the bitstream, the second information being used to determine a second codec for the tensors; encoding the video into the bitstream according to the first codec; encoding the tensors into the bitstream according to the second codec; and wherein the first information and the second information are independent of each other and the video and the tensors are associated with each other.
[00022] Another aspect of the present disclosure provides a non-transitory computer-readable storage medium which stores a program for executing a method of encoding tensors and video into a bitstream, the tensors being related to the video, the method comprising: encoding first information into the bitstream, the first information being used to determine a first codec for the video; encoding second information into the bitstream, the second information being used to determine a second codec for the tensors; encoding the video into the bitstream according to the first codec; encoding the tensors into the bitstream according to the second codec; and wherein the first information and the second information are independent of each other and the video and the tensors are associated with each other.
[00023] Other aspects are also disclosed.
BRIEF DESCRIPTION OF THE DRAWINGS
[00024] At least one embodiment of the present invention will now be described with reference to the following drawings and an appendix, in which:
[00025] Fig. 1 is a schematic block diagram showing a distributed machine task system;
[00026] Figs. 2A and 2B form a schematic block diagram of a general-purpose computer system upon which the distributed machine task system of Fig. 1 may be practiced;
[00027] Fig. 3A is a schematic block diagram showing functional modules of a backbone portion of a CNN;
[00028] Fig. 3B is a schematic block diagram showing a residual block of Fig. 3A;
[00029] Fig. 3C is a schematic block diagram showing a residual unit of Fig. 3A;
[00030] Fig. 3D is a schematic block diagram showing a CBL module of Fig. 3A;
[00031] Fig. 4 is a schematic block diagram showing functional modules of an alternative backbone portion of a CNN;
[00032] Fig. 5 is a schematic block diagram of a tensor encoder using a configurable tensor compressor stage;
[00033] Fig. 6 is a schematic block diagram showing a multi-scale feature fusion stage for a tensor compressor;
[00034] Fig. 7 is a schematic block diagram showing an inter-channel decorrelation-based tensor compressor;
[00035] Fig. 8 is a schematic block diagram showing functional modules of a video encoder;
[00036] Figs. 9A & 9B are schematic block diagrams showing an arrangement of regions or subpictures for holding compressed tensor data;
[00037] Figs. 10A & 10B are schematic block diagrams showing another arrangement of regions or subpictures for holding compressed tensor data;
[00038] Fig. 11A is a schematic block diagram showing a bitstream holding encoded inter-channel decorrelated feature maps, compressed video, and associated metadata;
[00039] Fig. 11B is a schematic block diagram showing a hierarchical arrangement of 'boxes' resulting in a presentation which encapsulates a video stream and a feature stream;
[00040] Fig. 12 is a schematic block diagram showing a tensor decoder with a configurable tensor decompressor;
[00041] Fig. 13 is a schematic block diagram showing functional modules of a video decoder;
[00042] Fig. 14 is a schematic block diagram showing an inter-channel decorrelation-based tensor decoder as part of a distributed machine task system;
[00043] Fig. 15 is a schematic block diagram showing an embodiment of a multi-scale feature reconstruction stage;
[00044] Fig. 16A is a schematic block diagram showing a head portion of a CNN;
[00045] Fig. 16B is a schematic block diagram showing an upscaler module of Fig. 16A;
[00046] Fig. 16C is a schematic block diagram showing a detection module of Fig. 16A;
[00047] Fig. 17 is a schematic block diagram showing an alternative head portion of a CNN;
[00048] Fig. 18 shows a method for performing a first portion of a CNN, compressing the resulting tensors, and signalling an indication of the first portion of the CNN; and
[00049] Fig. 19 shows a method for decoding a bitstream, reconstructing tensors and completing performance of a task according to network topology and network weights indications.
DETAILED DESCRIPTION INCLUDING BEST MODE
[00050] Where reference is made in any one or more of the accompanying drawings to steps and/or features which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.
[00051] A distributed machine task system may include an edge device, such as a network camera or smartphone, producing intermediate compressed data. The distributed machine task system may also include a final device, such as a server farm based ('cloud') application, operating on the intermediate compressed data to produce a task result. Additionally, the edge device functionality may be embodied in the cloud and the intermediate compressed data may be stored for later processing, potentially for multiple different tasks depending on need. A distributed task system may perform a task in a general form in an edge device and provide partially processed features for performance of the task in a specific form at the server side. For example, a general object detection network may be performed in the client (edge) device and the result used in the server to conditionally perform more specific detection networks using, for example, partially processed tensors produced by a first portion of the CNN that is common to the general task network and the specific task network. Such a first portion may be referred to as a 'shared backbone' as it is common both in network topology and network weights to multiple task networks.
[00052] A convenient form of intermediate compressed data is a compressed video bitstream, owing to the availability of high-performing compression standards and implementations thereof. Video compression standards typically operate on integer samples of some given bit depth, such as 10 bits, arranged in planar arrays. Colour video has three planar arrays, corresponding, for example, to colour components Y, Cb, Cr, or R, G, B, depending on application. CNNs typically operate on floating point data in the form of tensors. Tensors generally have a relatively smaller spatial dimensionality compared to the incoming video data upon which the CNN operates, while having more channels than the three channels typical of colour video data, for example 128, 256, or 512 channels.
[00053] Tensors typically have the following dimensions: frames, channels, height, and width. For example, a tensor of dimensions [1, 256, 76, 136] would be said to contain floating-point or integer values for one frame comprising an array of two-hundred and fifty-six (256) feature maps (channels), each of size 136×76. For video data, inferencing is typically performed one frame at a time (frame value of 1), rather than using tensors containing multiple frames.
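By way of illustration only, the following Python sketch (using NumPy purely as a stand-in for whichever tensor library a given CNN uses) constructs a tensor with the [frames, channels, height, width] dimensions described above and reads back the per-dimension sizes:

    import numpy as np

    # One frame, 256 feature maps (channels), each 136 wide by 76 high,
    # following the [frames, channels, height, width] convention.
    tensor = np.zeros((1, 256, 76, 136), dtype=np.float32)

    frames, channels, height, width = tensor.shape
    print(frames, channels, height, width)   # prints: 1 256 76 136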
[00054] VVC supports a division of a picture into multiple subpictures, each of which may be independently encoded and independently decoded. In one approach, each subpicture is coded as one 'slice', or contiguous sequence of coded CTUs. A 'tile' mechanism is also available to divide a picture into a number of independently decodeable regions. Subpictures may be specified in a somewhat flexible manner, with various rectangular sets of CTUs coded as respective subpictures. Flexible definition of subpicture dimensions allows efficiently holding types of data requiring different areas in one picture, avoiding large 'unused' areas, i.e., areas of a frame that are not used for reconstruction of tensor data.
[00055] A file or stream may be encapsulated in a file format such as ISO Base Media File Format (ISOBMFF), standardised as ISO/IEC 14496-12, with ISO/IEC 14496-15 providing an extension related to the carriage of NAL-unit structured video in ISOBMFF. Where a bitstream is to contain both compressed video data and compressed feature data, a need exists to define a storage format capable of holding the video and the feature data. ISOBMFF defines a hierarchical set of 'boxes'. Each box begins with a 32-bit type identifier, typically represented as four 8-bit (i.e., ASCII or UTF-8) codes, followed by a length field indicating the total length of the box including the associated payload, and then the variable-length payload. By virtue of the hierarchical nature of ISOBMFF, boxes may be declared within other boxes. Box types are defined in an object-oriented manner, enabling one box type to be derived from (or to 'extend') another box type. Where a box type extends another box type, the child box includes all defined attributes of all ancestor box types, analogous to a class definition in an object-oriented language such as C++. A collection of boxes defining one or more motion sequences, possibly including audio, forms a 'presentation', which may be embodied in a file or as streamed data. In the terminology of ISOBMFF, all data associated with a particular time of presentation is referred to as a 'sample'; this includes a video frame. A presentation includes a sequence of samples, each associated with a different time, enabling delivery of media content to a user according to the indicated times and avoiding, for example, jitter in the delivery of each decoded frame (or sample) to the viewer.
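As an illustrative sketch only, a box structure of this kind may be traversed as follows in Python; the header layout used here (a 32-bit big-endian size followed by a four-character type, with size == 1 signalling a 64-bit 'largesize' and size == 0 signalling 'to end of file') follows ISO/IEC 14496-12, and the input file name is hypothetical:

    import struct

    def walk_boxes(buf, offset=0, end=None):
        # Yield (type, payload_offset, payload_size) for each box at one level.
        end = len(buf) if end is None else end
        while offset + 8 <= end:
            size, box_type = struct.unpack_from('>I4s', buf, offset)
            header = 8
            if size == 1:                  # 64-bit 'largesize' follows the type
                size, = struct.unpack_from('>Q', buf, offset + 8)
                header = 16
            elif size == 0:                # box extends to the end of the file
                size = end - offset
            yield box_type.decode('ascii', 'replace'), offset + header, size - header
            offset += size

    with open('presentation.mp4', 'rb') as f:   # hypothetical input file
        data = f.read()
    for box_type, payload_off, payload_len in walk_boxes(data):
        print(box_type, payload_len)            # e.g. ftyp, moov, mdat, ...

Because boxes nest, the payload range of a container box may itself be passed back into walk_boxes to enumerate its child boxes.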
[00056] Fig. 1 is a schematic block diagram showing functional modules of a distributed machine task system 100, capable of performing a machine task network in a distributed manner. The division of a particular neural network into two portions requires specifying a 'split point' in the network. Layers in the network from the input layer up to the split point are performed in a first device (or 'source device') and the resulting intermediate tensor(s) are compressed. Layers up to the split point may be referred to as the 'backbone'; however, this term may sometimes imply a specific split point. An alternative term, 'NN part 1' (neural network part one, also referred to as a first network portion), carries no implication on the location of the split point in the network. Layers from the split point to the last layer may be referred to as the 'head' and, for the avoidance of any implied split point location, may alternatively be referred to as 'NN part 2' (neural network part 2, also referred to as a second network portion). A first NN part 2 (also referred to as a proxy NN part 2) performs the remaining layers of the network to produce an initial task result, to be included in the bitstream for subsequent usage. Based on the result obtained from the network, layers from the split point up to the last layer in an alternative network are performed in a second device (or 'destination device') using decompressed tensor(s) from the first device as input to the layer(s) immediately following the split point. The alternative network has a common NN part 1 with the first network, and thus is able to produce a task result 167 by performing NN part 2 using the tensors produced by NN part 1 of the first network.
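The split-point arrangement can be sketched with any layered model. The following Python fragment (assuming PyTorch, with a hypothetical list of layers standing in for a real task network) performs NN part 1 on the source side and NN part 2 on the destination side, the intermediate tensor standing in for the compressed data conveyed between the devices:

    import torch
    import torch.nn as nn

    # Hypothetical overall network expressed as an ordered list of layers.
    layers = [nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
              nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
              nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU()]

    split_point = 4
    nn_part_1 = nn.Sequential(*layers[:split_point])   # source device
    nn_part_2 = nn.Sequential(*layers[split_point:])   # destination device

    frame = torch.randn(1, 3, 608, 1088)       # stand-in for frame data
    tensors = nn_part_1(frame)                 # intermediate tensor(s) to compress
    result = nn_part_2(tensors)                # completes the task network
    assert torch.allclose(result, nn.Sequential(*layers)(frame))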
[00057] At the split point there may be one or more tensors that need to be compressed for conveyance over a communication channel with limited bandwidth compared to the bandwidth requirement for transmission of uncompressed tensors. Where a 'feature pyramid network' (FPN) is in use, it is common for layers in the FPN to be related in width and height such that a given layer is half the width and height of an adjacent layer among the layers. FPN architectures may also involve the width and height halving alternately from one layer to the next layer. In some architectures, multiple tensors of the same width and height are produced within the FPN. An FPN may occur relatively early in the neural network topology, resulting in a necessity for the split point to occur within the FPN in order for a useful division of the network workload across the edge device and the cloud to be achieved. When a split occurs within the FPN of the machine task network, performance of a variety of machine task networks where layers up to the split point are common among the machine task networks ('shared backbone' architecture) may be achieved. Where a split point occurs within the FPN, tensor compression methods may exploit redundancies across the FPN layers to improve compression performance. Compression methods applicable to the various network topologies used in contemporary CNNs are therefore beneficial for application in a wide range of scenarios.
[00058] The system 100 may be used for implementing methods for decorrelating, packing and quantising feature maps into planar frames for encoding, and for decoding feature maps from encoded data, for various neural networks. Various neural networks may be split at different points and may result in intermediate tensors of various number and dimensionality. The multitude of possible neural networks and split points creates a need for the destination device 140 to determine compatibility with a repository of available NN part 2 implementations, available in an NN part 2 repository 160. Enumerating potential neural networks is made difficult by the need to maintain a centralised and agreed repository of network names. Furthermore, reference implementations may be available only in transient form, such as public software repositories, which may change as updates are committed. An additional complication results from the machine learning community having failed to agree on the naming of newly created networks, as seen in the 'YOLO' lineage of networks, where multiple different researchers may claim to have published the next YOLO version in the lineage. Open Neural Network Exchange (ONNX) is one format providing a device-independent way of specifying the weights and topology of a neural network. Stripped of weights, the ONNX topology provides a relatively lightweight complete representation of a neural network topology. Signalling an entire topology to identify a network may not scale as network topologies grow and, besides, a more summarised indication of a network topology would be sufficient to uniquely identify a network without needing to reference any potentially unstable public resources. Even referencing a paper is typically insufficient, as the paper will reference a public software repository (such as one hosted on github.com). Thus, to enable interoperability between different source devices and destination devices, a signalling means is needed that enables unambiguous compatibility determination between an NN part 1 and one or more NN part 2 implementations. In particular, a mechanism that does not rely upon a centralised enumeration, such as a registration authority, is preferable, as establishing, maintaining, and promoting use of such a registration authority is typically burdensome and an obstacle to industry adoption. Instead, a signalling means that is more concise than sending entire network weights and/or topology is an alternative that can be deployed more readily whilst avoiding the need for an externally maintained and recognised authoritative registry of network weights and/or topology.
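One registry-free possibility, sketched purely as an illustration and not as the signalling defined by this disclosure, is to hash a canonical serialisation of the weight-stripped topology so that devices can compare short digests without consulting any central authority. The topology_digest helper and the toy layer descriptions below are hypothetical:

    import hashlib
    import json

    def topology_digest(layer_descriptions):
        # SHA-256 over a canonical JSON serialisation of the
        # weight-stripped topology (hypothetical scheme).
        canonical = json.dumps(layer_descriptions, sort_keys=True,
                               separators=(',', ':')).encode('utf-8')
        return hashlib.sha256(canonical).hexdigest()

    # Weight-stripped description of a toy topology, e.g. derived from ONNX.
    nn_part_1_topology = [
        {'op': 'Conv', 'in': 3, 'out': 64, 'kernel': 7, 'stride': 2},
        {'op': 'MaxPool', 'kernel': 3, 'stride': 2},
    ]
    digest = topology_digest(nn_part_1_topology)
    # A destination device would match this digest against digests of the
    # entries in its NN part 2 repository to determine compatibility.
    print(digest[:16])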
[00059] The system 100 may, in some implementations, convey both intermediate compressed features from an NN part 1 and compressed video from a video source, from a source device to a destination device. Including both video and compressed features enables use-cases like overlaying the task result, such as bounding boxes of specific objects, onto video for review by a human. Another use case is for the task result of the source device NN part 2 or of the destination device NN part 2 to trigger long-term storage of video that may be of future interest. In one example, video may be retained only when any person is detected, reducing storage costs compared to storage of all received video. When a specific person of interest needs to be identified, an NN part 2 may be performed that is trained to detect that specific person of interest. The destination device may reprocess the NN part 1 tensors corresponding to frames containing at least one detected person using the customised NN part 2 to further filter the video data to the most relevant clips, which may then be reviewed by a human. In other implementations, video may not be retained unless a request for the video has been made.
[00060] The system 100 includes a source device 110. The source device 110 includes a video source 112 for generating unencoded frame data 113. The frame data 113 is passed to NN part 1 114 to produce tensors 115 and to a video encoder 150 to produce a video layer bitstream 151. The tensors 115 are passed to a tensor encoder 116, which produces a feature layer bitstream 121. The feature layer bitstream 121 is passed to a box encapsulator 154, which outputs a box-encapsulated bitstream 155. The bitstream 155 is passed to a transmitter 122, which outputs a transmitted bitstream 123. The box-encapsulated bitstream 155 includes a track containing video frames, obtained from the video-layer bitstream 151, and another track containing feature frames, obtained from the feature-layer bitstream 121. The packed features from the tensors 115 from the NN part 1 114 form a highly transformed version of the frame data 113. The resolution of the feature frames is dependent on the number of tensors in the tensors 115 and the channel count and resolution of each of the tensors 115, which is not related (or only indirectly related) to the resolution of frames in the frame data 113. Moreover, there is no simple spatial relationship between the abstract feature map data and the samples in the frame data 113.
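To give a feel for why the feature-frame resolution is decoupled from the video resolution, suppose (purely as an assumed packing scheme, not the one defined by this disclosure) that the channels of a tensor are tiled into a near-square planar mosaic; the packed frame dimensions then follow from the channel count and per-channel size rather than from the input frame:

    import math

    def packed_frame_size(channels, height, width):
        # Tile `channels` feature maps of size width x height into a
        # near-square mosaic; returns (packed_width, packed_height).
        cols = math.ceil(math.sqrt(channels))
        rows = math.ceil(channels / cols)
        return cols * width, rows * height

    # A [1, 256, 76, 136] tensor packs its 256 maps into a 16x16 mosaic:
    print(packed_frame_size(256, 76, 136))   # prints: (2176, 1216)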
[00061] The system 100 also includes a destination device 140 for decoding tensor data in the form of the received bitstream 123. The destination device 140 may be used for decoding the tensor data (or tensors) for content (e.g., audio data, video data, image data, and textual data) of the bitstream 123.
[00062] A communication channel 130 is used to communicate the bitstream 123 from the source device 110 to the destination device 140. In some arrangements, the source device 110 and destination device 140 may either or both comprise respective mobile telephone handsets (e.g., "smartphones") or network cameras and cloud applications. The communication channel 130 may be a wired connection, such as Ethernet, or a wireless connection, such as WiFi or 5G, including connections across a Wide Area Network (WAN). The communication channel 130 may also be implemented across ad-hoc connections. Moreover, the source device 110 and the destination device 140 may comprise applications where encoded video data is captured on some computer-readable storage medium, such as a hard disk drive in a file server or memory. Although the system 100 is described as including the video source 112, which would provide the frame data 113 for a neural network targeting a computer vision application, other types of source data, such as audio or text, may be input to a suitable neural network implemented in the NN part 1 114 and an NN part 2 166.
[00063] As shown in Fig. 1, the source device 110 includes the video source 112, the NN part 1 114, the tensor encoder 116, the box encapsulator 154, and the transmitter 122. The video source 112 typically comprises a source of captured video frame data (shown as 113), such as an image capture sensor, a previously captured video sequence stored on a non-transitory recording medium, or a video feed from a remote image capture sensor. The video source 112 may also be an output of a computer graphics card, for example, displaying the video output of an operating system and various applications executing upon a computing device (e.g., a tablet computer). Examples of source devices 110 that may include an image capture sensor as the video source 112 include smart-phones, video camcorders, professional video cameras, and network video cameras. The video source 112 may produce independent images or may produce temporally sequential images, i.e., a video.
[00064] The system 100 implements a task network in a distributed manner, with a division into two parts, that is, NN part 1 114 and NN part 2 166. For example, a 'YOLOv3' network may be used as one part of an object tracking system and a 'FasterRCNN' network may be used as an object detection system. The number and dimensionality of tensors 115 depends on the particular network performed in the system 100 and the split point of the particular task network.
[00065] The NN part 1 114 receives the video frame data 113 and performs specific layers of an overall CNN, such as layers corresponding to the 'backbone' of the CNN, outputting tensors 115. The backbone (part 1) layers of the CNN may produce multiple tensors as the output 115, for example, corresponding to different spatial scales of an input image represented by the video frame data 113 when splitting the network within the FPN. An FPN may result in three tensors, corresponding to three layers, output from the backbone 114 as the tensors 115 (e.g., if a 'YOLOv3' network is performed by the system 100), with varying spatial resolution and channel count. When the system 100 is performing networks such as 'Faster RCNN X101-FPN' or 'Mask RCNN X101-FPN', the tensors 115 may include tensors for four layers (P2-P5). Use of an FPN results in a plurality of tensors forming a hierarchical representation for a single frame to be encoded to (and decoded from) the bitstream when the split point of the network occurs within the FPN, as described hereafter. The tensor encoder 116 receives the tensors 115 and produces the feature bitstream 121, containing a compressed representation of the tensors 115.
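For concreteness, the successive halving of FPN levels means that the P2-P5 tensor dimensions can be derived from the input resolution. The strides used below (4, 8, 16 and 32 for P2-P5) follow the usual FPN convention and are assumptions for illustration, not values taken from the figures:

    def fpn_tensor_shapes(frame_height, frame_width, channels=256,
                          strides=(4, 8, 16, 32)):   # P2..P5, assumed strides
        # Each FPN level halves the width and height of the previous level.
        return [(1, channels, frame_height // s, frame_width // s)
                for s in strides]

    # e.g. for a 1088x608 input frame:
    for name, shape in zip(('P2', 'P3', 'P4', 'P5'),
                           fpn_tensor_shapes(608, 1088)):
        print(name, shape)   # P2 (1, 256, 152, 272) ... P5 (1, 256, 19, 34)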
[00066] The feature bitstream 155 is supplied to the transmitter 122 for modulation for transmission over the communications channel 130 as part of the bitstream 123. Alternatively or additionally, the bitstream 123 may be written to storage 132 for later use.
[00067] Although the tensor encoder 116 and a tensor decoder 146 are described as using a block-based encoder and decoder, specifically VVC, in conjunction with other processing, different categories of codec may be used instead. An alternative type of codec is an 'end-to-end' learned codec. End-to-end learned codecs rely upon learned elements to produce a sparse tensor which may be efficiently entropy coded. An analysis stage and a synthesis stage are constructed out of learned elements such as convolutions and GDN (generalised divisive normalisation), and these stages are used to produce a sparse tensor. The sparse tensor is losslessly arithmetically encoded to produce a bitstream. Some end-to-end learned codecs make use of a hyperprior, or side-channel of information, that helps capture spatial dependencies present in the main channel. Probability distributions in the entropy coding of the main channel may be adapted based on the hyperprior to achieve improved compression performance. End-to-end learned compression schemes are typically trained for a given quality level, so instead of adapting quality by varying a quantisation parameter, different network weights may be used to select a desired quality level. One example of a learned end-to-end codec is described in an IEEE paper entitled 'Learned Image Compression with Discretized Gaussian Mixture Likelihoods and Attention Modules' by Zhengxue Cheng et al., published at the CVPR 2020 conference, informally known as the 'Cheng2020' codec.
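The end-to-end structure can be summarised in a few lines. The PyTorch skeleton below is a minimal sketch only: plain convolutions stand in for the GDN stages of a real codec such as Cheng2020, and the hyperprior and arithmetic coder are elided entirely:

    import torch
    import torch.nn as nn

    class TinyLearnedCodec(nn.Module):
        # Skeleton of an end-to-end learned codec: learned analysis to a
        # compact latent, quantisation, then learned synthesis. Real codecs
        # add GDN layers, a hyperprior and an arithmetic coder.
        def __init__(self):
            super().__init__()
            self.analysis = nn.Sequential(
                nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(64, 128, 5, stride=2, padding=2))
            self.synthesis = nn.Sequential(
                nn.ConvTranspose2d(128, 64, 5, stride=2, padding=2,
                                   output_padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 3, 5, stride=2, padding=2,
                                   output_padding=1))

        def forward(self, frame):
            latent = self.analysis(frame)
            latent_q = torch.round(latent)   # quantised latent; this is what
                                             # would be losslessly entropy coded
            return self.synthesis(latent_q)

    recon = TinyLearnedCodec()(torch.randn(1, 3, 256, 256))
    print(recon.shape)                       # torch.Size([1, 3, 256, 256])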
[00068] The bitstream 123 is transmitted by the transmitter 122 over the communication channel 130 as encoded data. The bitstream 123 can in some implementations be stored in the storage memory 132, where the storage 132 is a non-transitory storage device such as a "Flash" memory or a hard disk drive, until later being transmitted over the communication channel 130 (or in lieu of transmission over the communication channel 130). For example, encoded video data may be served upon demand to customers over a wide area network (WAN) for a video analytics application.
[00069] The destination device 140 includes a receiver 142, a tensor decoder 146, the box extractor 144, the NN part 2 166, a video decoder 170, and a task result renderer 168. The receiver 142 demodulates the bitstream 123 from the communication channel 130, sending a demodulated bitstream 143 to the box extractor 144. The box extractor 144 includes a file format parser capable of parsing high-level syntax applicable to the definition of the box structure, such as box types and box lengths in the box hierarchy, to be described with reference to Fig. 11B. A feature sub-bitstream 145 is passed from the box extractor 144 to the tensor decoder 146. A video sub-bitstream 171 is passed from the box extractor 144 to the video decoder 170. The tensor decoder 146 outputs decoded tensors 149, which are supplied to the NN part 2 166. The NN part 2 166 implements a network forming the remainder of the network begun by the NN part 1 114. The NN part 2 166 receives the tensors 149 and performs the later layers of the neural network that began with the NN part 1 114 to produce a task result 167.
[00070] The task result 167 is presented to the user, such as visually by the task result renderer 168. The task result renderer 168 draws bounding boxes or segmentation maps highlighting detected objects or instances in accordance with the task result 167, overlaid on decoded video 172 produced by the video decoder 170. The task result renderer 168 is one example of presentation of a neural network result to a user. The task result renderer 168 may be operative only when the task result 167 indicates a detected object.
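In its simplest form such a renderer reduces to drawing the task result over the decoded frame. The sketch below assumes OpenCV and a hypothetical task-result format of (label, x0, y0, x1, y1) tuples; the actual format of the task result 167 is not specified here:

    import cv2  # OpenCV

    def render_task_result(decoded_frame, task_result):
        # Overlay bounding boxes from the task result onto decoded video.
        # `task_result` is assumed to be a list of (label, x0, y0, x1, y1).
        for label, x0, y0, x1, y1 in task_result:
            cv2.rectangle(decoded_frame, (x0, y0), (x1, y1), (0, 255, 0), 2)
            cv2.putText(decoded_frame, label, (x0, max(y0 - 5, 0)),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
        return decoded_frame

    # Invoked only when the task result indicates a detection:
    # if task_result: frame = render_task_result(frame, task_result)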
[00071] It is also possible for the functionality of each of the source device 110 and the destination device 140 to be embodied in a single device, examples of which include mobile telephone handsets, tablet computers and cloud applications.
[00072] Notwithstanding the example devices mentioned above, each of the source device 110 and destination device 140 may be configured within a general-purpose computing system, typically through a combination of hardware and software components. Fig. 2A illustrates such a computer system 200, which includes: a computer module 201; input devices such as a keyboard 202, a mouse pointer device 203, a scanner 226, a camera 227, which may be configured as the video source 112, and a microphone 280; and output devices including a printer 215, a display device 214 and loudspeakers 217. An external Modulator-Demodulator (Modem) transceiver device 216 may be used by the computer module 201 for communicating to and from a communications network 220 via a connection 221. The communications network 220, which may represent the communication channel 130, may be a WAN, such as the Internet, a cellular telecommunications network, or a private WAN. Where the connection 221 is a telephone line, the modem 216 may be a traditional "dial-up" modem. Alternatively, where the connection 221 is a high capacity (e.g., cable or optical) connection, the modem 216 may be a broadband modem. A wireless modem may also be used for wireless connection to the communications network 220. The transceiver device 216 may provide the functionality of the transmitter 122 and the receiver 142, and the communication channel 130 may be embodied in the connection 221.
[00073] The computer module 201 typically includes at least one processor unit 205, and a memory unit 206. For example, the memory unit 206 may have semiconductor random access memory (RAM) and semiconductor read only memory (ROM). The computer module 201 also includes a number of input/output (I/O) interfaces including: an audio-video interface 207 that couples to the video display 214, loudspeakers 217 and microphone 280; an I/O interface 213 that couples to the keyboard 202, mouse 203, scanner 226, camera 227 and optionally a joystick or other human interface device (not illustrated); and an interface 208 for the external modem 216 and printer 215. The signal from the audio-video interface 207 to the computer monitor 214 is generally the output of a computer graphics card. In some implementations, the modem 216 may be incorporated within the computer module 201, for example within the interface 208. The computer module 201 also has a local network interface 211, which permits coupling of the computer system 200 via a connection 223 to a local-area communications network 222, known as a Local Area Network (LAN). As illustrated in Fig. 2A, the local communications network 222 may also couple to the wide network 220 via a connection 224, which would typically include a so-called "firewall" device or device of similar functionality. The local network interface 211 may comprise an Ethernet™ circuit card, a Bluetooth™ wireless arrangement or an IEEE 802.11 wireless arrangement; however, numerous other types of interfaces may be practised for the interface 211. The local network interface 211 may also provide the functionality of the transmitter 122 and the receiver 142, and the communication channel 130 may also be embodied in the local communications network 222.
[00074] The I/O interfaces 208 and 213 may afford either or both of serial and parallel connectivity, the former typically being implemented according to the Universal Serial Bus (USB) standards and having corresponding USB connectors (not illustrated). Storage devices 209 are provided and typically include a hard disk drive (HDD) 210. Other storage devices such as a floppy disk drive and a magnetic tape drive (not illustrated) may also be used. An optical disk drive 212 is typically provided to act as a non-volatile source of data. Portable memory devices, such as optical disks (e.g. CD-ROM, DVD, Blu-ray Disc™), USB-RAM, portable external hard drives, and floppy disks, for example, may be used as appropriate sources of data to the computer system 200. Typically, any of the HDD 210, optical drive 212, networks 220 and 222 may also be configured to operate as the video source 112, or as a destination for decoded video data to be stored for reproduction via the display 214. The source device 110 and the destination device 140 of the system 100 may be embodied in the computer system 200.
[00075] The components 205 to 213 of the computer module 201 typically communicate via an interconnected bus 204 and in a manner that results in a conventional mode of operation of the computer system 200 known to those in the relevant art. For example, the processor 205 is coupled to the system bus 204 using a connection 218. Likewise, the memory 206 and optical disk drive 212 are coupled to the system bus 204 by connections 219. Examples of computers on which the described arrangements can be practised include IBM-PCs and compatibles, Sun SPARCstations, Apple Mac™ or alike computer systems.
[00076] The video encoder 150, NN part 1 114, tensor encoder 116, box encapsulator 154, modules 122 and 142, the box extractor 144, tensor decoder 146, NN part 2 166, video decoder 170 and result renderer 168, and the methods to be described, may be implemented as one or more software application programs 233 executable within the computer system 200. In particular, the modules of the devices 110 and 140 and the steps of the described methods are effected by instructions 231 (see Fig. 2B) in the software 233 that are carried out within the computer system 200. The software instructions 231 may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first part and the corresponding code modules perform the described methods and a second part and the corresponding code modules manage a user interface between the first part and the user.
[00077] The software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer system 200 from the computer readable medium, and then executed by the computer system 200. A computer readable medium having such software or computer program recorded on the computer readable medium is a computer program product. The use of the computer program product in the computer system 200 preferably effects an advantageous apparatus for implementing the source device 110 and the destination device 140 and the described methods.
[00078] The software 233 is typically stored in the HDD 210 or the memory 206. The software is loaded into the computer system 200 from a computer readable medium, and executed by the computer system 200. Thus, for example, the software 233 may be stored on an optically readable disk storage medium (e.g., CD-ROM) 225 that is read by the optical disk drive 212.
[00079] In some instances, the application programs 233 may be supplied to the user encoded on one or more CD-ROMs 225 and read via the corresponding drive 212, or alternatively may be read by the user from the networks 220 or 222. Still further, the software can also be loaded into the computer system 200 from other computer readable media. Computer readable storage media refers to any non-transitory tangible storage medium that provides recorded instructions and/or data to the computer system 200 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROM, DVD, Blu-ray Disc™, a hard disk drive, a ROM or integrated circuit, USB memory, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external to the computer module 201. Examples of transitory or non-tangible computer readable transmission media that may also participate in the provision of the software, application programs, instructions and/or video data or encoded video data to the computer module 201 include radio or infra-red transmission channels, as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.
[00080] The second part of the application program 233 and the corresponding code modules mentioned above may be executed to implement one or more graphical user interfaces (GUIs) to be rendered or otherwise represented upon the display 214. Through manipulation of typically the keyboard 202 and the mouse 203, a user of the computer system 200 and the application may manipulate the interface in a functionally adaptable manner to provide controlling commands and/or input to the applications associated with the GUI(s). Other forms of functionally adaptable user interfaces may also be implemented, such as an audio interface utilizing speech prompts output via the loudspeakers 217 and user voice commands input via the microphone 280.
[00081] Fig. 2B is a detailed schematic block diagram of the processor 205 and a "memory" 234. The memory 234 represents a logical aggregation of all the memory modules (including the storage devices 209 and semiconductor memory 206) that can be accessed by the computer module 201 in Fig. 2A.
[00082] When the computer module 201 is initially powered up, a power-on self-test (POST) program 250 executes. The POST program 250 is typically stored in a ROM 249 of the semiconductor memory 206 of Fig. 2A. A hardware device such as the ROM 249 storing software is sometimes referred to as firmware. The POST program 250 examines hardware within the computer module 201 to ensure proper functioning and typically checks the processor 205, the memory 234 (209, 206), and a basic input-output systems software (BIOS) module 251, also typically stored in the ROM 249, for correct operation. Once the POST program 250 has run successfully, the BIOS 251 activates the hard disk drive 210 of Fig. 2A. Activation of the hard disk drive 210 causes a bootstrap loader program 252 that is resident on the hard disk drive 210 to execute via the processor 205. This loads an operating system 253 into the RAM memory 206, upon which the operating system 253 commences operation. The operating system 253 is a system level application, executable by the processor 205, to fulfil various high level functions, including processor management, memory management, device management, storage management, software application interface, and generic user interface.
[00083] The operating system 253 manages the memory 234 (209, 206) to ensure that each process or application running on the computer module 201 has sufficient memory in which to execute without colliding with memory allocated to another process. Furthermore, the different types of memory available in the computer system 200 of Fig. 2A need to be used properly so that each process can run effectively. Accordingly, the aggregated memory 234 is not intended to illustrate how particular segments of memory are allocated (unless otherwise stated), but rather to provide a general view of the memory accessible by the computer system 200 and how such memory is used.
[00084] As shown in Fig. 2B, the processor 205 includes a number of functional modules including a control unit 239, an arithmetic logic unit (ALU) 240, and a local or internal memory 248, sometimes called a cache memory. The cache memory 248 typically includes a number of storage registers 244-246 in a register section. One or more internal busses 241 functionally interconnect these functional modules. The processor 205 typically also has one or more interfaces 242 for communicating with external devices via the system bus 204, using the connection 218. The memory 234 is coupled to the bus 204 using the connection 219.
[00085] The application program 233 includes a sequence of instructions 231 that may include conditional branch and loop instructions. The program 233 may also include data 232 which is used in execution of the program 233. The instructions 231 and the data 232 are stored in memory locations 228, 229, 230 and 235, 236, 237, respectively. Depending upon the relative size of the instructions 231 and the memory locations 228-230, a particular instruction may be stored in a single memory location as depicted by the instruction shown in the memory location 230. Alternately, an instruction may be segmented into a number of parts each of which is stored in a separate memory location, as depicted by the instruction segments shown in the memory locations 228 and 229.
[00086] In general, the processor 205 is given a set of instructions which are executed therein. The processor 205 waits for a subsequent input, to which the processor 205 reacts by executing another set of instructions. Each input may be provided from one or more of a number of sources, including data generated by one or more of the input devices 202, 203, data received from an external source across one of the networks 220 or 222, data retrieved from one of the storage devices 206, 209, or data retrieved from a storage medium 225 inserted into the corresponding reader 212, all depicted in Fig. 2A. The execution of a set of the instructions may in some cases result in output of data. Execution may also involve storing data or variables to the memory 234.
[00087] The tensor encoder 116, the tensor decoder 146 and the described methods may use input variables 254, which are stored in the memory 234 in corresponding memory locations 255, 256, 257. The tensor encoder 116, the tensor decoder 146 and the described methods produce output variables 261, which are stored in the memory 234 in corresponding memory locations 262, 263, 264. Intermediate variables 258 may be stored in memory locations 259, 260, 266 and 267.
[00088] Referring to the processor 205 of Fig. 2B, the registers 244, 245, 246, the arithmetic logic unit (ALU) 240, and the control unit 239 work together to perform sequences of micro-operations needed to perform "fetch, decode, and execute" cycles for every instruction in the instruction set making up the program 233. Each fetch, decode, and execute cycle comprises:

a fetch operation, which fetches or reads an instruction 231 from a memory location 228, 229, 230;

a decode operation in which the control unit 239 determines which instruction has been fetched; and

an execute operation in which the control unit 239 and/or the ALU 240 execute the instruction.
[00089] Thereafter, a further fetch, decode, and execute cycle for the next instruction may be executed. Similarly, a store cycle may be performed by which the control unit 239 stores or writes a value to a memory location 232.
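By way of illustration only, the cycle described in paragraphs [00088] and [00089] may be modelled in Python as follows. The instruction format, the operation names and the memory layout are invented for the example and do not correspond to the actual processor 205:

    # Toy model of the fetch, decode and execute cycle; "add" stands in for
    # ALU work and "store" for the store cycle. All names are hypothetical.
    def run(program, memory):
        for instruction in program:                 # fetch operation
            op, dst, a, b = instruction             # decode operation
            if op == "add":                         # execute operation (ALU)
                memory[dst] = memory[a] + memory[b]
            elif op == "store":                     # store cycle: write a value
                memory[dst] = a

    memory = {"x": 1, "y": 2, "z": 0}
    run([("add", "z", "x", "y"), ("store", "x", 7, None)], memory)
    # memory is now {"x": 7, "y": 2, "z": 3}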
[00090] Each step or sub-process in the methods of Figs. 18 and 19, to be described, is associated with one or more segments of the program 233 and is typically performed by the register section 244, 245, 246, the ALU 240, and the control unit 239 in the processor 205 working together to perform the fetch, decode, and execute cycles for every instruction in the instruction set for the noted segments of the program 233.
[00091] Fig. 3A is a schematic block diagram 300 showing functional modules of a backbone portion 310 of a CNN, which may serve as an implementation of the NN part 1 114 when the system 100 is configured to perform a 'YOLOv3' network. The NN part 1 114 is sometimes referred to as 'DarkNet-53', although different backbones are also possible, resulting in a different number and dimensionality of layers of the tensors 115 for each frame. In one implementation, the backbone portion 310 may be used as a person detector for the purpose of object tracking.
[00092] As shown in Fig. 3A, the video data 113 is passed to a resizer module 304. The resizer module 304 resizes each frame of the video data 113 to a resolution suitable for processing by the CNN backbone 310, producing resized frame data 312. If the resolution of the video data 113 is already suitable for the CNN backbone 310, operation of the resizer module 304 is not needed. The resized frame data 312 is passed to a convolutional batch normalisation leaky rectified linear (CBL) module 314 to produce tensors 316. The CBL module 314 contains modules as described with reference to a CBL module 360 as shown in Fig. 3D.
[00093] The CBL module 360 takes as input a tensor 361 of the resized frame data 312. The tensor 361 is passed to a convolutional layer 362 to produce tensor 363. If the convolutional layer 362 has a stride of one, the tensor 363 has the same spatial dimensions as the tensor 361. If the convolution layer 362 has a larger stride, such as two, the tensor 363 has smaller spatial dimensions compared to the tensor 361, for example, halved in width and height for the stride of two. Regardless of the stride, the size of the channel dimension of the tensor 363 may vary compared to the channel dimension of the tensor 361 for a particular CBL block. The tensor 363 is passed to a batch normalisation module 364, which outputs a tensor 365. The batch normalisation module 364 normalises the input tensor 363 and applies a scaling factor and an offset value to produce the output tensor 365. The scaling factor and offset value are derived from a training process. The tensor 365 is passed to a leaky rectified linear activation ("LeakyReLU") module 366 to produce a tensor 367. The module 366 provides a 'leaky' activation function whereby positive values in the tensor are passed through and negative values are severely reduced in magnitude, for example, to 0.1× their former value.
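By way of illustration only, a CBL module of the kind just described may be sketched in PyTorch as follows. The use of PyTorch, the class name CBL and the default kernel size are assumptions made for the example rather than features of the described arrangements:

    from torch import nn

    class CBL(nn.Module):
        # Convolution (362) -> batch normalisation (364) -> LeakyReLU (366).
        def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=None):
            super().__init__()
            if padding is None:
                padding = kernel_size // 2
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                                  stride=stride, padding=padding, bias=False)
            self.bn = nn.BatchNorm2d(out_ch)  # scaling factor and offset from training
            self.act = nn.LeakyReLU(0.1)      # negative values reduced to 0.1x

        def forward(self, x):                 # x corresponds to the tensor 361
            return self.act(self.bn(self.conv(x)))  # the tensor 367

With stride=1 the spatial dimensions are preserved; with stride=2 the width and height are halved, consistent with the behaviour described above.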
[00094] Returning to Fig. 3A, the tensor 316 is passed from the CBL block 314 to a residual block module 320, such as a 'res1+2+8' module (also referred to as a res11 module) containing a concatenation of three residual blocks, the residual blocks containing one (1), two (2), and eight (8) residual units, respectively. The spatial resolution of the tensors is halved horizontally and halved vertically in each of the residual blocks (see Fig. 3B) by a convolution with stride equal to two in a CBL block 344.
[00095] A residual block is described with reference to a ResBlock 340 as shown in Fig. 3B. The ResBlock 340 receives a tensor 341 (e.g., the tensor 316). The tensor 341 is zero-padded by a zero-padding module 342 to produce a tensor 343. The tensor 343 is passed to a CBL module 344 to produce a tensor 345. The CBL module 344 contains a convolution (for example similar to 362) with a stride parameter set to two, resulting in the tensor 345 having half the width and half the height of the tensor 343. The tensor 345 is passed to a residual unit 346. The residual unit 346 contains a series of concatenated residual units, based on the number of the residual block (for example, eleven (11) units for the block 320). The last residual unit of the residual units 346 outputs a tensor 347.
[00096] A residual unit is described with reference to a ResUnit 350 as shown in Fig. 3C. The ResUnit 350 takes a tensor 351 (for example the tensor 345) as input. The tensor 351 is passed to a CBL module 352 to produce a tensor 353. The tensor 353 is passed to a second CBL unit 354 to produce a tensor 355. An add module 356 sums the tensor 355 with the tensor 351 to produce a tensor 357. The add module 356 may also be referred to as a 'shortcut' as the input tensor 351 substantially influences the output tensor 357. For an untrained network, the ResUnit 350 acts to pass through tensors. As training is performed, the CBL modules 352 and 354 act to deviate the tensor 357 away from the tensor 351 in accordance with training data and ground truth data.
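Continuing the illustrative sketch begun after paragraph [00093] (and reusing the CBL class defined there), the ResUnit 350 and ResBlock 340 may be expressed as below. The 1x1 bottleneck in the first CBL and the exact zero-padding arrangement are assumptions drawn from common YOLOv3 implementations, not statements of the described arrangements:

    class ResUnit(nn.Module):
        # Two CBL modules (352, 354) and the 'shortcut' add module (356).
        def __init__(self, ch):
            super().__init__()
            self.cbl1 = CBL(ch, ch // 2, kernel_size=1)  # bottleneck is an assumption
            self.cbl2 = CBL(ch // 2, ch, kernel_size=3)

        def forward(self, x):                   # x corresponds to the tensor 351
            return x + self.cbl2(self.cbl1(x))  # the tensor 357

    class ResBlock(nn.Module):
        # Zero padding (342), a stride-two CBL (344) and a run of residual units (346).
        def __init__(self, in_ch, out_ch, num_units):
            super().__init__()
            self.pad = nn.ZeroPad2d((1, 0, 1, 0))  # pad left and top only (assumption)
            self.down = CBL(in_ch, out_ch, kernel_size=3, stride=2, padding=0)
            self.units = nn.Sequential(*[ResUnit(out_ch) for _ in range(num_units)])

        def forward(self, x):                   # x corresponds to the tensor 341
            return self.units(self.down(self.pad(x)))  # the tensor 347

The shortcut addition ensures the input tensor 351 strongly influences the output tensor 357, consistent with the pass-through behaviour noted above.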
[00097] Returning to Fig. 3A, the Res11 module 320 outputs a tensor 322. The tensor 322 is output from the backbone module 310 as one of the layers and also provided to a Res8 module 324. The Res8 module 324 is a residual block (i.e., 340), which includes eight residual units (i.e., 350). The Res8 module 324 produces a tensor 326. The tensor 326 is passed to a Res4 module 328 and output from the backbone module 310 as one of the layers. The Res4 module is a residual block (i.e., 340), which includes four residual units (i.e., 350). The Res4 module 328 produces a tensor 329. The tensor 329 is output from the backbone module 310 as one of the layers. Collectively, the layer tensors 322, 326, and 329 are output as the tensors 115 and may be referred to as layers 0-2 or L0, L1, and L2, respectively. The backbone CNN 310 may take as input a video frame of resolution 1088×608 and produce three tensors, corresponding to three layers, with the following dimensions: [1, 256, 76, 136], [1, 512, 38, 68], [1, 1024, 19, 34]. Another example of the three tensors 115 corresponding to three layers may be [1, 512, 34, 19], [1, 256, 68, 38], [1, 128, 136, 76], which are respectively separated at layer index 75, 90, and 105 when the layers are enumerated according to the YOLOv3 software implementation of the backbone 300 and a head 1200.
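The first set of dimensions quoted above is consistent with cumulative downsampling factors (output strides) of 8, 16 and 32 for the three layers, which may be verified with a short calculation. The strides themselves are an inference from the quoted dimensions, not a statement of Fig. 3A:

    # For a 1088x608 frame; dimensions are [batch, channels, height, width].
    width, height = 1088, 608
    for stride, channels in ((8, 256), (16, 512), (32, 1024)):
        print([1, channels, height // stride, width // stride])
    # prints [1, 256, 76, 136], [1, 512, 38, 68] and [1, 1024, 19, 34]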
[00098] Each of the Res11 320, Res8 324 and Res4 328 operates in a similar manner to the ResBlock 340. Each of the CBL 314, the CBL 344 and the CBL 354 operates in a similar manner to the CBL 360.
[00099] Fig. 4 is a schematic block diagram showing functional modules of an alternative backbone portion 400 of a CNN, which may serve as an implementation of the NN part 1 114 when the system 100 is configured to perform a "FasterRCNN" or "MaskRCNN" ResNet 101 network. Frame data 113 is input and passes through a stem network 408, a res2 module 412, a res3 module 416, a res4 module 420, and a res5 module 424, via tensors 409, 413, 417, 421, and 425 respectively. The backbone portion 400 may be used as part of a general object detector or for instance segmentation, with various classes of object supported.
[000100] The stem network 408 includes a 7x7 convolution with a stride of two (2) and a max pooling operation. The res2 module 412, the res3 module 416, the res4 module 420 and the res5 module 424 perform convolution operations and activations, such as LeakyReLU. Each module 412, 416, 420 and 424 also performs one halving of the width and height of the processed tensors via a stride setting of two. Each of the tensors 413, 417, 421 and 425 is passed to one of 1x1 lateral convolution modules 446, 444, 442 and 440 respectively. The modules 446, 444, 442, and 440 produce tensors 447, 445, 443 and 441 respectively. The tensor 441 is passed to a 3x3 output convolution module 470, which produces an output tensor P5 471.
[000101] The tensor 441 is also passed to an upsampler module 450 to produce an upsampled tensor 451. A summation module 460 sums the tensors 443 and 451 to produce a tensor 461. The tensor 461 is passed to an upsampler module 452 and a 3x3 lateral convolution module 472. The module 472 outputs a P4 tensor 473. The upsampler module 452 produces an upsampled tensor 453. A summation module 462 sums tensors 445 and 453 to produce a tensor 463. The tensor 463 is passed to a 3x3 lateral convolution module 474 and an upsampler module 454. The module 474 outputs a P3 tensor 475. The upsampler module 454 outputs an upsampled tensor 455. A summation module 464 sums the tensors 447 and 455 to produce a tensor 465, which is passed to a 3x3 lateral convolution module 476. The module 476 outputs a P2 tensor 477. The upsampler modules 450, 452, and 454 use nearest neighbour interpolation for low computational complexity. The tensors 471, 473, 475, and 477 form the output tensor 115 of the CNN backbone 400 (NN part 1 114). Although Fig. 4 shows a particular backbone portion of the Faster RCNN network architecture (a 'P-layer' split point), different divisions into backbone and head are possible. Splitting the network at tensor 409 is termed a 'stem' split point. Splitting the network at tensors 447, 445, 443, and 441 is termed a 'C-layer' split point.
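For illustration only, the lateral, upsampling, summation and output convolutions of Fig. 4 may be sketched in PyTorch as below. The class name, the input channel counts of (256, 512, 1024, 2048) and the output channel count of 256 are assumptions drawn from common ResNet-101 feature pyramid configurations, not statements of the described arrangements:

    from torch import nn
    import torch.nn.functional as F

    class FPNNeck(nn.Module):
        # 1x1 lateral convolutions (446, 444, 442, 440), nearest-neighbour
        # upsamplers (450, 452, 454), summations (460, 462, 464) and 3x3
        # convolutions (470, 472, 474, 476) producing the P5..P2 tensors.
        def __init__(self, c_channels=(256, 512, 1024, 2048), f=256):
            super().__init__()
            self.lateral = nn.ModuleList(
                nn.Conv2d(c, f, kernel_size=1) for c in c_channels)
            self.output = nn.ModuleList(
                nn.Conv2d(f, f, kernel_size=3, padding=1) for _ in c_channels)

        def forward(self, c2, c3, c4, c5):  # the tensors 413, 417, 421, 425
            l2, l3, l4, l5 = (lat(c) for lat, c in
                              zip(self.lateral, (c2, c3, c4, c5)))
            p5 = self.output[3](l5)                                      # tensor 471
            m4 = l4 + F.interpolate(l5, scale_factor=2, mode="nearest")  # tensor 461
            p4 = self.output[2](m4)                                      # tensor 473
            m3 = l3 + F.interpolate(m4, scale_factor=2, mode="nearest")  # tensor 463
            p3 = self.output[1](m3)                                      # tensor 475
            m2 = l2 + F.interpolate(m3, scale_factor=2, mode="nearest")  # tensor 465
            p2 = self.output[0](m2)                                      # tensor 477
            return p2, p3, p4, p5    # forming the output tensors 115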
[000102] Fig. 5 is a schematic block diagram 500 of the tensor encoder 116 using a configurable tensor compressor stage. Fig. 5 also includes a metadata encoder 544, which provides an implementation of the box encapsulator 154. In other implementations, the box encapsulator 154 may be external to the tensor encoder 116. Fig. 18 shows a method 1800 for performing an NN part 1, encoding an indication of the codec used for the tensor compression stage, i.e., 116, and encoding resulting compressed tensors along with compressed video and an indication of the codec used for the compressed video. The tensors are related to the video as outlined in relation to operation of the video encoder 150 and the tensor encoder 116. The method 1800 may be implemented as one or more software application programs 233 executable within the computer system 200. The method 1800 may be effected by instructions 231 (see Fig. 2B) in the software 233 that are carried out within the computer system 200. The software instructions 231 may be formed as one or more code modules, each for performing one or more particular tasks. The method 1800 commences at an encode video frame step 1810.
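The per-frame control flow of the method 1800, as described in the following paragraphs, may be summarised in the sketch below. The function names and stub return values are hypothetical placeholders introduced for the example; only the step numbers and their ordering come from the description:

    # Hypothetical stubs standing in for the components of Fig. 5 and Fig. 11B.
    def encode_video_frame(frame):
        return b"video NAL units"            # video encoder 150 (stub)

    def nn_part1(frame):
        return ["backbone layer tensors"]    # NN part 1 114 (stub)

    def compress_tensors(tensors):
        return b"feature NAL units"          # tensor compressor 530 (stub)

    def method_1800(frame_113):
        boxes = []
        video_151 = encode_video_frame(frame_113)                   # step 1810
        tensors_115 = nn_part1(frame_113)                           # step 1820
        compressed_532 = compress_tensors(tensors_115)              # step 1830
        boxes.append("MovieHeaderBox 11214")                        # step 1840
        boxes.append("TrackBox 11218: video codec indication")      # step 1850
        boxes.append("VideoFeatureGroupBox 11225: 'vftg', flag 0")  # step 1855
        boxes.append(("MediaBox 11260", video_151))                 # step 1860
        boxes.append("TrackBox 11238: feature codec indication")    # step 1870
        boxes.append("VideoFeatureGroupBox 11241: 'vftg', flag 1")  # step 1875
        boxes.append(("MediaBox 11260", compressed_532))            # step 1880
        return boxes                                                # bitstream 155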
[000103] At the step 1810 the video encoder 150, under execution of the processor 205, encodes the frame data 113 to produce the video bitstream 151. Example operation of an encoder is described in relation to Fig. 8. Control in the processor 205 progresses from the step 1810 to a perform neural network first portion step 1820.
[000104] At the step 1820 the NN part 1 114, under execution of the processor 205, performs the first portion of a neural network using frame data 113 from the video source 112 as input, producing the tensors 115 as output. The tensors 115 are generated by performing at least one convolution on the video data 113. Examples of operation of the NN part 1 114 are described in relation to Figs. 3 and 4. Control in the processor 205 progresses from the step 1820 to a perform tensor compression step 1830.
[000105] At the step 1830, the tensor encoder 116 operates to compress the tensors 115. Fig. 5 shows an example implementation 500 of the tensor encoder 116. In the example of Fig. 5, at the step 1830, a compressor 530, under execution of the processor 205, compresses the tensors 115 to produce compressed tensors 532. The compressed tensors 532 are the same or fewer in number than the tensors 115 and reduced in dimensionality (i.e., reduced in either or both of channel count and feature map width and height). The tensor compressor may perform dimensionality reduction using a set of trained network layers, as described with reference to Fig. 6, or using methods based on low-rank approximations of the tensors 115, as described with reference to Fig. 7. Further example operation of the tensor encoder 116 is described in relation to Fig. 5. Control in the processor 205 progresses from the step 1830 to a write file header step 1840.
[000106] At the step 1840, the metadata encoder 544 encodes a movie header MovieHeaderBox 11214 into a presentation 11200, as shown in Fig. 11B. The movie header 11214 includes creation and modification times for the presentation and the next track ID for the track to be added to the presentation, typically a video track associated with a feature track for the system 100. Control in the processor 205 progresses from the step 1840 to an encode video frame codec indication step 1850.
[000107] At the step 1850, the metadata encoder 544 encodes a track TrackBox 11218 and all child boxes as described with reference to Fig. 11B. The TrackBox 11218 and child boxes are encoded such that the video codec used to encode the video layer bitstream 151 is indicated by virtue of the type of the child boxes within a sample table SampleTableBox 11230, i.e., boxes 11234a and 11234b, which indicate use of the VVC codec in Fig. 11B. Other block-based codecs, such as AVC or HEVC, may be indicated for use in storing samples of the video data. The step 1850 can be considered to encode first information into the bitstream, the first information being used to determine a first codec for the video. Control in the processor 205 progresses from the step 1850 to an encode track grouping for video track step 1855.
[000108] At the step 1855 the metadata encoder 544 encodes a VideoFeatureGroupBox 11225, which is an instance of a VideoFeatureGroupBox (track_group_type set to 'vftg'), which extends TrackGroupTypeBox and indicates that the containing track is one of two tracks associated with each other as a video track and a feature track. A flag, such as a 'FeatureStreamFlag' in the VideoFeatureGroupBox, indicates that the track containing this instance of the VideoFeatureGroupBox is a video track (flag value of zero). Control in the processor 205 progresses from the step 1855 to a write encoded video frame step 1860.
[000109] At the step 1860, the box encapsulator 154 inserts the video bitstream 151 into a media box MediaBox 11260 in the presentation 11200 of Fig. 11B. NAL units of the video bitstream 151 are separately stored such that each one is addressable by sample entries in a sample description box, such as SampleDescriptionBox 11230. Control in the processor 205 progresses from the step 1860 to an encode feature frame codec indication step 1870.
[000110] At the step 1870, the metadata encoder 544 encodes a track TrackBox 11238 and all child boxes as described with reference to Fig. 11B such that the feature codec used for the feature bitstream 121 is indicated by virtue of the type of the child boxes within a sample table SampleTableBox 11240. Boxes 11242a and 11242b indicate a codec, such as VVC or a specific end-to-end learned codec, used for feature compression. The step 1870 can be considered to encode second information into the bitstream, the second information being used to determine a second codec for the tensors, the second codec capable of decoding tensors generated by a neural network. The first information is encoded into a first box in the bitstream at step 1850 and the second information is encoded into a second box different from the first box in the bitstream at step 1870. The first information defines the first codec used to decode the video from the bitstream and the second information defines the second codec used to decode the tensor from the bitstream.
[000111] As the feature track is associated with, or references, the video track, a TrackReferenceBox 11223a (type 'tref') is included in the TrackBox 11238. A TrackReferenceTypeBox indicates the type of reference via a 'reference_type' uint32 (unsigned integer 32), which may be an auxiliary 'auxl' value or another custom defined reference_type indicative of the nature of the track as a feature track, such as 'axft', shown as 11223b. Within the TrackReferenceTypeBox 11223b the track_id of the video track 11218 is included. Due to the use of independent codecs for the video track and the feature track, and the nature of the feature data, there is no coding dependency of the feature track on the video track, notwithstanding the presence of the TrackReferenceBox 11223a in the feature stream TrackBox 11238.
[000112] Control in the processor 205 progresses from the step 1870 to an encode track grouping for feature track step 1875.
[000113] At the step 1875, the metadata encoder 544 encodes a VideoFeatureGroupBox 11241, which is an instance of a VideoFeatureGroupBox (track_group_type set to 'vftg'), which extends TrackGroupTypeBox and indicates that the containing track is one of two tracks associated with each other as a video track and a feature track. A flag, such as a 'FeatureStreamFlag' in the VideoFeatureGroupBox, indicates that the track containing this instance of the VideoFeatureGroupBox is a feature track (flag value of one). Control in the processor 205 progresses from the step 1875 to a write encoded feature frame step 1880.
[000114] At the step 1880, the box encapsulator 154 inserts the feature bitstream 121 into the MediaBox 11260 in the presentation 11200. Compressed data corresponding to tensors for a given frame, which may be in the form of NAL units, are separately stored such that each one is addressable by sample entries in a SampleDescriptionBox 11240.
[000115] The box encapsulator 154 uses the result of operations of steps 1840 to 1880 to form the box-encapsulated bitstream 155 including both the feature bitstream 121 and the video bitstream 151.
[000116] The step 1860 operates to encode the video into the bitstream 155 according to the first codec encoded at step 1850. The step 1880 operates to encode the tensors 121 into the bitstream 155 according to the codec encoded at step 1870. The first information encoded at step 1850 and the second information encoded at 1870 are independent of each other, and the video and the tensors are associated with each other at steps 1855 and 1875 respectively.
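The nesting of boxes that results from steps 1840 to 1880 may be summarised as below. The box names and reference numerals are those of Fig. 11B as described in the preceding paragraphs; the nested-mapping representation itself is illustrative only and is not part of the described arrangements:

    presentation_11200 = {
        "MovieHeaderBox 11214": "creation/modification times and next track ID",
        "TrackBox 11218 (video track)": {
            "SampleTableBox 11230": "boxes 11234a/11234b indicate the video codec (e.g., VVC)",
            "VideoFeatureGroupBox 11225": "track_group_type 'vftg', FeatureStreamFlag = 0",
        },
        "TrackBox 11238 (feature track)": {
            "SampleTableBox 11240": "boxes 11242a/11242b indicate the feature codec",
            "TrackReferenceBox 11223a": "TrackReferenceTypeBox 'axft' 11223b holding the video track_id",
            "VideoFeatureGroupBox 11241": "track_group_type 'vftg', FeatureStreamFlag = 1",
        },
        "MediaBox 11260": "NAL units of the video bitstream 151 and the feature bitstream 121",
    }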
[000117] The method 1800 terminates and processing in the processor 205 progresses to the next instance of the source data 113 (e.g., the next frame from the video source 112).
[000118] Fig. 6 is a schematic block diagram 600 showing one type of multi-scale feature fusion (MSFF) module, which may serve as the tensor compressor 530. The MSFF module 600 takes the tensors 115 and produces a compressed tensor 532, having reduced dimensionality compared to the tensors 115 and thus resulting in a reduction in bitrate when encoded as part of a packed frame. The MSFF module 600 uses trained network layers and requires a corresponding module in the tensor decoder 146 to restore tensor dimensionality so the tensors 149 may be supplied to the CNN head 150. The MSFC module 600 takes four tensors as input and requires each of the tensors to have two-hundred and fifty-six (256) channels, so that the MSFF module 600 is compatible with the P-layers of the FasterRCNN or MaskRCNN networks. However, variants of the MSFF module 600 compatible with different numbers of layers and different channel counts are possible.
[000119] The MSFF module 600 produces one tensor 532 as output with sixty-four (64) channels and a feature map size corresponding to the P5 layer seen at the input; however, variants with different channel counts are also possible. Each variant of the MSFF module 600 requires different weights to be used for proper operation.
[000120] The MSFC module 600 includes an MSFF block 610 shown in Fig. 6, which produces a single tensor 629 from the plurality of tensors 115 using one or more downsampling filters. The MSFF block 610, under execution of the processor 205, combines each tensor of a first set of tensors (i.e., 602, 603, 604, 605) to produce the combined tensor 629. The combined tensor 629 forms a representation of the FPN layer tensors. Downsample modules 622a, 622b, and 622c operate on the tensors having larger spatial scale, i.e., P4 604 at (2h, 2w, 256), P3 603 at (4h, 4w, 256), and P2 602 at (8h, 8w, 256), respectively. Modules 622a, 622b, and 622c perform downsampling to match the spatial scale of the smallest tensor, i.e., P5 605 at (h, w, 256), producing downscaled tensors 623a, 623b, and 623c, respectively. A concatenation module 624 performs a channel-wise concatenation of the tensors 605, 623a, 623b, and 623c to produce a concatenated tensor 625, of dimensions (h, w, 1024). The concatenated tensor 625 is passed to a squeeze and excitation (SE) module 626 to produce a tensor 627. The SE module 626 sequentially performs a global pooling, a fully-connected layer with reduction in channel count, a rectified linear unit activation, a second fully-connected layer restoring the channel count, and a sigmoid activation function to produce a scaling tensor. The tensor 625 is scaled according to the scaling tensor to produce the output as the tensor 627. The SE block 626 is capable of being trained to adaptively alter the weighting of different channels in the tensor passed through, based on a first fully-connected layer output.
[000121] The first fully-connected layer output reduces each feature map for each channel to a single value. Each single value is passed through a non-linear activation unit (ReLU) to create a conditional representation of the single value, suitable for weighting of other channels, with restoration to the full channel count performed by the second fully-connected layer. The SE block 626 is thus capable of extracting non-linear inter-channel correlation in producing the tensor 627 from the tensor 625, to a greater extent than is possible purely with convolutional (linear) layers. The tensor 627 is passed to a convolutional layer 628. The convolutional layer 628 implements one or more convolutional layers to produce the combined tensor 629, with channel count reduced to F channels, typically 256 channels (i.e., F = 256). Further reduction in the channel count is achieved by a single-scale feature compression (SSFC) module 650.
[000122] The SSFC module 650 receives the tensor 629 and applies a convolution 652 to reduce the channel count from F (256) down to C' (nominally set to 64 channels) to produce a tensor 653. The tensor 653 is then passed to a batch normalisation module 654 to produce a batch normalised tensor 655, which is passed to a hyperbolic tangent activation layer 656 to produce the compressed tensor 532. The output of the MSFC module 600 is one tensor per frame with a fixed feature map size and fixed channel count.
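A corresponding sketch of the SSFC bottleneck, again illustrative only; the 1×1 kernel size is an assumption, as the specification does not state the kernel dimensions of the convolution 652.

```python
import torch.nn as nn

class SSFC(nn.Module):
    """Sketch of single-scale feature compression: a convolution reducing
    F channels to C', batch normalisation, then a tanh activation."""
    def __init__(self, f: int = 256, c_prime: int = 64):
        super().__init__()
        self.reduce = nn.Conv2d(f, c_prime, kernel_size=1)  # convolution 652
        self.bn = nn.BatchNorm2d(c_prime)                   # module 654
        self.tanh = nn.Tanh()                               # layer 656

    def forward(self, x):
        return self.tanh(self.bn(self.reduce(x)))           # compressed tensor
```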
[000123] Fig. 7 is a schematic block diagram showing an example inter-channel decorrelation-based tensor compressor 700, which may serve as the tensor compressor 530. The tensor compressor 700 operates without need for any trained layers and relies upon extracting inter-channel redundancy at runtime (i.e., using the incoming tensor data from the CNN backbone 114 rather than relying upon pretrained weights). The tensor compressor 700 outputs three types of data: mean, basis vectors, and coefficients. The tensor compressor 700 compresses one tensor, and multiple instances of the tensor compressor 700 may be instantiated as the tensor compressor 530 to compress multiple tensors. When multiple instances of the tensor compressor 700 are instantiated, tensors of the same type from each instance may be packed into the same region. For example, basis vectors from each instance may be packed into a single region. These types of data are held in tensors and packed into three separate regions within a frame. The number of basis vectors to be used may be varied during the course of processing frames and the number of coefficients used may also be varied, although the number of coefficients must not exceed the number of basis vectors. As such, dimensionality of the basis vectors and/or coefficients may be updated from time to time to signal to the decoder the
contents of a decoded packed frame. As shown in Fig. 7, an extract channel mean module 710, under execution of the processor 205, performs an average operation on a tensor of the tensors 115 across the spatial dimensions (i.e., width and height) to produce a per-channel mean 712 as a 1D (one-dimensional) tensor of C (the channel count of a tensor of the tensors 115). The extract channel mean module 710 may operate less frequently than once per received tensor of the tensors 115, in which case the channel mean 712 is updated only on specific frames. The packed frames containing an updated channel mean 712 may be indicated via a nonzero channel count for the tensor 712 via tensor information 1195, or via a separate tensor update flag included in the SEI message 1113 (see Fig. 11A).
[000124] For each channel in a tensor of the tensors 115, a DC shift is performed by a subtraction module 714 that subtracts a value, constant for the feature map and obtained from the channel mean 712, from each spatial location in the feature map. The subtracted value is the respective value in the channel mean 712. As a result of the subtraction module 714, a zero-centred tensor 716 is output with the DC component found within each feature map removed from the respective feature map.
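A minimal NumPy sketch of the mean-extraction and DC-shift steps (illustrative only; variable names are assumptions, and a (C, H, W) layout is assumed):

```python
import numpy as np

def extract_mean_and_centre(tensor: np.ndarray):
    """Sketch of modules 710 and 714: a per-channel mean over the spatial
    axes, then a subtraction to produce a zero-centred tensor."""
    # tensor has shape (C, H, W); averaging over width and height gives a
    # 1D tensor of C values (the channel mean 712)
    channel_mean = tensor.mean(axis=(1, 2))
    # subtract the per-channel constant from every spatial location (716)
    zero_centred = tensor - channel_mean[:, None, None]
    return channel_mean, zero_centred
```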
[000125] A decomposition module 718, under execution of the processor 205, operates to produce a set of basis vectors 720 for a tensor of the tensors 115. The decomposition module 718 receives the zero-centred tensor 716 as an input and generates a set of basis vectors 720 by performing a principal component analysis (PCA) method, such as singular value decomposition (SVD) or the like. One basis vector maps all channels onto a single value using a dot product operation, so with two-hundred and fifty-six (256) channels in a tensor of the tensors 115, one basis vector has dimensions 256×1. If the decomposition module 718 produces the first N basis vectors, such as twenty-five (25), the resulting basis vectors have dimensions 256×N or 256×25. Basis vectors are relative to the origin point and so the zero-centred tensor 716 is used to ensure an orthonormal basis can be found. Each basis vector is a vector relating all the channels to a reduced set of channels. As such, the basis vectors collectively enable tensor data spanning all channels to be represented in a smaller set of basis vectors. Each basis vector is derived with all samples in each feature map for a given channel being considered. The vectors 720 contain fewer basis vectors than there are channels in a tensor of the tensors 115, corresponding to a reduction in the dimensionality of a tensor of the tensors 115. The basis vectors 720 represent a tensor of the tensors 115 in a subspace that accounts for, or 'explains', the maximum amount of variance in a tensor of the tensors 115 for the number of components in the basis vectors 720. Basis vectors are ordered from the vector with the greatest explained variance down to the vector with the least explained variance. In other words, the basis vectors 720 enable representation of a tensor of the tensors 115 with minimal degradation in quality for a given number of components, with the components being the first N ranked basis vectors.
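The decomposition can be sketched with NumPy's SVD. This is an illustration under the assumptions above; the specification does not mandate a particular SVD routine or data layout.

```python
import numpy as np

def basis_vectors(zero_centred: np.ndarray, n: int = 25) -> np.ndarray:
    """Sketch of module 718: derive the first N basis vectors (720) from
    the zero-centred tensor 716 of shape (C, H, W)."""
    c = zero_centred.shape[0]
    # treat every spatial location as one C-dimensional sample
    samples = zero_centred.reshape(c, -1).T          # (H*W, C)
    # SVD of the sample matrix; rows of vt are ordered from greatest to
    # least explained variance
    _, _, vt = np.linalg.svd(samples, full_matrices=False)
    return vt[:n].T                                  # basis of shape (C, N)
```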
[000126] As shown in Fig. 7, a dot product module 722, under execution of the processor 205, performs a dot product of each channel in the tensor 716 against each basis vector 720 to produce a coefficients tensor 724. The coefficients form a tensor 724 having the same width and height as a tensor of the tensors 115 but a channel count corresponding, at most, to the number of components (or basis vectors) produced by the decomposition module 718, which is fewer than the number of channels in a tensor of the tensors 115. The tensor compressor 700 may vary the number of channels in the coefficients tensor 724 for a given frame. The number of channels present in the coefficients tensor 724 for a given frame is included in the tensor information 1195, coded in an instance of the SEI message 1113 associated with the given frame. Each value in the coefficients tensor 724 represents the contribution of each basis vector in reproducing each value in each feature map. The tensors 724, 720 and 712 provide the compressed tensor 532.
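Continuing the sketch, the coefficients 724 are the projection of each spatial location's channel vector onto the basis, and a decoder-side reconstruction would invert the projection and re-add the mean. The function names and shapes are assumptions for illustration only.

```python
import numpy as np

def project(zero_centred: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Sketch of module 722: dot product of each spatial location's channel
    vector against each basis vector, giving coefficients of shape (N, H, W)."""
    c, h, w = zero_centred.shape
    coeffs = basis.T @ zero_centred.reshape(c, -1)   # (N, H*W)
    return coeffs.reshape(-1, h, w)

def reconstruct(coeffs, basis, channel_mean):
    """Approximate inverse: map coefficients back to C channels, add mean."""
    n, h, w = coeffs.shape
    approx = basis @ coeffs.reshape(n, -1)           # (C, H*W)
    return approx.reshape(-1, h, w) + channel_mean[:, None, None]
```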
[000127] Returning to Fig. 5, a quantiser module 534 of the encoder 116, under execution of the processor 205, quantises floating-point values in each tensor of the compressed tensors 532 to produce quantised compressed tensors 536. The quantised compressed tensors 536 have integer values and occupy a range within the sample range as defined by the operational bit depth of a video encoder 542. For example, when encoding video using 10-bit samples, integer values in the interval [0, 1023] are permitted. For each tensor, the minimum and maximum floating-point values form a quantisation range 526 provided to the metadata encoder 544.
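One plausible realisation of the quantiser 534, mapping each tensor's floating-point range onto the 10-bit sample interval. The linear mapping is an assumption; the specification does not fix the quantisation law.

```python
import numpy as np

def quantise(tensor: np.ndarray, bit_depth: int = 10):
    """Sketch of module 534: map [min, max] linearly onto [0, 2**bit_depth - 1]."""
    lo, hi = float(tensor.min()), float(tensor.max())   # quantisation range 526
    scale = (2 ** bit_depth - 1) / (hi - lo) if hi > lo else 0.0
    q = np.round((tensor - lo) * scale).astype(np.int32)
    return q, (lo, hi)   # integers, plus the range for the metadata encoder 544
```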
[000128] A packer module 538, under execution of the processor 205, packs feature maps of each tensor of the compressed quantised tensors 536 into the video frame 540. Feature maps of a tensor are generally packed in a left-to-right then top-to-bottom manner into a region in the feature frame 540 to which the tensor is assigned. A video encoder 542, under execution of the processor 205, compresses the feature frame 540 to produce the feature bitstream 121.
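A sketch of the raster-order packing performed by the packer 538. The region layout, and the assumption that the region width is a whole multiple of the feature map width, are mine and not taken from the specification.

```python
import numpy as np

def pack(maps: np.ndarray, region_w: int) -> np.ndarray:
    """Sketch of module 538: place feature maps of shape (C, h, w)
    left-to-right then top-to-bottom into a 2D region."""
    c, h, w = maps.shape
    per_row = region_w // w                      # feature maps per packed row
    rows = -(-c // per_row)                      # ceiling division
    region = np.zeros((rows * h, region_w), dtype=maps.dtype)
    for i, fmap in enumerate(maps):
        y, x = divmod(i, per_row)
        region[y * h:(y + 1) * h, x * w:(x + 1) * w] = fmap
    return region
```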
[000129] Fig. 8 is a schematic block diagram showing functional modules of a video encoder 800. The video encoder 800 provides an example implementation of each of the video encoders 542 and 150. The video encoder 800 may be implemented using a general-purpose computer system 200, as shown in Figs. 2A and 2B, where the various functional modules may be implemented by dedicated hardware within the computer system 200, or by software executable within the computer system 200 such as one or more software code modules of the software application program 233 resident on the hard disk drive 205 and being controlled in its execution by the processor 205. Alternatively, the video encoder 800 may be implemented by a combination of dedicated hardware and software executable within the computer system 200. The video encoder 800 and the described methods may alternatively be implemented in dedicated hardware, such as one or more integrated circuits performing the functions or sub-functions of the described methods. Such dedicated hardware may include graphic processing units (GPUs), digital signal processors (DSPs), application-specific standard products (ASSPs), application-specific integrated circuits (ASICs), FPGAs or one or more microprocessors and associated memories. In particular, the video encoder 800 comprises modules 810-890 which may each be implemented as one or more software code modules of the software application program 233.
[000130] Although the video encoder 542 of Fig. 8 is an example of a versatile video coding (VVC) video encoder, other video codecs may also be used to perform the processing stages described herein and the video frame codec indication encoded at step 1850. For example, HEVC may be used. The examples described generate a bitstream of encoded data. If other codecs were used, some implementations may pack data into a different format such as a frame format or the like. The video encoder 800 receives frame data 802, each frame including one or more colour channels. The frame data 802 may be in any chroma format and bit depth supported by the profile in use, for example 4:0:0 or 4:2:0 for the "Main 10" profile of the VVC standard, at eight (8) to ten (10) bits in sample precision. The frame data may relate to the video data 113 or the frame 540. The frame data 802 is also characterised by a 'level', specifying aspects such as the maximum luma sample rate, frame aspect ratio constraints, and slice and tile count limits. Worst-case compressed bitrates are also indicated by a combination of the 'level' and a 'tier' parameter.
[000131] As seen in Fig. 8, a block partitioner 810 firstly divides the frame data 802 into CTUs, generally square in shape and configured such that a particular size for the CTUs is used. The maximum enabled size of the CTUs may be 32×32, 64×64, or 128×128 luma samples for example, configured by a 'sps_log2_ctu_size_minus5' syntax element present in the 'sequence parameter set'. The CTU size also provides a maximum CU size, as a CTU with no further splitting will contain one CU. The block partitioner 810 further divides each CTU into one or more CBs according to a luma coding tree and a chroma coding tree. The luma channel may also be referred to as a primary colour channel. Each chroma channel may also be referred to as a secondary colour channel. The CBs have a variety of sizes, and may include both square and non-square aspect ratios. However, in the VVC standard, CBs, CUs, PUs, and TUs always have side lengths that are powers of two. Thus, a current CB, represented as 812, is output from the block partitioner 810, progressing in accordance with an iteration over the one or more blocks of the CTU, in accordance with the luma coding tree and the chroma coding tree of the CTU.
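The relationship implied by the name of the syntax element can be shown with a one-line computation (standard VVC semantics, given here only as a worked example):

```python
def ctu_size(sps_log2_ctu_size_minus5: int) -> int:
    """CTU size in luma samples from the SPS syntax element:
    0 -> 32, 1 -> 64, 2 -> 128."""
    return 1 << (sps_log2_ctu_size_minus5 + 5)
```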
[000132] The CTUs resulting from the first division of the frame data 802 may be scanned in raster scan order and may be grouped into one or more 'slices'. A slice may be an 'intra' (or 'I') slice. An intra slice (I slice) indicates that every CU in the slice is intra predicted. Generally, the first picture in a coded layer video sequence (CLVS) contains only I slices, and is referred to as an 'intra picture'. The CLVS may contain periodic intra pictures, forming 'random access points' (i.e., intermediate frames in a video sequence upon which decoding can commence). Alternatively, a slice may be uni- or bi-predicted (a 'P' or 'B' slice, respectively), indicating additional availability of uni- and bi-prediction in the slice, respectively.
[000133] The video encoder 800 encodes sequences of pictures according to a picture structure. One picture structure is 'low delay', in which case pictures using inter-prediction may only reference pictures occurring previously in the sequence. Low delay enables each picture to be output as soon as the picture is decoded, in addition to being stored for possible reference by a subsequent picture. Another picture structure is 'random access', whereby the coding order of pictures differs from the display order. Random access allows inter-predicted pictures to reference other pictures that, although decoded, have not yet been output. A degree of picture buffering is needed so the reference pictures in the future in terms of display order are present in the decoded picture buffer, resulting in a latency of multiple frames.
[000134] When a chroma format other than 4:0:0 is in use, in an I slice, the coding tree of each CTU may diverge below the 64×64 level into two separate coding trees, one for luma and another for chroma. Use of separate trees allows different block structures to exist between luma and chroma within a luma 64×64 area of a CTU. For example, a large chroma CB may be collocated with numerous smaller luma CBs and vice versa. In a P or B slice, a single coding tree of a CTU defines a block structure common to luma and chroma. The resulting blocks of the single tree may be intra predicted or inter predicted.
[000135] In addition to a division of pictures into slices, pictures may also be divided into 'tiles'. A tile is a sequence of CTUs covering a rectangular region of a picture. CTU scanning occurs in a raster-scan manner within each tile and progresses from one tile to the next. A slice can be either an integer number of tiles, or an integer number of consecutive rows of CTUs within a given tile.
[000136] For each CTU, the video encoder 800 as seen in Fig. 8 operates in two stages. In the first stage (referred to as a 'search' stage), the block partitioner 810 tests various potential configurations of a coding tree. Each potential configuration of a coding tree has associated 'candidate' CBs. The first stage involves testing various candidate CBs to select CBs providing relatively high compression efficiency with relatively low distortion. The testing generally involves a Lagrangian optimisation whereby a candidate CB is evaluated based on a weighted combination of rate (i.e., coding cost) and distortion (i.e., error with respect to the input frame data 802). 'Best' candidate CBs (i.e., the CBs with the lowest evaluated rate/distortion) are selected for subsequent encoding into a bitstream portion 816. Included in the evaluation of candidate CBs is an option to use a CB for a given area, or to further split the area according to various splitting options and code each of the smaller resulting areas with further CBs, or split the areas even further. As a consequence, both the coding tree and the CBs themselves are selected in the search stage.
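The Lagrangian evaluation can be summarised as J = D + λ·R, with the candidate of lowest J selected. A minimal sketch follows; the candidate representation is hypothetical and not from the specification.

```python
def best_candidate(candidates, lmbda: float):
    """Search-stage sketch: each candidate carries a distortion D and a
    rate R; minimise the Lagrangian cost J = D + lambda * R."""
    return min(candidates, key=lambda cand: cand.distortion + lmbda * cand.rate)
```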
[000137] The video encoder 800 produces a prediction block (PB), indicated by an arrow 820, for each CB, for example, the CB 812. The PB 820 is a prediction of the contents of the associated CB 812. A subtracter module 822 produces a difference, indicated as 824 (or 'residual', referring to the difference being in the spatial domain), between the PB 820 and the CB 812. The difference 824 is a block-size difference between corresponding samples in the PB 820 and the CB 812. The difference 824 is transformed, quantised and represented as a transform block (TB), indicated by an arrow 836. The PB 820 and associated TB 836 are typically chosen from one of many possible candidate CBs, for example, based on evaluated cost or distortion.
[000138] A candidate coding block (CB) is a CB resulting from one of the prediction modes available to the video encoder 800 for the associated PB and the resulting residual. When combined with the predicted PB in the video encoder 800, the TB 836 reduces the difference between a decoded CB and the original CB 812 at the expense of additional signalling in a bitstream.
[000139] Each candidate coding block (CB) (i.e., a prediction block (PB) in combination with a transform block (TB)) has an associated coding cost (or 'rate') and an associated difference (or 'distortion'). The distortion of the CB is typically estimated as a difference in sample values, such as a sum of absolute differences (SAD), a sum of squared differences (SSD), or a Hadamard transform applied to the differences. The estimate resulting from each candidate PB may be determined by a mode selector 886 using the difference 824 to determine a prediction mode 887. The prediction mode 887 indicates the decision to use a particular prediction mode for the current CB, for example, intra-frame prediction or inter-frame prediction. Estimation of the coding costs associated with each candidate prediction mode and corresponding residual coding may be performed at significantly lower cost than entropy coding of the residual. Accordingly, a number of candidate modes may be evaluated to determine an optimum mode in a rate-distortion sense, even in a real-time video encoder.
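For reference, the two simplest distortion estimates named above, written out in NumPy (a sketch; block shapes are assumed to match):

```python
import numpy as np

def sad(pb: np.ndarray, cb: np.ndarray) -> int:
    """Sum of absolute differences between prediction and source block."""
    return int(np.abs(pb.astype(np.int64) - cb.astype(np.int64)).sum())

def ssd(pb: np.ndarray, cb: np.ndarray) -> int:
    """Sum of squared differences between prediction and source block."""
    d = pb.astype(np.int64) - cb.astype(np.int64)
    return int((d * d).sum())
```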
[000140] Determining a preferred mode in terms of rate-distortion is typically achieved using a variation of Lagrangian optimisation. Lagrangian or similar optimisation processing can be employed both to select a preferred partitioning of a CTU into CBs (by the block partitioner 810) and to select a prediction mode from a plurality of possibilities. Through application of a Lagrangian optimisation process to the candidate modes in the mode selector module 886, the intra prediction mode with the lowest cost measurement is selected as a 'best' mode. The lowest cost mode includes a selected secondary transform index 888, which is also encoded in the bitstream 816 by an entropy encoder 838.
[000141] In the second stage of operation of the video encoder 800 (referred to as a 'coding' stage), an iteration over the determined coding tree(s) of each CTU is performed in the video encoder 800. For a CTU using separate trees, for each 64×64 luma region of the CTU, a luma coding tree is firstly encoded followed by a chroma coding tree. Within the luma coding tree, only luma CBs are encoded and within the chroma coding tree only chroma CBs are encoded. For a CTU using a shared tree, a single tree describes the CUs (i.e., the luma CBs and the chroma CBs) according to the common block structure of the shared tree.
[000142] The entropy encoder 838 supports bitwise coding of syntax elements using variable-length and fixed-length codewords, and an arithmetic coding mode for syntax elements. Portions of the bitstream such as 'parameter sets', for example, the sequence parameter set (SPS), the picture parameter set (PPS), and the picture header (PH), use a combination of fixed-length codewords and variable-length codewords. Slices, also referred to as contiguous portions, have a slice header that uses variable length coding followed by slice data, which uses arithmetic coding. The slice header defines parameters specific to the current slice, such as slice-level quantisation parameter offsets, and may include an instance of the PH. The slice data includes the syntax elements of each CTU in the slice. Use of variable length coding and arithmetic coding requires sequential parsing within each portion of the bitstream. The portions may be delineated with a start code to form 'network abstraction layer units' or 'NAL units'. Arithmetic coding is supported using a context-adaptive binary arithmetic coding process.
[000143] Arithmetically coded syntax elements consist of sequences of one or more 'bins'. Bins, like bits, have a value of '0' or '1'. However, bins are not encoded in the bitstream portion 816 as discrete bits. Bins have an associated predicted (or 'likely' or 'most probable') value and an associated probability, known as a 'context'. When the actual bin to be coded matches the predicted value, a 'most probable symbol' (MPS) is coded. Coding a most probable symbol is relatively inexpensive in terms of consumed bits in the bitstream portion 816, including costs that amount to less than one discrete bit. When the actual bin to be coded mismatches the likely value, a 'least probable symbol' (LPS) is coded. Coding a least probable symbol has a relatively high cost in terms of consumed bits. The bin coding techniques enable efficient coding of bins where the probability of a '0' versus a '1' is skewed. For a syntax element with two possible values (i.e., a 'flag'), a single bin is adequate. For syntax elements with many possible values, a sequence of bins is needed. The convention for converting values of a syntax element into a sequence of bins is termed 'binarisation'. Where the values '0' and '1' for a bin are equally (or near equally) likely, it is possible to omit use of a context and assume an equiprobable distribution. Bins with a context are termed 'context-coded bins' and bins omitting a context are termed 'bypass-coded bins'. The binarisation of a syntax element into one or more bins may result in a combination of context-coded and bypass-coded bins. Unlike directly coding one bit into the bitstream, a bypass-coded bin uses the arithmetic coding engine, which facilitates mixing context-coded and bypass-coded bins into syntax element binarisations.
[000144] For a given binarisation, the presence of later bins in the sequence may be determined based on the value of earlier bins in the sequence, resulting in variable-length binarisations. Additionally, each bin may be associated with more than one context, with one context selected for use in coding a specific instance of the bin. The selection of a particular context may be dependent on earlier bins in the syntax element, the decoded values of neighbouring syntax elements (i.e., those from neighbouring blocks) and the like. Each time a context-coded bin is encoded, the context that was selected for that bin (if any) is updated in a manner reflective of the new bin value. As such, the binary arithmetic coding scheme is said to be adaptive.
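As a purely conceptual illustration of context adaptation, the update can be thought of as nudging an estimated probability towards each observed bin. This simplified exponential update is mine for illustration and is not the actual state-transition mechanism used by VVC's CABAC engine.

```python
def update_context(prob_one: float, bin_value: int, rate: float = 1 / 32) -> float:
    """Nudge the estimated probability of a '1' towards the observed bin,
    so frequently observed values become cheaper to code over time."""
    return prob_one + rate * (bin_value - prob_one)
```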
[000145] The absence of a context for bypass-coded bins saves memory and reduces complexity, and thus bypass bins are used where the distribution of values for the particular bin is not skewed. One example of an entropy coder employing context and adaptation is known in the art as CABAC (context adaptive binary arithmetic coder), and many variants of this coder have been employed in video coding.
[000146] A QP controller 890 determines a quantisation parameter 892, used to establish a quantisation step size for use by a quantiser 834 and a dequantiser 840. A larger quantisation step size results in primary transform coefficients 828 being quantised into smaller values, reducing the bitrate of the bitstream portion 816 at the expense of a reduction in the fidelity of inverse transform coefficients 846.
[000147] The entropyencoder encoder 838 838 encodes encodes the the quantisation quantisation parameter parameter 892 892 and,and, if in if in useuse forfor the the
current CB, current the LFNST CB, the index LFNST index 888, 888, using using a combination a combination of context-coded of context-coded and and bypass-coded bypass-coded
bins. The bins. Thequantisation quantisation parameter parameter892 892isisencoded encodedatatthe thebeginning beginningofofeach eachslice sliceand andchanges changesinin the quantisation the quantisation parameter 892within parameter 892 withinaa slice slice are are coded coded using using a a ‘delta 'deltaQP’ QP' syntax syntax element. The element. The
delta QP delta syntax element QP syntax elementisis signalled signalled at at most most once in each once in area known each area known asasaa'quantisation ‘quantisation group'. group’. Thequantisation The quantisation parameter parameter892 892isisapplied appliedto to residual residual coefficients coefficients of ofthe theluma luma CB. Anadjusted CB. An adjusted quantisation parameter is applied to the residual coefficients of collocated chroma CBs. The quantisation parameter is applied to the residual coefficients of collocated chroma CBs. The
adjusted quantisation adjusted quantisation parameter mayinclude parameter may includemapping mapping from from the the luma luma quantisation quantisation
parameter892 parameter 892according accordingtotoa amapping mapping tableand table and a a CU-level CU-level offset,selected offset, selectedfrom froma alist list of of
offsets. The offsets. secondarytransform The secondary transformindex index888 888isissignalled signalledwhen whenthe theresidual residualassociated associatedwith withthe the transform block includes significant residual coefficients only in those coefficient positions transform block includes significant residual coefficients only in those coefficient positions
subject to transforming into primary coefficients by application of a secondary transform. subject to transforming into primary coefficients by application of a secondary transform.
[000148] Residualcoefficients
[000148] Residual coefficients of each TB associated with a CB are coded using a residual syntax. The residual syntax is designed to efficiently encode coefficients with low magnitudes, using mainly arithmetically coded bins to indicate significance of coefficients, along with lower-valued magnitudes, and reserving bypass bins for higher magnitude residual coefficients. Accordingly, residual blocks comprising very low magnitude values and sparse placement of significant coefficients are efficiently compressed. Moreover, two residual coding schemes are present. A regular residual coding scheme is optimised for TBs with significant coefficients predominantly located in the upper-left corner of the TB, as is seen when a transform is applied. A transform-skip residual coding scheme is available for TBs where a transform is not performed, and is able to efficiently encode residual coefficients regardless of their distribution throughout the TB.
[000149] A multiplexer module 884 outputs the PB 820 from an intra-frame prediction module 864 according to the determined best intra prediction mode, selected from the tested prediction mode of each candidate CB. The candidate prediction modes need not include every conceivable prediction mode supported by the video encoder 800. Intra prediction falls into three types: first, "DC intra prediction", which involves populating a PB with a single value representing the average of nearby reconstructed samples; second, "planar intra prediction", which involves populating a PB with samples according to a plane, with a DC offset and a vertical and horizontal gradient being derived from nearby reconstructed neighbouring samples, the nearby reconstructed samples typically including a row of reconstructed samples above the current PB, extending to the right of the PB to an extent, and a column of reconstructed samples to the left of the current PB, extending downwards beyond the PB to an extent; and, third, "angular intra prediction", which involves populating a PB with reconstructed neighbouring samples filtered and propagated across the PB in a particular direction (or 'angle'). In VVC, sixty-five (65) angles are supported, with rectangular blocks able to utilise additional angles, not available to square blocks, to produce a total of eighty-seven (87) angles.
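A toy sketch of the first (DC) type, which fills the PB with the average of the available neighbouring reconstructed samples. This is illustrative only; real codecs add rounding conventions and sample-availability rules not shown here.

```python
import numpy as np

def dc_predict(above: np.ndarray, left: np.ndarray, h: int, w: int) -> np.ndarray:
    """Fill an h-by-w prediction block with the mean of the neighbouring
    reconstructed samples above and to the left of the block."""
    neighbours = np.concatenate([above, left])
    return np.full((h, w), int(round(neighbours.mean())), dtype=np.int32)
```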
[000150] A fourth type of intra prediction is available to chroma PBs, whereby the PB is generated from collocated luma reconstructed samples according to a 'cross-component linear model' (CCLM) mode. Three different CCLM modes are available, each mode using a different model derived from the neighbouring luma and chroma samples. The derived model is used to generate a block of samples for the chroma PB from the collocated luma samples. Luma blocks may be intra predicted using a matrix multiplication of the reference samples, using one matrix selected from a predefined set of matrices. This matrix intra prediction (MIP) achieves gain by using matrices trained on a large set of video data, with the matrices representing relationships between reference samples and a predicted block that are not easily captured in angular, planar, or DC intra prediction modes.
[000151] The module 864 may also produce a prediction unit by copying a block from nearby the current frame using an 'intra block copy' (IBC) method. The location of the reference block is constrained to an area equivalent to one CTU, divided into 64×64 regions known as VPDUs, with the area covering the processed VPDUs of the current CTU and VPDUs of the previous CTU(s) within each row of CTUs and within each slice or tile, up to an area limit corresponding to 128×128 luma samples, regardless of the configured CTU size for the bitstream. This area is known as an 'IBC virtual buffer' and limits the IBC reference area, thus limiting the required storage. The IBC buffer is populated with reconstructed samples 854 (i.e., prior to loop filtering), and so a buffer separate from a frame buffer 872 is needed. When the CTU size is 128×128 the virtual buffer includes samples only from the CTU adjacent and to the left of the current CTU. When the CTU size is 32×32 or 64×64 the virtual buffer includes CTUs from up to the four or sixteen CTUs to the left of the current CTU. Regardless of the CTU size, access to neighbouring CTUs for obtaining samples for IBC reference blocks is constrained by boundaries such as edges of pictures, slices, or tiles. Particularly for feature maps of FPN layers having smaller dimensions, use of a CTU size such as 32×32 or 64×64 results in a reference area more aligned to cover a set of previous feature maps. Where feature map placement is ordered based on SAD, SSE or another difference metric, access to similar feature maps for IBC prediction offers a coding efficiency advantage.
[000152] The residual for a predicted block when encoding feature map data is different to the residual seen for natural video. Natural video is typically either captured by an image sensor or is screen content, as generally seen in operating system user interfaces and the like. Feature map residuals tend to contain much detail. The level of detail in feature map residuals is more amenable to transform skip coding than to the predominantly low-frequency coefficients of various transforms. An intra-predicted luma coding block may be partitioned into a set of equal-sized prediction blocks, either vertically or horizontally, each block having a minimum area of sixteen (16) luma samples.
[000153] Where previously reconstructed neighbouring samples are unavailable, for example at the edge of the frame, a default half-tone value of one half the range of the samples is used. For example, for 10-bit video a value of five hundred and twelve (512) is used. As no previous samples are available for a CB located at the top-left position of a frame, angular and planar intra-prediction modes produce the same output as the DC prediction mode (i.e., a flat plane of samples having the half-tone value as magnitude).
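The default value follows directly from the bit depth. A minimal sketch, assuming the half-range convention described above:

    def default_halftone(bit_depth: int) -> int:
        """Default sample value when no reconstructed neighbours exist:
        one half of the representable range, i.e. 2**(bit_depth - 1)."""
        return 1 << (bit_depth - 1)

    assert default_halftone(10) == 512   # 10-bit video, as in the example above
    assert default_halftone(8) == 128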
[000154] For inter-frame prediction a prediction block 882 is produced using samples from one or two frames preceding the current frame in the coding order of frames in the bitstream by a motion compensation module 880 and output as the PB 820 by the multiplexer module 884. Moreover, for inter-frame prediction, a single coding tree is typically used for both the luma channel and the chroma channels. The order of coding frames in the bitstream may differ from the order of the frames when captured or displayed. When one frame is used for prediction, the block is said to be 'uni-predicted' and has one associated motion vector. When two frames are used for prediction, the block is said to be 'bi-predicted' and has two associated motion vectors. For a P slice, each CU may be intra predicted or uni-predicted. For a B slice, each CU may be intra predicted, uni-predicted, or bi-predicted.
[000155] Frames are typically coded using a 'group of pictures' structure, enabling a temporal hierarchy of frames. Frames may be divided into multiple slices, each of which encodes a portion of the frame. A temporal hierarchy of frames allows a frame to reference a preceding and a subsequent picture in the order of displaying the frames. The images are coded in the order necessary to ensure the dependencies for decoding each frame are met. An affine inter prediction mode is available where, instead of using one or two motion vectors to select and filter reference sample blocks for a prediction unit, the prediction unit is divided into multiple smaller blocks and a motion field is produced so each smaller block has a distinct motion vector. The motion field uses the motion vectors of points near the prediction unit as 'control points'. Affine prediction allows coding of motion different to translation with less need to use deeply split coding trees. A bi-prediction mode available to VVC performs a geometric blend of the two reference blocks along a selected axis, with angle and offset from the centre of the block signalled. This geometric partitioning mode ("GPM") allows larger coding units to be used along the boundary between two objects, with the geometry of the boundary coded for the coding unit as an angle and centre offset. Motion vector differences, instead of using a cartesian (x, y) offset, may be coded as a direction (up/down/left/right) and a distance, with a set of power-of-two distances supported. The motion vector predictor is obtained from a neighbouring block ('merge mode') as if no offset is applied. The current block will share the same motion vector as the selected neighbouring block.
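As an illustrative, hypothetical sketch of the direction-and-distance coding of motion vector differences described above (assuming the four axis-aligned directions and power-of-two distances; not the normative syntax):

    # Hypothetical illustration: reconstruct a cartesian MVD from a
    # signalled direction and a power-of-two distance index.
    DIRECTIONS = {
        "right": (1, 0), "left": (-1, 0), "up": (0, -1), "down": (0, 1),
    }

    def mvd_from_direction(direction: str, distance_idx: int) -> tuple:
        """Supported distances are powers of two: 1, 2, 4, 8, ..."""
        dx, dy = DIRECTIONS[direction]
        distance = 1 << distance_idx
        return (dx * distance, dy * distance)

    print(mvd_from_direction("left", 3))  # (-8, 0): eight samples to the left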
[000156] The samples are selected according to a motion vector 878 and reference picture index. The motion vector 878 and reference picture index apply to all colour channels and thus inter prediction is described primarily in terms of operation upon PUs rather than PBs. The decomposition of each CTU into one or more inter-predicted blocks is described with a single coding tree. Inter prediction methods may vary in the number of motion parameters and their precision. Motion parameters typically comprise a reference frame index, indicating which reference frame(s) from lists of reference frames are to be used, plus a spatial translation for each of the reference frames, but may include more frames, special frames, or complex affine parameters such as scaling and rotation. In addition, a pre-determined motion refinement process may be applied to generate dense motion estimates based on referenced sample blocks.
[000157] Having determined and selected the PB 820 and subtracted the PB 820 from the original sample block at the subtractor 822, a residual with lowest coding cost, represented as 824, is obtained and subjected to lossy compression. The lossy compression process comprises the steps of transformation, quantisation and entropy coding. A forward primary transform module 826 applies a forward transform to the difference 824, converting the difference 824 from the spatial domain to the frequency domain, and producing primary transform coefficients represented by an arrow 828. The largest primary transform size in one dimension is either a 32-point DCT-2 or a 64-point DCT-2 transform, configured by a 'sps_max_luma_transform_size_64_flag' in the sequence parameter set. If the CB being encoded is larger than the largest supported primary transform size expressed as a block size (e.g., 64×64 or 32×32), the primary transform 826 is applied in a tiled manner to all samples of the difference 824. Where a non-square CB is used, tiling is also performed using the largest available transform size in each dimension of the CB. For example, when a maximum transform size of thirty-two (32) is used, a 64×16 CB uses two 32×16 primary transforms arranged in a tiled manner. When a CB is larger in size than the maximum supported transform size, the CB is filled with TBs in a tiled manner. For example, a 128×128 CB with a 64-pt maximum transform size is filled with four 64×64 TBs in a 2×2 arrangement. A 64×128 CB with a 32-pt maximum transform size is filled with eight 32×32 TBs in a 2×4 arrangement.
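The tiling rule lends itself to a short computation. The following hypothetical Python sketch reproduces the arrangements in the examples above:

    def tile_cb_into_tbs(cb_w: int, cb_h: int, max_tx: int):
        """Tile a coding block into transform blocks no larger than max_tx
        in each dimension, using the largest available size per dimension.
        Returns (tb_w, tb_h, tiles_across, tiles_down)."""
        tb_w = min(cb_w, max_tx)
        tb_h = min(cb_h, max_tx)
        return tb_w, tb_h, cb_w // tb_w, cb_h // tb_h

    print(tile_cb_into_tbs(128, 128, 64))  # (64, 64, 2, 2): four 64x64 TBs
    print(tile_cb_into_tbs(64, 128, 32))   # (32, 32, 2, 4): eight 32x32 TBs
    print(tile_cb_into_tbs(64, 16, 32))    # (32, 16, 2, 1): two 32x16 transforms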
[000158] Application of the transform 826 results in multiple TBs for the CB. Where each application of the transform operates on a TB of the difference 824 larger than 32×32, e.g., 64×64, all resulting primary transform coefficients 828 outside of the upper-left 32×32 area of the TB are set to zero (i.e., discarded). The remaining primary transform coefficients 828 are passed to the quantiser module 834. The primary transform coefficients 828 are quantised according to the quantisation parameter 892 associated with the CB to produce primary transform coefficients 832. In addition to the quantisation parameter 892, the quantiser module 834 may also apply a 'scaling list' to allow non-uniform quantisation within the TB by further scaling residual coefficients according to their spatial position within the TB. The quantisation parameter 892 may differ for a luma CB versus each chroma CB. The primary transform coefficients 832 are passed to a forward secondary transform module 830 to produce the transform coefficients represented by the arrow 836 by performing either a non-separable secondary transform (NSST) operation or bypassing the secondary transform. The forward primary transform 826 is typically separable, transforming a set of rows and then a set of columns of each TB. The forward primary transform module 826 uses either a type-II discrete cosine transform (DCT-2) in the horizontal and vertical directions, or bypass of the transform horizontally and vertically, or combinations of a type-VII discrete sine transform (DST-7) and a type-VIII discrete cosine transform (DCT-8) in either horizontal or vertical directions for luma TBs not exceeding sixteen (16) samples in width and height. Use of combinations of DST-7 and DCT-8 is referred to as 'multiple transform selection' (MTS) in the VVC standard.
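A minimal numpy sketch (an illustration, not the normative process) of the high-frequency zeroing applied to large TBs as described above:

    import numpy as np

    def zero_high_frequencies(coeffs: np.ndarray, keep: int = 32) -> np.ndarray:
        """Zero all primary transform coefficients outside the upper-left
        keep x keep region, as done for TBs larger than 32x32."""
        out = np.zeros_like(coeffs)
        out[:keep, :keep] = coeffs[:keep, :keep]
        return out

    tb = np.random.randn(64, 64)          # stand-in for a 64x64 TB of coefficients
    kept = zero_high_frequencies(tb)
    assert np.count_nonzero(kept[32:, :]) == 0 and np.count_nonzero(kept[:, 32:]) == 0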
[000159] The forward secondary transform of the module 830 is generally a non-separable transform, which is only applied for the residual of intra-predicted CUs and may nonetheless also be bypassed. The forward secondary transform operates either on sixteen (16) samples (arranged as the upper-left 4×4 sub-block of the primary transform coefficients 828) or forty-eight (48) samples (arranged as three 4×4 sub-blocks in the upper-left 8×8 coefficients of the primary transform coefficients 828) to produce a set of secondary transform coefficients. The set of secondary transform coefficients may be fewer in number than the set of primary transform coefficients from which they are derived. Due to application of the secondary transform to only a set of coefficients adjacent to each other and including the DC coefficient, the secondary transform is referred to as a 'low frequency non-separable secondary transform' (LFNST). Such secondary transforms may be obtained through a training process and, due to their non-separable nature and trained origin, exploit additional redundancy in the residual signal not able to be captured by separable transforms such as variants of DCT and DST. Moreover, when the LFNST is applied, all remaining coefficients in the TB are zero, both in the primary transform domain and the secondary transform domain.
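Conceptually, a non-separable transform is a single matrix multiplication over the vectorised input coefficients. The sketch below is hypothetical; random matrices stand in for the trained LFNST kernels:

    import numpy as np

    def lfnst_forward(primary: np.ndarray, kernel: np.ndarray) -> np.ndarray:
        """Apply a non-separable secondary transform to the upper-left 4x4
        sub-block of primary coefficients: flatten, multiply, return vector."""
        x = primary[:4, :4].reshape(16)          # 16 input samples
        return kernel @ x                         # kernel may have fewer than 16 rows

    kernel = np.random.randn(8, 16)              # stand-in for a trained kernel
    primary = np.random.randn(64, 64)
    secondary = lfnst_forward(primary, kernel)   # 8 secondary coefficients
    print(secondary.shape)                       # (8,)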
[000160] The quantisation parameter 892 is constant for a given TB and thus results in a uniform scaling for producing residual coefficients in the primary transform domain for a TB. The quantisation parameter 892 may vary periodically with a signalled 'delta quantisation parameter'. The delta quantisation parameter (delta QP) is signalled once for CUs contained within a given area, referred to as a 'quantisation group'. If a CU is larger than the quantisation group size, delta QP is signalled once with one of the TBs of the CU. That is, the delta QP is signalled by the entropy encoder 838 once for the first quantisation group of the CU and not signalled for any subsequent quantisation groups of the CU. A non-uniform scaling is also possible by application of a 'quantisation matrix', whereby the scaling factor applied for each residual coefficient is derived from a combination of the quantisation parameter 892 and the corresponding entry in a scaling matrix. The scaling matrix may have a size that is smaller than the size of the TB, and when applied to the TB a nearest neighbour approach is used to provide scaling values for each residual coefficient from a scaling matrix smaller in size than the TB size. The residual coefficients 836 are supplied to the entropy encoder 838 for encoding in the bitstream portion 816. Typically, the residual coefficients of each TB with at least one significant residual coefficient of the TU are scanned to produce an ordered list of values, according to a scan pattern. The scan pattern generally scans the TB as a sequence of 4×4 'sub-blocks', providing a regular scanning operation at the granularity of 4×4 sets of residual coefficients, with the arrangement of sub-blocks dependent on the size of the TB. The scan within each sub-block and the progression from one sub-block to the next typically follow a backward diagonal scan pattern. Additionally, the quantisation parameter 892 is encoded into the bitstream portion 816 using a delta QP syntax element, a slice QP provides the initial value in a given slice or subpicture, and the secondary transform index 888 is encoded in the bitstream portion 816.
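A hypothetical numpy sketch of the nearest-neighbour expansion of a small scaling matrix to TB size, as described above (the matrix size and contents are illustrative):

    import numpy as np

    def expand_scaling_matrix(small: np.ndarray, tb_w: int, tb_h: int) -> np.ndarray:
        """Nearest-neighbour expansion of a scaling matrix (e.g. 8x8)
        to the full TB size, one scale value per residual coefficient."""
        ys = (np.arange(tb_h) * small.shape[0]) // tb_h
        xs = (np.arange(tb_w) * small.shape[1]) // tb_w
        return small[np.ix_(ys, xs)]

    small = np.arange(64, dtype=np.int32).reshape(8, 8)   # stand-in 8x8 scaling list
    full = expand_scaling_matrix(small, 32, 32)           # expand to a 32x32 TB
    print(full.shape)                                     # (32, 32)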
[000161] As described above, the video encoder 800 needs access to a frame representation corresponding to the decoded frame representation seen in the video decoder. Thus, the residual coefficients 836 are passed through an inverse secondary transform module 844, operating in accordance with the secondary transform index 888 to produce intermediate inverse transform coefficients, represented by an arrow 842. The intermediate inverse transform coefficients 842 are inverse quantised by the dequantiser module 840 according to the quantisation parameter 892 to produce inverse transform coefficients, represented as 846. The dequantiser module 840 may also perform an inverse non-uniform scaling of residual coefficients using a scaling list, corresponding to the forward scaling performed in the quantiser module 834. The inverse transform coefficients 846 are passed to an inverse primary transform module 848 to produce residual samples, represented by an arrow 850, of the TU. The inverse primary transform module 848 applies DCT-2 transforms horizontally and vertically, constrained by the maximum available transform size as described with reference to the forward primary transform module 826. The types of inverse transform performed by the inverse secondary transform module 844 correspond with the types of forward transform performed by the forward secondary transform module 830. The types of inverse transform performed by the inverse primary transform module 848 correspond with the types of primary transform performed by the primary transform module 826. A summation module 852 adds the residual samples 850 and the PB 820 to produce reconstructed samples (indicated by an arrow 854) of the CU.
[000162] The reconstructed samples 854 are passed to a reference sample cache 856 and an in-loop filters module 868. The reference sample cache 856, typically implemented using static RAM on an ASIC to avoid costly off-chip memory access, provides minimal sample storage needed to satisfy the dependencies for generating intra-frame PBs for subsequent CUs in the frame. The minimal dependencies typically include a 'line buffer' of samples along the bottom of a row of CTUs, for use by the next row of CTUs, and column buffering, the extent of which is set by the height of the CTU. The reference sample cache 856 supplies reference samples (represented by an arrow 858) to a reference sample filter 860. The sample filter 860 applies a smoothing operation to produce filtered reference samples (indicated by an arrow 862). The filtered reference samples 862 are used by the intra-frame prediction module 864 to produce an intra-predicted block of samples, represented by an arrow 866. For each candidate intra prediction mode the intra-frame prediction module 864 produces a block of samples, that is 866. The block of samples 866 is generated by the module 864 using techniques such as DC, planar or angular intra prediction. The block of samples 866 may also be produced using a matrix-multiplication approach with neighbouring reference samples as input and a matrix selected from a set of matrices by the video encoder 800, with the selected matrix signalled in the bitstream 816 using an index to identify which matrix of the set of matrices is to be used by the video decoder.
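The matrix-multiplication approach reduces, in essence, to a matrix applied to a vector of neighbouring reference samples. A hypothetical numpy sketch, with random matrices standing in for the signalled set:

    import numpy as np

    def matrix_intra_predict(refs, matrices, idx, block):
        """Produce a predicted block by multiplying a selected matrix with
        the vector of neighbouring reference samples; idx is signalled in
        the bitstream to identify which matrix of the set is used."""
        pred = matrices[idx] @ refs
        return pred.reshape(block)

    refs = np.random.randint(0, 1024, size=8).astype(np.float64)   # 10-bit refs
    matrices = [np.random.randn(16, 8) for _ in range(4)]          # stand-in set
    print(matrix_intra_predict(refs, matrices, idx=2, block=(4, 4)).shape)  # (4, 4)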
[000163] The in-loop filters module 868 applies several filtering stages to the reconstructed samples 854. The filtering stages include a 'deblocking filter' (DBF) which applies smoothing aligned to the CU boundaries to reduce artefacts resulting from discontinuities. Another filtering stage present in the in-loop filters module 868 is an 'adaptive loop filter' (ALF), which applies a Wiener-based adaptive filter to further reduce distortion. A further available filtering stage in the in-loop filters module 868 is a 'sample adaptive offset' (SAO) filter. The SAO filter operates by firstly classifying reconstructed samples into one or multiple categories and, according to the allocated category, applying an offset at the sample level.
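As a hypothetical illustration of the classify-then-offset structure of SAO, using a simplified band classification (the normative classification rules are more involved):

    import numpy as np

    def sao_band_offset(samples: np.ndarray, offsets: np.ndarray,
                        bit_depth: int = 10) -> np.ndarray:
        """Classify each sample into one of 32 equal bands of the sample
        range, then add the per-band offset: a simplified band-offset SAO."""
        bands = samples >> (bit_depth - 5)        # 32 bands -> 5 index bits
        return samples + offsets[bands]

    recon = np.random.randint(0, 1024, size=(4, 4))
    offsets = np.zeros(32, dtype=recon.dtype)
    offsets[10] = 2                               # offset applied to band 10 only
    filtered = sao_band_offset(recon, offsets)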
[000164] Filtered samples, represented by an arrow 870, are output from the in-loop filters module 868. The filtered samples 870 are stored in the frame buffer 872. The frame buffer 872 typically has the capacity to store several (e.g., up to sixteen (16)) pictures and thus is stored in the memory 206. The frame buffer 872 is not typically stored using on-chip memory due to the large memory consumption required. As such, access to the frame buffer 872 is costly in terms of memory bandwidth. The frame buffer 872 provides reference frames (represented by an arrow 874) to a motion estimation module 876 and the motion compensation module 880. The reference frames 874 are output as a reconstructed frame 818 of the encoder module 542. In the example of Fig. 8, the reconstructed frame 818 is a result of operation of lossy VVC encoding, that is, due to operation of the modules 810 to 890.
[000165] The motion estimation module 876 estimates a number of 'motion vectors' (indicated as 878), each being a Cartesian spatial offset from the location of the present CB, referencing a block in one of the reference frames in the frame buffer 872. A filtered block of reference samples (represented as 882) is produced for each motion vector. The filtered reference samples 882 form further candidate modes available for potential selection by the mode selector 886. Moreover, for a given CU, the PU 820 may be formed using one reference block ('uni-predicted') or may be formed using two reference blocks ('bi-predicted'). For the selected motion vector, the motion compensation module 880 produces the PB 820 in accordance with a filtering process supportive of sub-pixel accuracy in the motion vectors. As such, the motion estimation module 876 (which operates on many candidate motion vectors) may perform a simplified filtering process compared to that of the motion compensation module 880 (which operates on the selected candidate only) to achieve reduced computational complexity. When the video encoder 800 selects inter prediction for a CU the motion vector 878 is encoded into the bitstream portion 816.
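A hypothetical full-search sketch of motion estimation by SAD minimisation, far simpler than a production search but illustrating the candidate-evaluation structure described above:

    import numpy as np

    def full_search_sad(cur, ref, cx, cy, bsize, srange):
        """Return the (dx, dy) motion vector minimising the sum of absolute
        differences over a square search range around the current block."""
        block = cur[cy:cy + bsize, cx:cx + bsize].astype(np.int64)
        best, best_mv = None, (0, 0)
        for dy in range(-srange, srange + 1):
            for dx in range(-srange, srange + 1):
                x, y = cx + dx, cy + dy
                if 0 <= x and 0 <= y and x + bsize <= ref.shape[1] \
                        and y + bsize <= ref.shape[0]:
                    sad = np.abs(block - ref[y:y + bsize, x:x + bsize]).sum()
                    if best is None or sad < best:
                        best, best_mv = sad, (dx, dy)
        return best_mv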
[000166] Although the video encoder 800 of Fig. 8 is described with reference to versatile video coding (VVC), other video coding standards or implementations may also employ the processing stages of modules 810-890. The frame data 802 (and bitstream 816) may also be read from (or written to) memory 206, the hard disk drive 210, a CD-ROM, a Blu-ray™ disk or other computer readable storage medium. Additionally, the frame data 802 (and bitstream 816) may be received from (or transmitted to) an external source, such as a server connected to the communications network 220 or a radio-frequency receiver. The bitstream 816 relates to one of the bitstreams 121 and 151. The communications network 220 may provide limited bandwidth, necessitating the use of rate control in the video encoder 800 to avoid saturating the network at times when the frame data 802 is difficult to compress.
[000167] The bitstream 816 may be constructed from one or more slices, representing spatial sections (collections of CTUs) of the frame data 802, produced by one or more instances of the video encoder 800, each producing a bitstream portion 816 and operating in a co-ordinated manner under control of the processor 205. The bitstream portion 816 may also contain one slice that corresponds to one region to be output as a collection of subpictures forming one picture, each being independently encodable and independently decodable with respect to any of the other slices or subpictures in the picture.
[000168] Figs. 9A & 9B are schematic block diagrams showing a division of a picture into regions. Fig. 9A shows a picture 900 divided into regions 910, 912, and 916, suitable for packing data from the inter-channel decorrelation (PCA-based) encoder network topology 700. Feature maps of each tensor are packed by the packer 538. The example of Figs. 9A and 9B shows three regions corresponding to mean channel, basis vectors, and coefficients.
[000169] Referring to Fig. 9B, picture 900b corresponds to the picture 900, region 910b corresponds to 910 and 912b corresponds to 912. The region 910b holds mean channel data, such as mean channel 920 for each tensor of the tensors 115. In the example of Fig. 9B the basis vectors include basis vector 922 amongst others, with the basis vectors packed into the area of the region 912b in a non-overlapping manner. The basis vectors 922 form a C×C' array of integer-quantised values, where C is the number of channels in a tensor of the tensors 115 and C' is the number of basis vectors resulting from the decomposition performed by the decomposition module 718. Where multiple tensors are decomposed into sets of basis vectors, such as when the tensors 115 comprise a plurality of tensors, the basis vectors 922 include each set of basis vectors arranged in a non-overlapping left-aligned vertical concatenation. The mean channel 920 corresponds to the channel 712 and the basis vectors 922 correspond to some of the basis vectors 720, for example.
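As an illustrative, hypothetical sketch of the mean/basis/coefficient decomposition of a C-channel tensor described above, using a plain SVD as a stand-in for the decomposition module 718:

    import numpy as np

    def decompose_tensor(t: np.ndarray, c_prime: int):
        """Split a CxHxW tensor into a mean channel, a C x C' basis array,
        and C' coefficient feature maps (HxW each), PCA-style."""
        c, h, w = t.shape
        mean = t.mean(axis=0)                          # HxW mean channel
        centred = (t - mean).reshape(c, h * w)
        u, s, vt = np.linalg.svd(centred, full_matrices=False)
        basis = u[:, :c_prime]                         # C x C' basis vectors
        coeffs = (basis.T @ centred).reshape(c_prime, h, w)
        return mean, basis, coeffs

    mean, basis, coeffs = decompose_tensor(np.random.randn(256, 16, 16), c_prime=24)
    print(mean.shape, basis.shape, coeffs.shape)       # (16,16) (256,24) (24,16,16)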
[000170] Area in the picture 900 (or 900b) that is not used to store any data may be occupied by sample values corresponding to the value '0' after application of inverse quantisation to convert from sample values back to the floating-point domain. Similarly, area in regions 910 (or 910b), 912 (or 912b), and 916 that is not used to store any data may be occupied by sample values corresponding to the value '0' after application of inverse quantisation to convert from sample values back to the floating-point domain. The region 916 holds coefficient tensor data corresponding to 724 of Fig. 7, with the coefficients for each basis vector forming a width by height feature map. The coefficient region 916 has coefficients packed as, e.g., twenty-four feature maps (applicable to basis vectors [0..23]), such as feature map 930.
[000171] Each region (910/910b, 912/912b, and 916/916b) is aligned to the CTU grid, normally 128×128. The location of each region is specified as the top-left and bottom-right cartesian CTU addresses, when addressed using cartesian co-ordinates in units of CTU size. For example, region 916 occupies CTUs from the CTU at (0, 1), corresponding to top-left luma sample location (0, 128), down to the CTU at (6, 3), corresponding to bottom-right luma sample location at (6×128 + 127, 3×128 + 127), or (895, 511).
[000172] Figs. 10A & 10B are schematic block diagrams showing a division of a picture into one region. In Fig. 10A a picture 1000 is divided into one region 1010, suitable for packing the compressed tensor 532 as may be produced by the tensor compressor 600. Fig. 10B shows a picture 1000b, corresponding to the picture 1000, in which feature maps are packed into a region 1010b corresponding to 1010. Each feature map of the compressed tensor, such as feature map 1030, is packed in the region 1010b, such that feature maps are packed in a left-to-right manner, progressing to the next row once available space on the current row has been exhausted. By virtue of the tensor compressor 600, the area required to pack the compressed tensor 532 is smaller than the area that would be required were each tensor of the tensors 115 packed into the picture 1000. The region 1010 is defined based on top-left and bottom-right cartesian CTU addresses.
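A hypothetical sketch of the left-to-right, row-by-row feature map packing described above:

    def pack_positions(n_maps: int, map_w: int, map_h: int, region_w: int):
        """Top-left position of each feature map when packed left-to-right,
        starting a new row once the current row cannot fit another map."""
        positions, x, y = [], 0, 0
        for _ in range(n_maps):
            if x + map_w > region_w:     # row exhausted: wrap to the next row
                x, y = 0, y + map_h
            positions.append((x, y))
            x += map_w
        return positions

    # e.g. ten 64x64 feature maps into a 256-wide region: four maps per row
    print(pack_positions(10, 64, 64, 256))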
[000173] Fig. 11A is a schematic block diagram showing an example structure 1100 of the bitstreams 121 and 151 holding encoded packed feature maps and associated metadata and compressed video frames. The bitstreams 121 and 151 contain groups of syntax elements each prefaced by a 'network abstraction layer' (NAL) unit header. For example, a NAL unit header precedes a sequence parameter set (SPS) 1110a.
[000174] An SPS 1110b includes signalling of the bit-depth, chroma format, and resolution of the feature picture (e.g., the picture 900 or the picture 1000) for the feature layer 121, depending on the compressor and decompressor selected at the step 1810.
[000175] In applications involving streaming, retransmission of the FCVCM SEI 1113 and the SPSs 1110a and 1110b may be performed to enable decoding to commence at an intermediate point in the bitstream. The SPS 1110a also indicates the chroma format, the bit depth, and the resolution of the frame data represented by the bitstream 121. A coded subpicture 1122, encoding subpicture 912, includes a slice header 1130 followed by slice data 1140. The slice data 1140 includes a sequence of CTUs, providing the coded representation of the frame data. Coded subpicture 1124 encodes subpicture 916, corresponding to the coefficients. The bitstream 123 includes the FCVCM decoder network indication SEI message 1113, written by the metadata encoder 544. The SEI message 1113 encodes metadata needed to convert a decoded frame into a set of tensors suitable for use by the NN part 2 166.
[000176] Fig. 11B is a schematic block diagram showing a hierarchical arrangement of 'boxes' resulting in the presentation 11200, which encapsulates a video stream and a feature stream, that is, bitstreams 151 and 121 respectively, to provide the box-encapsulated bitstream 155. The presentation 11200 extends the ISOBMFF specification (ISO/IEC 14496-12) of the year 2022 and the carriage of structured NAL unit data in ISOBMFF specification (ISO/IEC 14496-15) of the year 2022.
[000177] One instance of a MovieBox 11210 (type 'moov') or a CompressedMovieBox (type 'moov') is present in the presentation 11200, identified via 'moov'. One instance of the MovieHeaderBox 11214 (type 'mvhd') is present in the MovieBox 11210.
[000178] One or more instances of a TrackBox (type 'trak'), such as a TrackBox 11218, are present in the MovieBox 11210. The TrackBox 11218 holds information relating to the video bitstream 151. A TrackBox 11238 holds information relating to the feature bitstream 121. Other TrackBoxes may also be present, such as for audio tracks. Each TrackBox contains one TrackHeaderBox (type 'tkhd'), such as a TrackHeaderBox 11222. For the video track, presentation to the consumer of the video is intended or expected and so a 'track_in_movie' flag in the TrackHeaderBox 11222 is set to one.
[000179] For the feature stream 121, the 'track_in_movie' flag in a corresponding TrackHeaderBox, contained within the Track 11238, is set to zero as the feature track is not directly expected to be presented to a consumer of the video. Each TrackBox includes one SampleTableBox, such as SampleTableBox 11226, containing time and indexing data for samples in the track. Each TrackBox may also include one SyncSamplesBox, such as SyncSamplesBox 11228, which provides a list of sample indices which are CRA (clean random access) or IDR (instantaneous decoder refresh) points into the video sequence. In other words, each SyncSamplesBox contains a list of entry points at which decoding can commence in the corresponding track.
[000180] Each SampleTableBox includes one SampleDescriptionBox (type 'stsd'), such as the SampleDescriptionBox 11230, which defines the coding type used and where to locate the corresponding coded video sequence in one or more contained boxes of a class derived from an abstract SampleEntry class. A VvcSampleEntry (type 'vvc1') sample box is a child of a 'VisualSampleEntry' box type, which is a child of a SampleEntry box type. Each VvcSampleEntry instance includes a VvcConfigurationBox (identified via 'vvcC'). The 'compressor_name' field resulting from VisualSampleEntry is set as 'VVC Coding' (with length of 10 bytes). This hierarchy permits samples to be categorised firstly into video, audio, subtitle, or other categories, and within each category permits the specific codec used to be identified, such as AVC (Advanced Video Coding), HEVC (High Efficiency Video Coding), or VVC (Versatile Video Coding). The 'SampleEntry' class is 'abstract', meaning instances of the SampleEntry box are not permitted; however, child classes of SampleEntry may be instantiated and thus may be present in the presentation 11200. Boxes 11242a and 11242b are child types of a 'FeatureStreamSampleEntry' class, which is a child of the SampleEntry class. When the features are to be compressed using VVC, a 'VvcFeatureSampleEntry' (type 'vvcf') box type is used for the boxes 11242a and 11242b. Other child types of FeatureStreamSampleEntry are used for other feature compression schemes. For example, if a learned end-to-end compression scheme is used for feature compression, an EndToEndFeatureSampleEntry (type 'e2ef') box type may be used.
[000181] A TrackGroupBox, such as TrackGroupBox 11224 or 11239 (type 'trgr'), with one instance in each track, is used to associate multiple tracks with each other, with each instance containing an instance of a child class of TrackGroupTypeBox, indicating a type of the track group via 'track_group_type'. For a grouped video track and feature track the track_group_type may be set to 'vfgr', with the video track and the feature track having the same value for track_group_id. A separate track_group_id value enables multiple groups, and two tracks are only associated if their respective VideoFeatureGroupBox instances have the same value for track_group_id.
[000182] A MediaDataBox 11260 (type 'mdat') contained in the presentation 11200 contains the actual compressed NAL unit data for VCL-layer and non-VCL-layer NAL units of the video track and the feature track. NAL units contained in the MediaDataBox are addressed by the boxes within the SampleDescriptionBox of each track.
[000183] Fig. 19 shows a method 1900 for decoding the bitstream 123, reconstructing tensors using a tensor decompressor, and performing an NN part 2 to complete execution of a neural network. As described, the method 1900 is configured for determining a set of compatible NN part 2 candidates from a library of available NN part 2 implementations in the destination device 140 and selecting one of the compatible NN part 2 candidates for execution in the destination device 140. The bitstream 123 includes a video track (extractable as bitstream 171) and a feature track (extractable as bitstream 145), each of which is decoded to produce video frames 172 and decoded tensors 149 respectively.
[000184] The destination device 140 and the method 1900 may be implemented as one or more software application programs 233 executable within the computer system 200. The destination device 140 and the method 1900 may be effected by instructions 231 (see Fig. 2B) in the software 233 that are carried out within the computer system 200. The software instructions 231 may be formed as one or more code modules, each for performing one or more particular tasks. The bitstream 123 can comprise tensors (feature stream 121) and video (video stream 151) among other information for decoding, the tensors being related to the video, as described hereinbefore. The method 1900 begins at a decode file header step 1910.
[000185] Fig. 12 is a schematic block diagram showing an implementation 1200 of the tensor decoder 146. The decoder 1200 has a metadata decoder 1230, a picture decoder 1204, an unpacker 1214, an inverse quantiser 1218, a tensor storage module 1222 and a tensor decompressor 1250. In the arrangements described, the metadata decoder 1230 provides an implementation of the box extractor 144. In other implementations, the box extractor 144 may be external to the tensor decoder 146. At the step 1910 the metadata decoder 1230 decodes the MovieHeaderBox 11214 from the presentation 11200, indicating various creation and modification time information for the presentation and a track_ID for the main (video) track for presentation to an end user. Control in the processor 205 progresses from the step 1910 to a decode video track codec indication step 1920.
[000186] At the step 1920 the metadata decoder 1230 decodes the TrackBox 11218 and all child boxes as described with reference to Fig. 11B such that the video codec used is indicated by virtue of the type of the child boxes within a sample table SampleTableBox 11230, i.e., boxes 11234a and 11234b, which indicate use of a VVC codec in the example of Fig. 11B. In other implementations, other types of codecs may be indicated, for example, HEVC and AVC codecs. By decoding the TrackBox 11218, step 1920 operates to decode information from the bitstream 123. A type of codec for the video frames can be determined based on the decoded information. The TrackBox 11218 can be considered first information and may be in integer format or another format. Control in the processor 205 progresses from the step 1920 to a decode video track grouping indication step 1930.
[000187] At the step 1930 the metadata decoder 1230 decodes the VideoFeatureGroupBox 11225, which is an instance of a VideoFeatureGroupBox (track_group_type set to 'vftg'). The VideoFeatureGroupBox 11225 extends TrackGroupTypeBox and indicates that the containing track is one of two tracks associated with each other as a video track and a feature track. A flag, such as a 'FeatureStreamFlag' in the VideoFeatureGroupBox, can indicate that the track containing this instance of the VideoFeatureGroupBox is a video track (flag value of zero). Decoding the video track grouping indication by decoding the VideoFeatureGroupBox 11225 identifies the portions of the bitstream 143 which form the video bitstream 171 (output from the box extractor 144). Control in the processor 205 progresses from the step 1930 to a decode video frame step 1940.
[000188] Step 1920 can be considered to define a first codec used to decode the video from the bitstream 123. At the step 1940, the video decoder 170, under execution of the processor 205, decodes a frame from the video bitstream 171 according to the codec determined based on step 1920. The video decoder 170 uses the codec type or indication as determined at the step 1920 to decode a frame from the bitstream 171 and produce the decoded frame 172. The video decoder 170 may implement a VVC decoder as described with reference to Fig. 13 or an HEVC decoder. Control in the processor 205 progresses from the step 1940 to a decode feature track codec indication step 1950.
[000189] At the step 1950 the metadata decoder 1230 decodes the TrackBox 11238 and all child boxes as described with reference to Fig. 11B such that the feature codec used for the feature bitstream 121 is indicated by virtue of the type of the child boxes within a sample table SampleTableBox 11240. Boxes 11242a and 11242b indicate a codec, such as VVC or a specific end-to-end learned codec, used for feature compression. By decoding the TrackBox 11238, step 1950 operates to decode information from the bitstream 123. A type of codec for decoding the tensors (i.e., feature data) can be determined based on the decoded information. The TrackBox 11238 information can be considered second information and may be in integer format or another format. The codec determined from operation of step 1950 is capable of decoding tensors/features generated by a neural network, for example the features of the feature stream 121 generated by the neural network part 1 114. The information decoded at step 1950 is included in the bitstream 123 independently of the information decoded at the step 1920 and both sets of information can be independent of each other. In other words, the codecs used for the video and feature streams can be different. The information at each of steps 1920 and 1950 is decoded from first and second boxes respectively, the first and second boxes being different to one another.
The metadata decoder 1230 also decodes the TrackReferenceBox 11223a and a TrackReferenceTypeBox 11223b contained in the box 11223a. The TrackReferenceTypeBox 11223b has a reference_type indicative of the auxiliary nature of the feature track, such as 'auxl', and may further indicate the track as a feature track, such as using a reference_type 'axft', although a different four character code could be used for the same purpose. The track_id of the TrackBox 11218 is also decoded from the box 11223b. Control in the processor 205 progresses from the step 1950 to a decode feature track group indication step 1960.
[000190] At the step 1960 the metadata decoder 1230 decodes the VideoFeatureGroupBox 11241, which is an instance of a VideoFeatureGroupBox (track_group_type set to 'vftg'). The VideoFeatureGroupBox 11241 extends TrackGroupTypeBox and indicates that the containing track is one of two tracks associated with each other as a video track and a feature track. A flag, such as a 'FeatureStreamFlag' in the VideoFeatureGroupBox, indicates that the track containing that instance of the VideoFeatureGroupBox is a feature track (flag value of one). Decoding the feature track grouping indication by decoding the VideoFeatureGroupBox 11241 identifies the portions of the bitstream 143 which form the feature bitstream 145 (output from the box extractor 144). While the information decoded at step 1920 is independent of the information decoded at step 1950, the video data and the feature data in the bitstream are associated with one another. The video track grouping information decoded at step 1930 and the feature track grouping information decoded at step 1960 allow the video and feature (tensor) data to be associated with one another.
[000191] Steps 1930 and 1960 operate to decode information signalling the association between the video and the tensors from the bitstream. The information signalling the association between the video and the tensors is decoded in the example described as (i) information from a first box in the bitstream for video using the first codec at step 1930 and (ii) information from a second box in the bitstream for the tensors using the second codec at step 1960.
[000192] Control in the processor 205 progresses from the step 1960 to a decode feature frame step 1970.
[000193] Step 1950 can be considered to define the second codec used to decode the tensors from the bitstream. At the step 1970, the picture decoder 1204, under execution of the processor 205, according to the codec determined based on step 1950, operates to decode one packed frame from the feature sub-bitstream 145 to produce a decoded frame 1210. Operation of the picture decoder 1204 is described with reference to Fig. 13. Control in the processor 205 progresses from the step 1970 to an unpack tensors step 1980.
[000194] At the step 1980, the unpacker 1214 of the decoder 146, under execution of the processor 205, reads feature maps from the decoded frame 1210 in accordance with a packing format as described with reference to Figs. 9A & 9B and Figs. 10A & 10B. For each tensor, a number of feature maps are decoded, the number corresponding to the number of used channels in the tensor as signalled in the tensor information 1195. The channels for each tensor are unpacked as two-dimensional feature maps. The number of feature maps or channels to decode for a given tensor is decoded from the feature sub-bitstream 145 as a 'channel count'. A plurality of tensors each having one channel are arranged horizontally or vertically within a slice or subpicture of the frame. The feature maps of each previously determined type are unpacked at step 1980 by the unpacker 1214 using the allocation of the feature maps determined for the frame (e.g., 1210) in the method 1800. The allocation may be decoded from an SEI message. The unpacker 1214 outputs integer tensors 1216, where the tensors 1216 have been decoded using the tensor decoder 146. Control in the processor 205 progresses from the step 1980 to an inverse quantise tensors step 1990.
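Assuming, purely for illustration, the simplest packing in which 'channels' feature maps of size (h, w) are tiled left to right and top to bottom within the decoded frame, the unpacking of step 1980 can be sketched as follows; the actual layout follows the signalled packing format of Figs. 9A-10B and may differ:

    import numpy as np

    # Illustrative sketch of step 1980 under an assumed raster-tiled packing.
    def unpack_tensor(frame, channels, h, w):
        cols = frame.shape[1] // w             # tiles per row of the frame
        maps = []
        for c in range(channels):              # channel count from bitstream
            row, col = divmod(c, cols)
            maps.append(frame[row * h:(row + 1) * h, col * w:(col + 1) * w])
        return np.stack(maps)                  # integer tensor (channels, h, w)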
[000195] At the step 1990, the inverse quantiser 1218 of the decoder 1200, under execution of the processor 205, performs inverse quantisation on the integer tensors 1216 to produce inverse quantised tensors 1220 by applying quantisation ranges 1232 decoded from the bitstream 123 by the metadata decoder 1230, such as quantisation ranges 1196, to the determined channel count of each tensor. To perform inverse quantisation, the quantisation ranges 1196 are decoded from the SEI message 1113 by the receiver 142. Control in the processor 205 progresses from the step 1990 to a perform tensor decompression step 19100.
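Assuming a linear quantiser that mapped each channel onto bit_depth-bit integers over a signalled range [q_min, q_max], the inverse quantisation of step 1990 can be sketched as follows; the linear mapping and the default bit depth are assumptions for illustration:

    import numpy as np

    # Illustrative sketch of step 1990: map integer samples back onto the
    # signalled quantisation range (e.g. ranges 1196 from the SEI message).
    def inverse_quantise(ints, q_min, q_max, bit_depth=10):
        scale = (q_max - q_min) / float((1 << bit_depth) - 1)
        return ints.astype(np.float32) * scale + q_min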
[000196] At the step 19100, the tensor decompressor 1250, under execution of the processor 205, decompresses the tensors 1224 as input to produce decoded tensors 149. An example structure of a tensor decompressor is described in relation to Fig. 14. Control in the processor 205 progresses from the step 19100 to a perform neural network second portion step 19110 to generate a task result using a portion of a neural network.
[000197] At the step 19110, the software 233, under execution of the processor 205, performs the NN part 2 implemented as NN part 2 166. The module 166 uses the tensors 149 as input and produces the task result 151 as output. Control in the processor 205 progresses from the step 19110 to a render task result step 19120.
[000198] At the step 19120 the task result renderer 168, under execution of the processor 205, produces a video frame rendering the task result in association with the video layer. For example, the renderer 168 can present both the video frame 172 and a representation of the task result 151. The task result 151 may be represented as bounding boxes around objects of interest, a segmentation map with colour tinting of objects of interest, or the like. Alternatively, the video frame 172 can be rendered independently of the contents of the tensors, for example irrespective of whether a task result is present or without visual identifiers of the task result such as bounding boxes, colour mapping or the like. The video frame resulting from the step 19120 can be presented to the user of the destination device 140 as final output of the system 100, for example via the display 214. The method 1900 terminates and the processor 205 progresses to the next frame.
[000199] The steps 19100 to 19120 operate to perform a network second portion (NN part 2) to produce a neural network result in a case where steps 1970 and 1980 determine to decode tensors from the bitstream. The method 1900 terminates and the processor 205 may reinvoke the method 1900 upon receiving the next frame in the bitstream 123.
[000200] The box extractor 144 may parse the bitstream 123 without knowledge of the specific box structure contained therein. However, a bitstream conforming to embodiments described herein contains box types that can be anticipated by the method 1900. A generic parser that can parse the box structure may be used and, once parsing is complete, the anticipated elements, namely the video track codec indication, the feature track codec indication, and associated grouping information, are required to be present, enforcing the box structure as described with reference to Fig. 11B. Bitstreams that do not contain at least the boxes identified in Fig. 11B would not be required to be decoded by the destination device 140. If the feature stream is required but not the video stream, or vice versa, the decoded structure would indicate which steps of the method 1900 are to be implemented based on whether feature and/or video results are required.
[000201] Fig. 13 is a schematic block diagram showing functional modules of a video decoder 1300. The video decoder provides an example implementation for each of the picture decoder 1204 and the video decoder 170. The video decoder 1300 may be implemented as one or more software application programs 233 executable within the computer system 200. The video decoder 1300 may be effected by instructions 231 (see Fig. 2B) in the software 233 that are carried out within the computer system 200. The software instructions 231 may be formed as one or more code modules, each for performing one or more particular tasks. The example video decoder 1300 is operable to decode both the video bitstream 171 and the feature bitstream 145.
[000202] One of the sub-bitstreams 145 or 171 is input as a sub-bitstream 1301 to an entropy decoder module 1320. The entropy decoder module 1320 extracts syntax elements from the sub-bitstream 1301 by decoding sequences of 'bins' and passes the values of the syntax elements to other modules in the video decoder 1300. The entropy decoder module 1320 uses variable-length and fixed-length decoding to decode the SPS, PPS or slice header, and uses an arithmetic decoding engine to decode syntax elements of the slice data as a sequence of one or more bins. Each bin may use one or more 'contexts', with a context describing probability levels to be used for coding a 'one' and a 'zero' value for the bin. Where multiple contexts are available for a given bin, a 'context modelling' or 'context selection' step is performed to choose one of the available contexts for decoding the bin. The process of decoding bins forms a sequential feedback loop, where each slice may be decoded in entirety by a given entropy decoder 1320 instance.
[000203] The entropy decoder module 1320 applies an arithmetic coding algorithm, for example 'context adaptive binary arithmetic coding' (CABAC), to decode syntax elements from the input bitstream 145 or 171 (1301). The decoded syntax elements are used to reconstruct parameters within the video decoder 1300. Parameters include residual coefficients (represented by an arrow 1324), a quantisation parameter 1374, a secondary transform index 1370, and mode selection information such as an intra prediction mode (represented by an arrow 1358). The mode selection information also includes information such as motion vectors, and the partitioning of each CTU into one or more CBs. Parameters are used to generate PBs, typically in combination with sample data from previously decoded CBs.
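The context adaptation described above can be illustrated, in greatly simplified form, as follows. Real CABAC uses table-driven probability state transitions and an integer range-coding engine; the exponential update below only illustrates the idea that each context tracks and adapts an estimate of the bin probability:

    # Simplified illustration of context modelling only (not the CABAC engine).
    class BinContext:
        def __init__(self, p_one=0.5, rate=0.05):
            self.p_one = p_one        # current estimate of P(bin == 1)
            self.rate = rate          # adaptation rate (assumption)

        def update(self, bin_value):
            # Move the estimate towards the observed bin value (0 or 1).
            self.p_one += self.rate * (bin_value - self.p_one)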
[000204] The residual coefficients 1324 are passed to an inverse secondary transform module 1336 where either a secondary transform is applied or no operation is performed (bypass) according to a secondary transform index. The inverse secondary transform module 1336 produces reconstructed transform coefficients 1332. That is, the module 1336 produces primary transform domain coefficients from secondary transform domain coefficients. The reconstructed transform coefficients 1332 are input to a dequantiser module 1328. The dequantiser module 1328 performs inverse quantisation (or 'scaling') on the residual coefficients 1332, that is, in the primary transform coefficient domain, to create reconstructed intermediate transform coefficients, represented by an arrow 1340, according to the quantisation parameter 1374. The dequantiser module 1328 may also apply a scaling matrix to provide non-uniform dequantisation within the TB, corresponding to operation of the dequantiser module 840. Should use of a non-uniform inverse quantisation matrix be indicated in the sub-bitstream 1301, the video decoder 1300 reads a quantisation matrix from the received bitstream 145 or 171 (1301) as a sequence of scaling factors and arranges the scaling factors into a matrix. The inverse scaling uses the quantisation matrix in combination with the quantisation parameter to create the reconstructed intermediate transform coefficients 1340.
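A simplified, floating-point sketch of this scaling follows, using the well-known property that the quantisation step size approximately doubles for every increase of six in the quantisation parameter; a real decoder uses exact integer arithmetic and the fixed-point formulas of the applicable standard:

    import numpy as np

    # Illustrative sketch of the dequantiser module 1328 (flat scaling by
    # default; a decoded scaling matrix gives non-uniform dequantisation).
    def dequantise_block(levels, qp, scaling_matrix=None):
        step = 2.0 ** ((qp - 4) / 6.0)     # approximate step size for qp
        if scaling_matrix is None:
            scaling_matrix = np.ones_like(levels, dtype=np.float32)
        return levels.astype(np.float32) * scaling_matrix * step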
[000205] The reconstructed transform coefficients 1340 are passed to an inverse primary transform module 1344. The module 1344 transforms the coefficients 1340 from the frequency domain back to the spatial domain. The inverse primary transform module 1344 applies inverse DCT-2 transforms horizontally and vertically, constrained by the maximum available transform size as described with reference to the forward primary transform module 826. The result of operation of the module 1344 is a block of residual samples, represented by an arrow 1348. The block of residual samples 1348 is equal in size to the corresponding CB. The residual samples 1348 are supplied to a summation module 1350.
[000206] At the summation module 1350, the residual samples 1348 are added to a decoded PB (represented as 1352) to produce a block of reconstructed samples, represented by an arrow 1356. The reconstructed samples 1356 are supplied to a reconstructed sample cache 1360 and an in-loop filtering module 1388. The in-loop filtering module 1388 produces reconstructed blocks of frame samples, represented as 1392. The frame samples 1392 are written to a frame buffer 1396. The frame buffer 1396 outputs image or video frames 1397. The frames 1397 correspond to the frames 172 if the sub-bitstream 1301 is the sub-bitstream 171. The frames 1397 correspond to the decoded frames 1210 if the sub-bitstream 1301 is the sub-bitstream 145.
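The summation of the module 1350 reduces to adding the prediction block to the residual block; the clipping to the valid sample range shown below is conventional in video decoders and is included here as an assumption:

    import numpy as np

    # Illustrative sketch of the summation module 1350.
    def reconstruct(pred, residual, bit_depth=10):
        return np.clip(pred.astype(np.int32) + residual,
                       0, (1 << bit_depth) - 1)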
[000207] The reconstructed sample cache 1360 operates similarly to the reconstructed sample cache 856 of the video encoder 800. The reconstructed sample cache 1360 provides storage for reconstructed samples needed to intra predict subsequent CBs without the memory 206 (e.g., by using the data 232 instead, which is typically on-chip memory). Reference samples, represented by an arrow 1364, are obtained from the reconstructed sample cache 1360 and supplied to a reference sample filter 1368 to produce filtered reference samples indicated by arrow 1372. The filtered reference samples 1372 are supplied to an intra-frame prediction module 1376. The module 1376 produces a block of intra-predicted samples, represented by an arrow 1380, in accordance with the intra prediction mode parameter 1358 signalled in the sub-bitstream 1301 and decoded by the entropy decoder 1320. The intra prediction module 1376 supports the modes of the encoder-side module 864, including IBC and MIP. The block of samples 1380 is generated using modes such as DC, planar or angular intra prediction.
[000208] When the prediction mode of a CB is indicated to use intra prediction in the sub-bitstream 1301 (145 or 171), the intra-predicted samples 1380 form the decoded PB 1352 via a multiplexor module 1384. Intra prediction produces a prediction block (PB) of samples, which is a block in one colour component, derived using 'neighbouring samples' in the same colour component. The neighbouring samples are samples adjacent to the current block and, by virtue of preceding the current block in decoding order, have already been reconstructed. Where luma and chroma blocks are collocated, the luma and chroma blocks may use different intra prediction modes. However, the two chroma CBs share the same intra prediction mode.
[000209] When the prediction mode of the CB is indicated to be inter prediction in the sub-bitstream 1301 (145 or 171), a motion compensation module 1334 produces a block of inter-predicted samples, represented as 1338. The block of inter-predicted samples 1338 is produced using a motion vector, decoded from the sub-bitstream 145 or 171 by the entropy decoder 1320, and a reference frame index to select and filter a block of samples 1398 from the frame buffer 1396. The block of samples 1398 is obtained from a previously decoded frame stored in the frame buffer 1396. For bi-prediction, two blocks of samples are produced and blended together to produce samples for the decoded PB 1352. The frame buffer 1396 is populated with filtered block data 1392 from an in-loop filtering module 1388. As with the in-loop filtering module 868 of the video encoder 542, the in-loop filtering module 1388 applies any of the DBF, ALF and SAO filtering operations. Generally, the motion vector is applied to both the luma and chroma channels, although the filtering processes for sub-sample interpolation in the luma and chroma channels are different.
[000210] Fig. 14 is a schematic block diagram showing an implementation of an inter-channel decorrelation-based tensor decoder 1400, which may serve as the tensor decompressor 1250. When multiple tensors are to be decompressed, multiple instances of the tensor decoder 1400 may be instantiated as the tensor decompressor 1250. Tensors 1224 are received from the tensor storage 1222 and include a coefficients tensor 1414, a basis vectors tensor 1412, and a channel mean tensor 1410. A zero-centred tensor 1422 is produced by a dot product module 1420, under execution of the processor 205, by performing a dot product on the coefficients tensor 1414 and the basis vectors 1412. A summation module 1424 adds the zero-centred tensor 1422 and the channel mean 1410 to produce a tensor of the reconstructed tensors 149 as output from the tensor decoder 1400. When one type of input, such as the mean, is not required to be updated, the channel count of that type of input may be set to zero. When the channel count of an input is set to zero, the most recently received value for that input may be used instead.
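The operation of the tensor decoder 1400 can be sketched directly from Fig. 14; the shapes used below are illustrative assumptions:

    import numpy as np

    # Illustrative sketch of the tensor decoder 1400.
    #   coeffs:       (C, K)      per-channel coefficients (tensor 1414)
    #   basis:        (K, H * W)  basis vectors (tensor 1412)
    #   channel_mean: (C, 1)      per-channel means (tensor 1410)
    def reconstruct_tensor(coeffs, basis, channel_mean, h, w):
        zero_centred = coeffs @ basis            # dot product module 1420
        restored = zero_centred + channel_mean   # summation module 1424
        return restored.reshape(-1, h, w)        # one tensor of the tensors 149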
[000211] Fig. 15 is a schematic block diagram showing a tensor decompressor 1500 using a multi-scale feature reconstruction stage, which may be used in some implementations as the tensor decompressor 1250. The tensor decompressor 1500 includes a single-scale feature compression (SSFC) decompressor 1510. The SSFC decompressor 1510 receives the tensor 1224 having a reduced channel count, such as 64 channels, and passes the tensor 1224 to a convolution layer 1512, which outputs a tensor 1513 having a restored channel count, such as 256 channels. The tensor 1513 is passed to a batch normalisation module 1514 to produce a tensor 1515. The tensor 1515 is passed to a PreLU module 1516 to produce a tensor 1520.
[000212] The tensor decompressor 1500 includes an MSFR module 1530. The MSFR module operates to produce a plurality of tensors from the tensor 1520 produced by execution of step 19100, described with reference to Fig. 19, using one or more trained convolutional layers. Upsample modules 1532, 1534, and 1536 upsample the tensor 1520 horizontally and vertically by factors of two, four, and eight, respectively, to produce tensors 1533, 1535, and 1537. The tensor 1537 forms one (P'2, 1557) output from the MSFR module 1530 and is passed to a downsample module 1542. The downsample module 1542 downsamples the tensor 1537 by a factor of two horizontally and vertically to produce a tensor 1543 having the same dimensionality as the tensor 1535. The tensor 1543 is provided to a convolution layer 1548 which outputs a tensor 1549. A summation module 1554 adds the tensors 1535 and 1549 to produce tensor 1555 as an output (P'3) of the MSFR module 1530.
[000213] The tensor 1535 is passed to a downsample module 1540. The downsample module 1540 downsamples the tensor 1535 by a factor of two horizontally and vertically to produce a tensor 1541 having the same dimensionality as the tensor 1533. The tensor 1541 is provided to a convolution layer 1546 which outputs a tensor 1547. A summation module 1552 adds the tensors 1533 and 1547 to produce tensor 1553 as an output (P'4) of the MSFR module 1530.
[000214] The tensor 1533 is passed to a downsample module 1538. The downsample module 1538 downsamples the tensor 1533 by a factor of two horizontally and vertically to produce a tensor 1539 having the same dimensionality as the tensor 1520. The tensor 1539 is provided to a convolution layer 1544 which outputs a tensor 1545. A summation module 1550 adds the tensors 1520 and 1545 to produce tensor 1551 as an output (P'5) of the MSFR module 1530. The tensors 1551, 1553, 1555 and 1557 provide the tensors 149.
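The data flow of the MSFR module 1530 can be sketched as follows in PyTorch. The channel count, kernel sizes, interpolation mode and the use of average pooling for the downsample modules are assumptions for illustration; the trained layers of an actual implementation may differ:

    import torch.nn as nn
    import torch.nn.functional as F

    # Illustrative sketch of the MSFR module 1530 of Fig. 15.
    class MSFR(nn.Module):
        def __init__(self, channels=256):
            super().__init__()
            self.conv_p3 = nn.Conv2d(channels, channels, 3, padding=1)  # layer 1548
            self.conv_p4 = nn.Conv2d(channels, channels, 3, padding=1)  # layer 1546
            self.conv_p5 = nn.Conv2d(channels, channels, 3, padding=1)  # layer 1544

        def forward(self, x):                            # x is the tensor 1520
            u2 = F.interpolate(x, scale_factor=2)        # tensor 1533
            u4 = F.interpolate(x, scale_factor=4)        # tensor 1535
            p2 = F.interpolate(x, scale_factor=8)        # tensor 1537, output P'2
            p3 = u4 + self.conv_p3(F.avg_pool2d(p2, 2))  # tensor 1555, output P'3
            p4 = u2 + self.conv_p4(F.avg_pool2d(u4, 2))  # tensor 1553, output P'4
            p5 = x + self.conv_p5(F.avg_pool2d(u2, 2))   # tensor 1551, output P'5
            return p5, p4, p3, p2                        # provide the tensors 149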
[000215] Fig. 16A is a schematic block diagram showing an example implementation 1600 of the NN part 2 166 of a CNN for object detection, corresponding to a portion of a 'YOLOv3' network excluding the 'DarkNet-53' backbone portion. The NN part 2 1600 of Fig. 16A can be used when the CNN backbone (NN part 1 114) is implemented as in Fig. 3A for example. Depending on the task to be performed in the destination device 140, different networks may be substituted for the NN part 2 166. Incoming tensors 149 are separated into the tensor of each layer (i.e., tensors 1610, 1620, and 1634). The tensor 1610 is passed to a CBL module 1612 to produce tensor 1614. The tensor 1614 is passed to a detection module 1616 and an upscaler module 1622. The detection module outputs bounding boxes 1618, in the form of a detection tensor. The bounding boxes 1618 are passed to a non-maximum suppression (NMS) module 1648.
[000216] To produce bounding boxes addressing co-ordinates in the original video data 113, prior to resizing for the NN part 1 of the network 114, scaling by the original video width and height is performed at the upscaler module 1622. The upscaler module 1622 receives the tensor 1614 and the tensor 1620 and produces an upscaled tensor 1624, which is passed to a CBL module 1626. The CBL module 1626 produces a tensor 1628 as output. The tensor 1628 is passed to a detection module 1630 and an upscaler module 1636. The detection module 1630 produces a detection tensor 1632, which is supplied to the NMS module 1648. The upscaler module 1636 is another instance of the module 1622. The upscaler module 1636 receives the tensor 1628 and the tensor 1634 and outputs an upscaled tensor 1638. The upscaled tensor 1638 is passed to a CBL module 1640, which outputs a tensor 1642 to a detection module 1644. The detection module 1644 produces a detection tensor 1646, which is supplied to the NMS module 1648.
[000217] The CBL modules 1612, 1626, and 1640 each contain a concatenation of five CBL modules (e.g., CBL module 360 shown in Fig. 3D). The upscaler modules 1622 and 1636 are each instances of an upscaler module 1660 as shown in Fig. 16B. The module 1648 receives the tensors 1618, 1632 and 1646 and outputs the task result 151.
[000218] As shown in Fig. 16B, the upscaler module 1660 accepts a tensor 1662 (for example the tensor 1614 of Fig. 16A) as an input. The tensor 1662 is passed to a CBL module 1666 (having the structure of the module 360) to produce a tensor 1668. The tensor 1668 is passed to an upsampler 1670 to produce an upsampled tensor 1672. A concatenation module 1674 produces a tensor 1676 by concatenating the upsampled tensor 1672 with a second input tensor 1664 (for example the tensor 1620 input to the upscaler 1622 in Fig. 16A).
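The upscaler module 1660 can be sketched in PyTorch as follows; the CBL block is sketched as the usual convolution, batch normalisation and leaky ReLU sequence of the module 360, and the channel counts, kernel size and leaky-ReLU slope are assumptions for illustration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CBL(nn.Module):                      # sketch of the module 360
        def __init__(self, c_in, c_out, k=1):
            super().__init__()
            self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
            self.bn = nn.BatchNorm2d(c_out)

        def forward(self, x):
            return F.leaky_relu(self.bn(self.conv(x)), 0.1)

    class Upscaler(nn.Module):                 # sketch of the module 1660
        def __init__(self, c_in, c_out):
            super().__init__()
            self.cbl = CBL(c_in, c_out)        # module 1666

        def forward(self, x, skip):            # tensors 1662 and 1664
            up = F.interpolate(self.cbl(x), scale_factor=2)  # upsampler 1670
            return torch.cat([up, skip], dim=1)  # module 1674, tensor 1676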
[000219] The detection modules 1616, 1630, and 1644 are instances of a detection module 1680 as shown in Fig. 16C. The detection module 1680 receives a tensor 1682. The tensor 1682 is input to a CBL module 1684 having the structure of the module 360. The CBL module 1684 generates a tensor 1686. The tensor 1686 is passed to a convolution module 1688, which implements a detection kernel to generate a result 1690. In some arrangements, the detection kernel applies a 1 × 1 kernel to produce the output on feature maps at each of the three layers of the tensor. The detection kernel is 1 × 1 × (B × (5 + C)), where B is the number of bounding boxes a particular cell can predict, typically three (3), and C is the number of classes, which may be eighty (80), resulting in a kernel size of two hundred and fifty-five (255) detection attributes (i.e. tensor 1690). The constant “5” represents four boundary box attributes (box centre x, y and size scale x, y) and one object confidence level (“objectness”). The result of a detection kernel has the same spatial dimensions as the input feature map, but the depth of the output corresponds to the detection attributes. The detection kernel is applied at each layer, typically three layers, resulting in a large number of candidate bounding boxes. A process of non-maximum suppression is applied by the NMS module 1648 to the resulting bounding boxes to discard redundant boxes, such as overlapping predictions at similar scale, resulting in a final set of bounding boxes as output for object detection.
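The depth arithmetic can be checked with a short sketch (assuming PyTorch; the input channel count of 1024 and the 13×13 feature map are illustrative values, not taken from the specification). With B = 3 and C = 80 as stated above, the 1 × 1 detection kernel yields 3 × (5 + 80) = 255 output channels while the spatial dimensions are unchanged:

```python
import torch
import torch.nn as nn

B, C = 3, 80            # boxes per cell and object classes, values from the text
depth = B * (5 + C)     # 4 box attributes + 1 objectness per box -> 255

# Detection kernel of convolution module 1688: a 1x1 convolution over the feature map.
detect = nn.Conv2d(in_channels=1024, out_channels=depth, kernel_size=1)

fmap = torch.randn(1, 1024, 13, 13)   # example input feature map
out = detect(fmap)
print(out.shape)                      # torch.Size([1, 255, 13, 13]): same spatial size,
                                      # depth equal to the detection attributes
```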
[000220] Fig. 17 is a schematic block diagram showing a head portion 1700 of a CNN. The head portion 1700 can be implemented as the NN part 2 166 where the NN part 1 114 is implemented as the backbone 400 for example. The head portion 1700 forms part of an overall
network known as ‘Faster RCNN’ and includes a feature network (i.e., backbone portion 400), a region proposal network, and a detection network. Input to the head portion 1700 are the tensors 149, which include the P2-P6 layer tensors 1710, 1712, 1714, 1716, and 1718. The P2-P5 layer tensors 1710, 1712, 1714, and 1716 correspond to the P2 to P5 outputs 477, 475, 473, and 471 of Fig. 4. The P2-P6 tensors 1710, 1712, 1714, 1716, and 1718 are input to a region proposal network (RPN) head module 1720. The P6 tensor 1718 is produced by a max pool module 1742, operating on the P5 tensor 1716 to perform a 2×2 max pooling operation.
[000221] The RPN head module 1720 performs a convolution on the input tensors, generating an intermediate tensor. The intermediate tensor is fed into two subsequent sibling layers, (i) one for classifications and (ii) one for bounding box, or ‘region of interest’ (ROI), regression. A resultant output is classification and bounding boxes 1722. The classification and bounding boxes 1722 are passed to an NMS module 1724. The NMS module 1724 prunes out redundant bounding boxes by removing overlapping boxes with a lower score to produce pruned bounding boxes 1726.
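A minimal sketch of this two-sibling-layer arrangement is given below, assuming PyTorch; the channel width of 256, the anchor count of 3, and the ReLU on the intermediate tensor are assumptions for illustration and are not specified in the text:

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of RPN head module 1720: a shared convolution produces an
    intermediate tensor, then two sibling 1x1 layers produce
    (i) classification scores and (ii) ROI box regression deltas."""
    def __init__(self, channels=256, num_anchors=3):   # assumed values
        super().__init__()
        self.shared = nn.Conv2d(channels, channels, 3, padding=1)  # intermediate tensor
        self.cls = nn.Conv2d(channels, num_anchors, 1)             # (i) classification
        self.reg = nn.Conv2d(channels, num_anchors * 4, 1)         # (ii) ROI regression
        self.act = nn.ReLU()

    def forward(self, feature):
        t = self.act(self.shared(feature))
        return self.cls(t), self.reg(t)   # classification and bounding boxes 1722
```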
[000222] The bounding boxes 1726 are input to a region of interest (ROI) pooler 1728. The ROI pooler 1728 uses some of the layer tensors of the tensor 149 (described further hereafter) and the bounding boxes 1726 to produce fixed-size feature maps from various input size maps using max pooling operations. In the max pooling operation a subsampling takes the maximum value in each group of input values to produce one output value in the output tensor.
[000223] Input to the ROI pooler 1728 are the P2-P5 feature maps 1710, 1712, 1714, and 1716, and the region of interest proposals 1726. Each proposal (ROI) from 1726 is associated with a portion of the feature maps (1710-1716) to produce a fixed-size map. The fixed-size map is of a size independent of the underlying portion of the feature map 1710-1716. One of the feature maps 1710-1716 is selected such that the resulting cropped map has sufficient detail, for example, according to the following rule: floor(4 + log2(sqrt(box_area) / 224)), where 224 is the canonical box size. The ROI pooler 1728 operates to crop incoming feature maps according to the proposals 1726, producing a tensor 1730.
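The level-selection rule can be written directly as a small function. This is a sketch of the stated formula only; the clamping of the result to the available P2-P5 range is an assumption, since the text gives the rule but not its behaviour for extreme box sizes:

```python
import math

def roi_level(box_area, canonical=224, k0=4, k_min=2, k_max=5):
    """Select the feature-map level (P2-P5) for a proposal using the rule
    from the text: floor(4 + log2(sqrt(box_area) / 224)).
    Clamping to [P2, P5] is an assumption for illustration."""
    k = math.floor(k0 + math.log2(math.sqrt(box_area) / canonical))
    return max(k_min, min(k_max, k))

# A 224x224 proposal (the canonical size) maps to P4; a 112x112 proposal to P3.
assert roi_level(224 * 224) == 4
assert roi_level(112 * 112) == 3
```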
[000224] The tensor 1730 is fed into a fully connected (FC) neural network head 1732. The FC head 1732 performs two fully connected layers to produce a class score and bounding box predictor delta tensor 1734. The class score is generally an 80-element tensor, each element corresponding to a prediction score for the corresponding object category. The bounding box prediction deltas tensor is an 80×4 = 320 element tensor, containing bounding boxes for the
corresponding object categories. Final processing is performed by an output layers module 1736, receiving the tensor 1734 and performing a filtering operation to produce a filtered tensor 1738. Low-scoring (low classification) objects are removed from further consideration. A non-maximum suppression module 1740 receives the filtered tensor 1738 and removes overlapping bounding boxes by removing the overlapped box with a lower classification score, resulting in an inference output tensor 1742, corresponding to the tensor 151.
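The greedy suppression behaviour described for the NMS modules 1648, 1724, and 1740 (keep the highest-scoring box, discard overlapping boxes with lower scores) may be sketched as follows; the IoU overlap criterion and the 0.5 threshold are illustrative assumptions, as the text does not fix a particular overlap measure or threshold:

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box and
    discard overlapping lower-scoring boxes. boxes: list of (x1, y1, x2, y2)."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)             # highest-scoring remaining box
        keep.append(best)
        order = [i for i in order       # drop boxes overlapping the kept box
                 if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```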
[000225] It should be noted that although the source device 110 and the destination device 140 are described with reference to the video source 112 comprising video and image data, other types of content such as audio data or textual data may also be supplied as input to neural networks applicable to such types of input, and the resulting intermediate feature maps may be compressed and decompressed by the modules 116 and 146 with suitable encoder and decoder network topologies.
[000226] The examples of methods 1800 and 1900 relate to where both a neural network result and the associated video data are required. For example, a human may wish to view the frame(s) relating to a result found by operation of the combined NN part 1 114 and NN part 2 166. For example, if presence of a person has been detected in a security application, a security guard may wish to view the relevant portion of the video. In some implementations, the video stream and associated information may only be encoded in the bitstream 155 (or decoded from the bitstream 143) if required. In other implementations, the feature stream and associated information may only be encoded in the bitstream 155 (or decoded from the bitstream 143) if required.
[000227] It should be noted that although the source device 110 and the destination device 140 are described with reference to the video source 112 comprising video and image-related data, other types of content such as audio data or textual data may also be supplied as input to neural networks applicable to such types of input, and the resulting intermediate feature maps may be compressed and decompressed by the modules 116 and 146 with suitable encoder and decoder network topologies. Based on the type of information being used, the encoded data may take a different form to a bitstream received from the communications channel 130; for example, the encoded data could be obtained from a file server and accessed using operating system ‘read’ and ‘seek’ methods to parse the box structure contained therein.
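As an illustration of such read/seek parsing, a sketch of walking top-level ISOBMFF boxes is given below. The 4-byte size plus 4-byte type box header, with a 64-bit largesize when size equals 1, is standard ISOBMFF; the file name is hypothetical and special cases such as ‘uuid’ extended types are omitted:

```python
import struct

def walk_boxes(path):
    """Walk top-level ISOBMFF boxes using plain read/seek. Each box starts
    with a 32-bit size and a 4-character type; size == 1 means a 64-bit
    largesize follows; size == 0 means the box extends to end of file."""
    with open(path, 'rb') as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            size, box_type = struct.unpack('>I4s', header)
            consumed = 8
            if size == 1:                        # 64-bit largesize follows the type
                size = struct.unpack('>Q', f.read(8))[0]
                consumed = 16
            yield box_type.decode('ascii', 'replace'), size
            if size == 0:                        # box runs to end of file
                break
            f.seek(size - consumed, 1)           # skip payload to the next box

# Hypothetical file; prints e.g. ('ftyp', 32), ('moov', ...), ('mdat', ...)
for box in walk_boxes('example.mp4'):
    print(box)
```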
[000228] The arrangements described are applicable to the computer and data processing industries and particularly for the digital signal processing for the encoding and decoding of signals such as video and image signals, achieving high compression efficiency.
[000229] Some embodiments described herein can include both a video layer and a feature layer in a bitstream, with the layers associated with each other and with the codec selection signalled independently for the video layer and the feature layer. Including both a video layer and a feature layer in a bitstream provides a mechanism by which a human may review or interpret a CNN result. Having the codec selection signalled independently for the video layer and the feature layer allows greater flexibility and adaptability to use different codecs as encoding methods are updated, encoding requirements are altered, and/or as industry trends change. In signalling codec selection independently, the arrangements described provide a mechanism by which both feature and video streams can be presented, if required, without relying on a particular encoding standard. The examples described herein relate to video coding standards such as VVC and HEVC; however, other encoding methods could potentially be used, for example for the feature stream, such as coding techniques based on end-to-end learned compression techniques.
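The independent signalling can be illustrated with a small decoder-side dispatch. The codec identifiers and decoder names below are hypothetical placeholders, not values defined by the specification or by any standard; the point shown is only that the two signals are parsed separately, so each layer can select its codec without constraining the other:

```python
# Hypothetical codec identifiers mapped to decoder stubs, for illustration only.
VIDEO_DECODERS = {'hvc1': 'HEVC decoder', 'vvc1': 'VVC decoder'}
TENSOR_DECODERS = {'hvc1': 'HEVC decoder', 'vvc1': 'VVC decoder',
                   'e2el': 'end-to-end learned decoder'}

def select_decoders(first_information, second_information):
    """The first information selects the video-layer codec and the second
    information selects the feature-layer codec, independently of each other."""
    video_dec = VIDEO_DECODERS[first_information]
    tensor_dec = TENSOR_DECODERS[second_information]
    return video_dec, tensor_dec

# e.g. VVC for the video layer and a learned codec for the feature layer
print(select_decoders('vvc1', 'e2el'))
```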
[000230] Some of the techniques described herein use features of the ISO Base Media File Format (ISOBMFF). The box structure of ISOBMFF, adapted to allow use of independent codecs for video and feature layers, is described herein. Additionally, features such as the use of track boxes and track grouping provide further flexibility in cases where the NN result may also require viewing by a human. Moreover, the disclosed box structure for ISOBMFF may be used by other standards which convey feature and video data, such as those developed by organizations such as 3GPP.
[000231] The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.
In the context of this specification, the word “comprising” means “including principally but not necessarily solely” or “having” or “including”, and not “consisting only of”. Variations of the word “comprising”, such as “comprise” and “comprises”, have correspondingly varied meanings.
Claims (17)
- 1. A method of decoding tensors and video from a bitstream, the tensors being related to the video, the method comprising: determining a first codec for the video based on first information decoded from the bitstream; determining a second codec for the tensors based on second information decoded from the bitstream; decoding the video according to the first codec; decoding the tensors according to the second codec; and wherein the first information and the second information are independent of each other and the video and the tensors are associated with each other.
- 2. A method according to claim 1, wherein the first information is decoded from a first box in the bitstream and the second information is decoded from a second box different from the first box in the bitstream.
- 3. A method according to claim 1, wherein the video is rendered independently of the contents of the tensors.
- 4. A method according to claim 1, wherein the first codec is one of an AVC (Advanced Video Coding), a HEVC (High Efficiency Video Coding), or a VVC (Versatile Video Coding) codec.
- 5. A method according to claim 1, wherein the second codec is one of a HEVC (High Efficiency Video Coding), a VVC (Versatile Video Coding) codec, or an end-to-end learned codec.
- 6. A method according to claim 1, wherein the tensors were generated by performing at least one convolution on the video.
- 7. A method according to claim 1, further comprising decoding information signalling the association between the video and the tensors from the bitstream.
- 8. A method according to claim 7, wherein the information signalling the association between the video and the tensors is decoded as (i) information from a first box in the bitstream for the video using the first codec and (ii) information from a second box in the bitstream for the tensors using the second codec.
- 9. A method of encoding tensors and video into a bitstream, the tensors being related to the video, the method comprising: encoding first information into the bitstream, the first information being used to determine a first codec for the video; encoding second information into the bitstream, the second information being used to determine a second codec for the tensors; encoding the video into the bitstream according to the first codec; encoding the tensors into the bitstream according to the second codec; and wherein the first information and the second information are independent of each other and the video and the tensors are associated with each other.
- 10. A method according to claim 9, wherein the first information is encoded into a first box in the bitstream and the second information is encoded into a second box different from the first box in the bitstream.
- 11. A method according to claim 9, wherein the first codec is one of an AVC (Advanced Video Coding), a HEVC (High Efficiency Video Coding), or a VVC (Versatile Video Coding) codec.
- 12. A method according to claim 9, wherein the second codec is one of a HEVC (High Efficiency Video Coding), a VVC (Versatile Video Coding) codec, or an end-to-end learned codec.
- 13. A method according to claim 9, wherein the tensors were generated by performing at least one convolution on the video.
- 14. A decoder for decoding tensors and video from a bitstream, the tensors being related to the video, the decoder configured for: determining a first codec for the video based on first information decoded from the bitstream; determining a second codec for the tensors based on second information decoded from the bitstream; decoding the video according to the first codec; decoding the tensors according to the second codec; and wherein the first information and the second information are independent of each other and the video and the tensors are associated with each other.
- 15. A non-transitory computer-readable storage medium which stores a program for executing a method of decoding tensors and video from a bitstream, the tensors being related to the video, the method comprising: determining a first codec for the video based on first information decoded from the bitstream; determining a second codec for the tensors based on second information decoded from the bitstream; decoding the video according to the first codec; decoding the tensors according to the second codec; and wherein the first information and the second information are independent of each other and the video and the tensors are associated with each other.
- 16. An encoder for encoding tensors and video into a bitstream, the tensors being related to the video, the encoder configured for: encoding first information into the bitstream, the first information being used to determine a first codec for the video; encoding second information into the bitstream, the second information being used to determine a second codec for the tensors; encoding the video into the bitstream according to the first codec; encoding the tensors into the bitstream according to the second codec; and wherein the first information and the second information are independent of each other and the video and the tensors are associated with each other.
- 17. A computer-implemented non-transitory computer-readable storage medium which stores a program for executing a method of encoding tensors and video into a bitstream, the tensors being related to the video, the method comprising: encoding first information into the bitstream, the first information being used to determine a first codec for the video; encoding second information into the bitstream, the second information being used to determine a second codec for the tensors; encoding the video into the bitstream according to the first codec; encoding the tensors into the bitstream according to the second codec; and wherein the first information and the second information are independent of each other and the video and the tensors are associated with each other.
CANON KABUSHIKI KAISHA
Patent Attorneys for the Applicant
Spruson & Ferguson
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| AU2024200305A AU2024200305A1 (en) | 2024-01-17 | 2024-01-17 | Method, apparatus and system for encoding and decoding tensors and video |
| PCT/AU2024/051310 WO2025151918A1 (en) | 2024-01-17 | 2024-12-06 | Method, apparatus and system for encoding and decoding tensors and video |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| AU2024200305A AU2024200305A1 (en) | 2024-01-17 | 2024-01-17 | Method, apparatus and system for encoding and decoding tensors and video |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| AU2024200305A1 true AU2024200305A1 (en) | 2025-07-31 |
Family
ID=96470438
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| AU2024200305A Pending AU2024200305A1 (en) | 2024-01-17 | 2024-01-17 | Method, apparatus and system for encoding and decoding tensors and video |
Country Status (2)
| Country | Link |
|---|---|
| AU (1) | AU2024200305A1 (en) |
| WO (1) | WO2025151918A1 (en) |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12170779B2 (en) * | 2020-04-09 | 2024-12-17 | Nokia Technologies Oy | Training a data coding system comprising a feature extractor neural network |
| AU2021202142A1 (en) * | 2021-04-07 | 2022-10-27 | Canon Kabushiki Kaisha | Tool selection for feature map encoding vs regular video encoding |
| WO2022224113A1 (en) * | 2021-04-23 | 2022-10-27 | Nokia Technologies Oy | Method, apparatus and computer program product for providing finetuned neural network filter |
| US12219140B2 (en) * | 2021-11-09 | 2025-02-04 | Tencent America LLC | Method and apparatus for video coding for machine vision |
| EP4453868A1 (en) * | 2021-12-23 | 2024-10-30 | OP Solutions, LLC | Intelligent multi-stream video coding for video surveillance |
- 2024-01-17 AU AU2024200305A patent/AU2024200305A1/en active Pending
- 2024-12-06 WO PCT/AU2024/051310 patent/WO2025151918A1/en active Pending
Also Published As
| Publication number | Publication date |
|---|---|
| WO2025151918A1 (en) | 2025-07-24 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| AU2024200562B2 (en) | Tool selection for feature map encoding vs regular video encoding | |
| US20250254366A1 (en) | Method, apparatus and system for encoding and decoding a tensor | |
| AU2025201260A1 (en) | Method, apparatus and system for encoding and decoding a tensor | |
| AU2025259870A1 (en) | Method, apparatus and system for encoding and decoding a tensor | |
| US20250254307A1 (en) | Method, apparatus and system for encoding and decoding a tensor | |
| AU2025204902A1 (en) | Method, apparatus and system for encoding and decoding a tensor | |
| US20250254339A1 (en) | Method, apparatus and system for encoding and decoding a tensor | |
| AU2022252784B2 (en) | Method, apparatus and system for encoding and decoding a tensor | |
| AU2024200305A1 (en) | Method, apparatus and system for encoding and decoding tensors and video | |
| AU2023200118B2 (en) | Method, apparatus and system for encoding and decoding a tensor | |
| AU2025217390A1 (en) | Method, apparatus and system for encoding and decoding a tensor | |
| AU2024202416A1 (en) | Method, apparatus and system for encoding and decoding a plurality of tensors | |
| AU2024202212A1 (en) | Method, apparatus and system for encoding and decoding a plurality of tensors | |
| AU2024202399A1 (en) | Method, apparatus and system for encoding and decoding one or more tensors | |
| EP4525443A1 (en) | Method and apparatus for coding image including text, and method and apparatus for decoding image including text | |
| AU2024202406A1 (en) | Method, apparatus and system for encoding and decoding one or more tensors | |
| AU2023202315A1 (en) | Method, apparatus and system for encoding and decoding a tensor | |
| AU2023248076A1 (en) | Method, apparatus and system for encoding and decoding a tensor | |
| AU2024275219A1 (en) | Method, apparatus and system for encoding and decoding a tensor | |
| AU2023248075A1 (en) | Method, apparatus and system for encoding and decoding a tensor | |
| AU2023229477A1 (en) | Method, apparatus and system for encoding and decoding a tensor | |
| WO2023197033A1 (en) | Method, apparatus and system for encoding and decoding a tensor |