
WO2025230719A1 - Streaming of multiple-perspective audio and video data - Google Patents

Streaming of multiple-perspective audio and video data

Info

Publication number
WO2025230719A1
WO2025230719A1 (PCT/US2025/024727; US2025024727W)
Authority
WO
WIPO (PCT)
Prior art keywords
mpav
camera
view
class
camera views
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2025/024727
Other languages
English (en)
Inventor
Guan-Ming Su
Sejin OH
Janusz Klejsa
Kristofer Kjorling
Ludvig Carl Henrik NORING
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dolby International AB
Dolby Laboratories Licensing Corp
Original Assignee
Dolby International AB
Dolby Laboratories Licensing Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dolby International AB, Dolby Laboratories Licensing Corp filed Critical Dolby International AB
Publication of WO2025230719A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25 Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/266 Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
    • H04N21/2662 Controlling the complexity of the video stream, e.g. by scaling the resolution or bitrate of the video stream based on the client capabilities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 Server components or server architectures
    • H04N21/218 Source of audio or video content, e.g. local disk arrays
    • H04N21/21805 Source of audio or video content, e.g. local disk arrays enabling multiple viewpoints, e.g. using a plurality of cameras
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25 Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/262 Content or additional data distribution scheduling, e.g. sending additional data at off-peak times, updating software modules, calculating the carousel transmission frequency, delaying a video stream transmission, generating play-lists
    • H04N21/26258 Content or additional data distribution scheduling, e.g. sending additional data at off-peak times, updating software modules, calculating the carousel transmission frequency, delaying a video stream transmission, generating play-lists for generating a list of items to be played back in a given order, e.g. playlist, or scheduling item distribution according to such list
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431 Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312 Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H04N21/4316 Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations for displaying supplemental content in a region of the screen, e.g. an advertisement in a separate window
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45 Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/462 Content or additional data management, e.g. creating a master electronic program guide from data received from the Internet and a Head-end, controlling the complexity of a video stream by scaling the resolution or bit-rate based on the client capabilities
    • H04N21/4621 Controlling the complexity of the content stream or additional data, e.g. lowering the resolution or bit-rate of the video stream for a mobile client with a small screen
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/472 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60 Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
    • H04N21/65 Transmission of management data between client and server
    • H04N21/658 Transmission by the client directed to the server
    • H04N21/6587 Control parameters, e.g. trick play commands, viewpoint selection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81 Monomedia components thereof
    • H04N21/816 Monomedia components thereof involving special video data, e.g. 3D video
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/84 Generation or processing of descriptive data, e.g. content descriptors
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Definitions

  • the immersive experience enabled by these cameras can enhance user satisfaction for broadcast performances (e.g., theater and concerts) and events (e.g., graduation and award ceremonies), as well as online meetings.
  • users do not just passively consume content, but may interactively traverse the content along many different paths from a perspective of their choice, with different users being able to observe different perspectives of the same content.
  • the client player devices are expected to download the requested perspectives, and the content distribution networks (CDNs) are expected to generate perspectives on demand while ensuring low latency and support for a variety of devices for capture and consumption of the corresponding MPAV content.
  • CDNs: content distribution networks
  • Various embodiments disclosed herein provide methods and apparatus for streaming MPAV content.
  • the perspective analysis and camera-view clustering and categorization are used to enable the player device to build a graphical user interface suitable for navigating through various camera-view selections and are further used to assign different encoding parameters to different MPAV bitstreams and implement different downloading strategies under different playback modes.
  • Some examples also provide an editing mode that can beneficially be used to grow the MPAV content and provide it with re-shareable and re-editable capabilities.
  • an apparatus for streaming multiple-perspective audio and video (MPAV), comprising: at least one processor; and at least one memory including program code, wherein the at least one memory and the program code are configured to, with the at least one processor, cause the apparatus at least to: group a plurality of camera views representing the MPAV into a predetermined number of clusters; for a selected camera view of the plurality of camera views, classify other camera views of the plurality of camera views into one of a respective first class and a respective second class, each of the other camera views classified into the respective first class being from a different respective cluster than a cluster into which the selected camera view has been grouped, the second respective class including remaining ones of the other camera views; and when requested by a player device, stream a camera view classified into the respective first class using a first type of bitstream; stream a camera view classified into the respective second class using a second type of bitstream; and stream the selected camera view using a third type of bitstream, wherein the first, second, and third types of bitstream differ from one another in at least one of a first characteristic and a second characteristic.
  • an apparatus for streaming multiple-perspective audio and video (MPAV), comprising: at least one processor; and at least one memory including program code, wherein the at least one memory and the program code are configured to, with the at least one processor, cause the apparatus at least to: receive a media presentation description (MPD) of the MPAV stored in a storage container accessible via a server device; build a graphical user interface (GUI) based on the MPD, the GUI being configured to: enable a user to select a camera view from a plurality of camera views representing the MPAV and further select a playback mode from a plurality of MPAV playback modes; and present a predetermined number of clusters into which different camera views of the plurality of camera views are grouped; upon receiving an indication of the selected camera view through the GUI, determine classification of other camera views of the plurality of camera views into one of a respective first class and a respective second class, each of the other camera views classified into the respective first class being from a different respective cluster than a cluster into which the selected camera view has been grouped, the second respective class including remaining ones of the other camera views.
  • MPD: media presentation description
  • GUI: graphical user interface
  • a method for streaming multiple-perspective audio and video (MPAV) from a server device comprising: grouping a plurality of camera views representing the MPAV into a predetermined number of clusters; for a selected camera view of the plurality of camera views, classifying other camera views of the plurality of camera views into one of a respective first class and a respective second class, each of the other camera views classified into the respective first class being from a different respective cluster than a cluster into which the selected camera view has been grouped, the second respective class including remaining ones of the other camera views; and when requested by a player device, streaming a camera view classified into the respective first class using a first type of bitstream; streaming a camera view classified into the respective second class using a second type of bitstream; and streaming the selected camera view using a third type of bitstream, wherein the first, second, and third types of bitstream differ from one another in at least one of a first characteristic and a second characteristic.
  • MPAV: multiple-perspective audio and video
  • GUI: graphical user interface
  • FIG. 1 depicts an example process for a video/audio/image delivery pipeline.
  • FIG. 2 is a block diagram illustrating a configuration for capturing MPAV content for the delivery pipeline of FIG. 1 according to one example.
  • FIGS. 3A-3D pictorially illustrate four different views captured using the configuration of FIG. 2 according to one example.
  • FIG. 4 is a block diagram illustrating a communication system configured to support MPAV transmissions using DASH and ISO BMFF according to some examples.
  • FIG. 5 is a block diagram illustrating a data structure of an MPAV container used in the communication system of FIG. 4 according to some embodiments.
  • FIG. 6 is a block diagram illustrating a high-level hierarchical data model that can be used in DASH according to some examples.
  • FIGS. 7A-7B are block diagrams pictorially illustrating example improvements provided by the use of the QUIC protocol in the delivery pipeline of FIG. 1 according to some examples.
  • FIG. 8 is a block diagram illustrating a configuration for capturing MPAV content for the delivery pipeline of FIG. 1 according to another example.
  • FIGS. 9-10 pictorially illustrate a process of sorting the camera views of the configuration of FIG. 8 into main, side, and rare views according to one example.
  • FIG. 11 pictorially illustrates a first nonlimiting example of the broader interpretation of the term “perspective” for possible implementation in the delivery pipeline of FIG. 1.
  • FIG. 12 pictorially illustrates a second nonlimiting example of the broader interpretation of the term “perspective” for possible implementation in the delivery pipeline of FIG. 1.
  • FIGS. 13A-13B pictorially illustrate a third nonlimiting example of the broader interpretation of the term “perspective” for possible implementation in the delivery pipeline of FIG. 1.
  • FIG. 14 is a block diagram illustrating four types of bitstream that can be used for delivering main, side, and rare views in the delivery pipeline of FIG. 1 according to some examples.
  • FIGS. 15A-15B are block diagrams illustrating a single main view playback mode used in the delivery pipeline of FIG. 1 according to some examples.
  • FIG. 16 is a block diagram illustrating the switching from the main view to a side view in the bitstream configuration of FIG. 15A according to one example.
  • FIG. 17 is a block diagram illustrating the switching from the main view to a rare view in the bitstream configuration of FIG. 15A according to one example.
  • FIG. 18 is a block diagram illustrating the loop viewing playback mode according to one example.
  • FIGS. 19A-19C pictorially illustrate a GUI that can be used at the user side according to one example.
  • FIG. 20 is a block diagram illustrating a configuration of the GUI of FIG. 19 showing the editing history corresponding to a primary view according to one example.
  • FIGS. 21A-21C are block diagrams illustrating layout and/or presentation options corresponding to the GUI configuration of FIG. 20 according to some examples.
  • FIG. 22 pictorially illustrates a configuration of the GUI of FIG. 19 according to another example.
  • FIG. 23 is a block diagram illustrating different priorities assigned to different MPAV bitstreams according to one example.
  • FIG. 24 is a flowchart illustrating an MPAV workflow implemented at the server side according to some examples.
  • FIG. 25 is a flowchart illustrating a workflow implemented with a server side and a playback side during MPAV playback according to some examples.
  • FIG. 26 is a flowchart illustrating client-server communications and the corresponding processing operations performed at the client that can be implemented in the MPAV system according to various examples.
  • FIG. 27 is a block diagram illustrating a DASH configuration for grouping four perspective components belonging to a single representation within an MPD according to one example.
  • FIG. 28 is a block diagram illustrating a DASH configuration for defining two groups of multi-perspective video experiences according to one example.
  • FIG. 29 is a block diagram illustrating a DASH configuration for signaling the hierarchy of data structure in the MPD according to one example.
  • FIG. 30 is a block diagram illustrating a DASH configuration for using MPER descriptors at the Adaptation set level according to one example.
  • FIG. 31 is a block diagram illustrating a process of adding a user-contributed primary video to MPAV content according to one example.
  • FIGS. 32A-32C are block diagrams illustrating a process of adding a user-contributed comment video to MPAV content according to one example.
  • FIG. 33 is a block diagram of an example computing device, one or more instances of which can be used to implement various methods and processes disclosed herein according to various examples.
  • DETAILED DESCRIPTION
  • MPAV streaming can beneficially be used to provide immersive experience for end users from multiple perspectives/viewing angles within the same event.
  • the pertinent MPAV footage can be stored in a bitstream with metadata to allow for synchronized playback between different perspectives.
  • the corresponding MPAV streaming system then enables the end user to seamlessly switch from one perspective to another perspective for an enhanced immersive experience.
  • FIG. 1 depicts an example process of a video/audio/image delivery pipeline (100), showing various stages from video/audio/image capture to content display according to an embodiment.
  • a sequence of video/image frames (102) may be captured or generated using an image-generation block (105).
  • the frames (102) may be digitally captured (e.g., by a digital camera) or generated by a computer (e.g., using computer animation) to provide video, audio, and/or image data (107).
  • the frames (102) may be captured on film by a film camera. Then, the film may be translated into a digital format to provide the video/audio/image data (107).
  • the data (107) may be edited to provide a video/audio/image production stream (112).
  • the data of the video/audio/image production stream (112) may be provided to a processor (or one or more processors, such as a central processing unit, CPU) at a post-production block (115) for post-production editing.
  • the post-production editing of the block (115) may include, e.g., adjusting or modifying colors or brightness in particular areas of an image to enhance the image quality or achieve a particular appearance for the image in accordance with the video creator’s creative intent.
  • This part of post-production editing is sometimes referred to as “color timing” or “color grading.”
  • Other editing (e.g., scene selection and sequencing, image cropping, addition of computer-generated visual special effects, removal of artifacts, etc.) may also be performed at the post-production block (115).
  • video and/or images may be viewed on a reference display (125).
  • the data of the final version (117) may be delivered to a coding block (120) for being further delivered downstream to decoding and playback devices, such as television sets, set-top boxes, movie theaters, and the like.
  • the coding block (120) may include audio and video encoders, such as those defined by the ATSC, DVB, DVD, Blu-Ray, and other delivery formats, to generate a coded bitstream (122).
  • the coded bitstream (122) is decoded by a decoding unit (130) to generate a corresponding decoded signal (132) representing a copy or a close approximation of the signal (117).
  • the receiver may be attached to a target display (140) that may have somewhat or completely different characteristics than the reference display (125).
  • a display management (DM) block (135) may be used to map the decoded signal (132) to the characteristics of the target display (140) by generating a display-mapped signal (137).
  • the decoding unit (130) and display management block (135) may include individual processors or may be based on a single integrated processing unit.
  • a codec used in the coding block (120) and/or the decoding block (130) enables video/audio/image data processing and compression/decompression.
  • the compression is used in the coding block (120) to make the corresponding file(s) or stream(s) smaller.
  • the decoding process carried out by the decoding block (130) typically includes decompressing the received video/audio/image data file(s) or stream(s) into a form usable for playback and/or further editing.
  • FIG. 2 is a block diagram illustrating a configuration (200) for capturing MPAV content according to one example.
  • the configuration (200) can be used in the production phase (110) of the delivery pipeline (100).
  • the configuration (200) includes video/audio capture devices (210 1 -210 4 ), such as cameras. Each of the video/audio capture devices (210 1 -210 4 ) is oriented at a different respective angle with respect to a scene (220) that is being captured.
  • 3A-3D pictorially illustrate views of the scene (220) captured using the configuration (200) according to one example.
  • the four views of the scene (220) correspond to the video/audio capture devices (210 1 -210 4 ), respectively.
  • a corresponding distribution component plays an important role because it can significantly affect the final viewing experience.
  • the distribution component of the delivery pipeline (100) can be configured to use different types of file format and storage for streaming MPAV content.
  • Various features of the delivery pipeline (100) related to the streaming of MPAV content are described in more detail below.
  • the MPAV streaming mode uses a format that is different from a storage format equivalent owing to the limited bandwidth of the network channel(s) between the server and client sides and due to usually limited decoding capabilities of consumer-grade playback devices.
  • some embodiments disclosed herein operate to analyze each perspective to understand its respective characteristics and relative priority.
  • different perspectives can be encoded with different video parameters as different bitstreams.
  • the playback side can request different bitstream versions to enable low-latency and seamless view switching.
  • example embodiments provide ways to prioritize each perspective and to assign attributes to other views that are associated with each view. Also provided are methods of setting video coding parameters for different types of bitstream.
  • the disclosed server-side content preparation and playback-side bitstream download request system beneficially enable low-delay and seamless view switching.
  • Some examples use the MPEG-DASH format for these purposes, together with a new MPD descriptor.
  • An editing mode capable of adding a new perspective and comment video upload is provided as well.
  • DASH stands for Dynamic Adaptive Streaming over HTTP
  • MPD stands for the Media Presentation Description of the MPEG-DASH standard.
  • Architecture of MPEG-DASH Based MPAV Streaming System. This section describes an example MPAV format for streaming. This format incorporates some features of MPEG-DASH and ISO-BMFF as described below.
  • FIG. 4 is a block diagram illustrating a communication system (400) configured to support MPAV transmissions using DASH and ISO BMFF according to some examples.
  • the communication system (400) includes a server (402) and a client (406) connected via a communication channel (404).
  • the server (402) includes a data storage device having stored therein one or more MPAV containers (500).
  • the client (406) requests and receives, via the communication channel (404), pertinent portions of the data stored in the MPAV containers (500) as described in more detail below.
  • the MPAV container (500) includes a plurality of segments logically organized (e.g., using appropriate indexing) to form three sets of planes of a 3D space whose dimensions are Adaptation Set, Representation, and Segment Time.
  • An end user at the client (406) will typically request one initialization segment (632) per period (610).
  • the initialization segment (632) contains camera information and metadata related to the media segment component (e.g., video, audio, subtitle, etc.) so that the end user knows which view(s) to use.
  • FIG. 6 is a block diagram illustrating a high-level hierarchical data model (600) that can be used in DASH according to some embodiments.
  • the model (600) includes a sequence in time of the corresponding Media Presentation.
  • the model (600) includes the choices offered in the Media Presentation, to be selected by the DASH Client in a static or dynamic manner.
  • the model (600) relies on the following definition of various levels used in the DASH hierarchy:
  • Media Presentation Description (MPD): a DASH Media Presentation is described by an MPD document (602). This document describes a sequence of Periods (610) in time that make up the Media Presentation.
  • Period: a period (610) typically represents a media content period during which a consistent set of encoded versions of the media content is available. Within a period, material is arranged into Adaptation Sets (620).
  • Adaptation Set: an adaptation set (620) represents a set of interchangeable encoded versions of one or several media content components and contains a set of Representations (630).
  • the concept of Preselection (612) is used to enable a combination of different Adaptation Sets (620) into a single decoding instance and user experience.
  • Representation: a representation (630) describes a deliverable encoded version of one or several media content components and includes one or more media streams (e.g., one for each media content component in the multiplex).
  • One example way to use the representation (630) is to apply different bit rates.
  • the content may be divided in time into Segments (632, 634, 636) for proper accessibility and delivery.
  • Segment: in some examples, a segment is the largest unit of data that can be retrieved with a single HTTP request. In order to access a segment, a URL is provided for each Segment.
  • An Initialization Segment (632) contains static metadata for the Representation (630); Media Segments (634) contain media samples and advance the timeline.
  • Self-initializing Segment: some representations (630) may also be organized to have a single self-initializing Segment (636), which contains both initialization information and media data.
  • an important component at the server side is the MPD. When the user joins the session to start an MPAV playback, an MPD will be sent to the end user.
  • the MPD file is configured to contain sufficient information for end users to select which perspective/bitstream to request for downloading.
  • each adaptation set represents one MPAV from one perspective with a specific compositing configuration. For example, if we have 50 perspectives, then we will have 50 corresponding adaptation sets. The end user can decide which adaptation set to download by choosing which perspective with specific compositing to render. Note that the end user can request one adaptation set for a single perspective playback, or multiple adaptation sets for pre-fetch, or showing multiple perspectives simultaneously.
  • each representation is associated with one particular encoded MPAV bit rate. A different respective representation can be chosen according to the current network condition between server and client.
  • each Adaptation Set includes the Representation (630) indexed to logically be in a plane that is orthogonal to the Adaptation Set axis of the logical 3D space of the MPAV container (500).
  • each Adaptation Set corresponds to one perspective. If we have fifty camera views, then the MPAV container (500) has fifty corresponding Adaptation Sets. The end user selects a specific Adaptation Set by choosing which perspective to render. In various examples, the end user may request segments from one Adaptation Set for single- perspective rendering or segments from multiple Adaptation Sets for multi-perspective rendering, depending on the rendering algorithm and other relevant conditions.
  • Each Representation includes the segments (632, 634) indexed to logically be in a plane that is orthogonal to the Representation axis of the logical 3D space of the MPAV container (500). In one example, each Representation is associated with one respective encoded bit rate.
  • Each segment-time plane has the segments (634) indexed to logically be in a plane that is orthogonal to the Segment Time axis of the logical 3D space of the MPAV container (500).
  • each segment-time plane is associated with a different respective playtime within a video.
  • Different segment-time planes typically correspond to different respective playtimes. In the view presented in FIG.
  • each of the segments (634) is shown as a box whose location in the 3D space of the MPAV container (500) reflects the segment’s playtime, view, and bit rate.
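By way of illustration only, the following minimal Python sketch shows how the logical 3D indexing (Adaptation Set × Representation × Segment Time) described above could map a (view, bit rate, playtime) triple to a segment request. The URL template and all names are assumptions made for clarity, not taken from this disclosure or from the MPEG-DASH specification.

```python
# Minimal sketch of the logical 3D indexing of the MPAV container (500):
# Adaptation Set (view) x Representation (bit rate) x Segment Time.
from dataclasses import dataclass

@dataclass(frozen=True)
class SegmentKey:
    view: int        # Adaptation Set index (one per perspective/camera view)
    bitrate: int     # Representation index (one per encoded bit rate)
    time: int        # Segment Time index (playtime position)

def segment_url(base_url: str, key: SegmentKey) -> str:
    # Each (view, bitrate, time) triple addresses exactly one media segment (634);
    # the initialization segment (632) is requested once per period.
    return f"{base_url}/view{key.view}/rep{key.bitrate}/seg{key.time}.m4s"

if __name__ == "__main__":
    # Example: request segment 42 of perspective 7 at the second-lowest bit rate.
    print(segment_url("https://cdn.example.com/mpav", SegmentKey(view=7, bitrate=1, time=42)))
```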
  • the metadata transmitted through the communication channel (404) include: • An MPD (412): the MPD (412) provides information about the MPAV container (500) and the URLs for different segments corresponding to different Adaptation Set, representation, and segment-time planes. The MPD (412) enables the end user to retrieve sufficient MPAV scene information and determine which media bitstream(s) to download.
  • An initialization segment (414): an initialization segment (414) is sent for each period (610) and provides information about the number of views, the intrinsic/extrinsic camera parameters, and indicators for post-decoder operations along the time dimension. Note that different MPAV scenes may have different respective numbers of perspectives and different respective camera-pose setups.
  • the transmitted initialization segment (414) contains those time-dependent pieces of information.
  • three inputs are used to determine which media segment(s) to request from the server (402): • Novel viewing pose (436):
  • the client (406) infers the user's newly selected viewing pose (436).
  • the client (406) has information about the selection of camera views. Based on the above-indicated information, the client (406) operates to determine from which view(s) (Adaptation Set(s)) of the MPAV container (500) to request segments. Note that, depending on the rendering algorithm, the segments can be requested from one Adaptation Set or multiple Adaptation Sets of the MPAV container (500). In some examples, a specialized algorithm configured to provide smooth playback transitions when the end user requests a relatively large change in the novel viewing pose (436) can be used to request a larger number of perspectives, for example, at a lower bit rate due to bandwidth considerations.
  • Network condition (432): according to the network condition experienced by the communication channel (404), such as bandwidth, packet loss, and packet delay, the client (406) operates to determine from which Representation of the MPAV container (500) to request segments.
  • the request (426) is then communicated, e.g., via an HTTP GET message (416) to the server (402).
  • the server (402) transmits MPAV bitstreams (418) carrying the desired media segments (634) for being rendered (438) at the client (406).
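A hedged sketch of this client-side selection logic follows: the viewing pose and camera-view selection choose which Adaptation Sets (views) to request, and the measured network condition chooses the Representation (bit rate). The thresholds, safety factor, and function names are illustrative assumptions, not values from the disclosure.

```python
from typing import List

def pick_views(main_view: int, side_views: List[int], all_views: List[int],
               large_pose_change: bool) -> List[int]:
    # A large pose change requests a larger number of perspectives (here: all views,
    # to be fetched at a lower bit rate); otherwise the main view plus the
    # pre-fetched side views suffice.
    return list(all_views) if large_pose_change else [main_view] + list(side_views)

def pick_representation(bitrates_bps: List[int], measured_bandwidth_bps: float,
                        safety_factor: float = 0.8) -> int:
    # Choose the highest Representation whose bit rate fits within a fraction of
    # the measured bandwidth; fall back to the lowest one otherwise.
    budget = measured_bandwidth_bps * safety_factor
    ordered = sorted(range(len(bitrates_bps)), key=lambda i: bitrates_bps[i])
    best = ordered[0]
    for i in ordered:
        if bitrates_bps[i] <= budget:
            best = i
    return best

if __name__ == "__main__":
    views = pick_views(main_view=9, side_views=[2, 6], all_views=list(range(10)),
                       large_pose_change=False)
    rep = pick_representation([500_000, 1_500_000, 4_000_000], measured_bandwidth_bps=2_000_000)
    print("adaptation sets:", views, "representation index:", rep)
```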
  • the MPAV container (500) can be further extended to include additional dimensions, such as resolution, codec type, frame rate, etc.
  • the bit rate dimension may be collapsed down to a single bit-rate value.
  • Quick UDP Internet Connections. To support priority transmission, some embodiments are configured to use Quick UDP Internet Connections (QUIC) to set up multiple queues with different priorities.
  • QUIC: Quick UDP Internet Connections
  • QUIC is a general-purpose transport layer network protocol supported by the Google Chrome, Microsoft Edge, Firefox, and Safari web browsers. Although its name was initially proposed as the acronym for “Quick UDP Internet Connections,” in the IETF's usage QUIC is not an acronym; rather, QUIC is simply the name of the protocol. In various examples, QUIC improves performance of connection-oriented web applications that currently use TCP.
  • QUIC works together with HTTP/2’s multiplexed connections, allowing multiple streams of data to reach all the endpoints independently, and hence independent of packet losses involving other streams.
  • HTTP/2 is hosted on the Transmission Control Protocol (TCP).
  • QUIC secondary goals include reduced connection and transport latency, and bandwidth estimation in each direction to avoid congestion.
  • FIGS. 7A-7B are block diagrams pictorially illustrating example improvements provided by the use of the QUIC protocol in the delivery pipeline (100) according to some examples. More specifically, FIGS. 7A-7B pictorially illustrate how the QUIC protocol can alleviate the head-of-line issues encountered with a legacy TCP connection. For example, in the network configuration illustrated in FIG. 7A,
  • a client (702) is connected, via the Internet (704), to a server (708) using a TCP connection (706).
  • the server (708) operates to send a sequence of packets (1-8) to the client (702).
  • the delivery of the third packet (3) fails, then the delivery of the subsequent packets (4-8) is adversely affected by the failure.
  • a client (712) is connected, via the Internet (704), to a server (718) using a UDP connection (716).
  • the native multiplexing support provided by QUIC enables multiple byte streams passing through the UDP connection (716) without being blocked by a failed packet.
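The head-of-line blocking contrast of FIGS. 7A-7B can be illustrated with a toy simulation. This is not the QUIC protocol itself, only an assumed sketch of the ordering behavior the figures describe.

```python
# Toy illustration of head-of-line blocking (FIG. 7A) versus independent
# multiplexed streams (FIG. 7B). Packet numbers and stream names are made up.
def delivered_in_order(packets, lost):
    out = []
    for p in packets:
        if p in lost:
            break                 # head-of-line blocking: later packets wait for retransmission
        out.append(p)
    return out

def delivered_multiplexed(streams, lost):
    # Each stream is delivered independently; a loss only blocks its own stream.
    return {name: delivered_in_order(pkts, lost) for name, pkts in streams.items()}

if __name__ == "__main__":
    print(delivered_in_order([1, 2, 3, 4, 5, 6, 7, 8], lost={3}))                       # [1, 2]
    print(delivered_multiplexed({"video": [1, 2, 3, 4], "audio": [5, 6, 7, 8]}, lost={3}))
```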
  • FIG. 8 is a block diagram illustrating a configuration (800) for capturing MPAV content according to another example.
  • In the depicted example, there are N = 10 cameras (810 0 -810 9 ) in the configuration (800).
  • Denote the set of the cameras (810 0 -810 9 ) as Ψ.
  • the cameras (810 0 - 810 9 ) operate to capture the same event or scene, with each of the cameras (810 0 -810 9 ) being located at a different respective location and having a respective orientation, e.g., as indicated in FIG. 8.
  • Denote the location (x, y) of the i-th camera as p_{i,t}.
  • The shooting angle and the coverage of the i-th camera are represented as θ_{i,t} and c_{i,t}, respectively.
  • the camera locations can be obtained, e.g., via plan estimation.
  • Each camera may be operated by a different user and face a different respective angle with a respective coverage, e.g., characterized by zoom-in/out and the camera height/width ratio.
  • Main view: the current camera/perspective position an end-user is watching. Normally, it is preferred for the main view to display a relatively high (e.g., the highest available) spatial resolution of video and a relatively high (e.g., the highest available) resolution of audio, subject to the currently available bandwidth.
  • Side view: a side view is another relatively highly relevant view of the event/scene that provides additional information not available from the main view. For each main view, there will be several associated side views (but fewer than all of the other views). Those side views are not watched by the end user at the current moment (or at least not watched in a full screen mode, but may be overlaid on top of the main view video as video thumbnails). One of the side views is very likely to be switched to from the current view. When the switching request arises, an example embodiment is enabled to perform fast switching from the main view to the requested side view(s). To enable such fast switching, we allow a lower spatial resolution of a side view bitstream compared to that of the main view.
  • a practical and effective solution is to transmit the side views in a pre-fetch fashion to be decoded together or along with the main view.
  • Rare view: rare views are the perspectives that belong neither to the main view nor to the set of side views. For example, a rare view may provide more detailed (e.g., finer resolution) information, but some of that information might already be covered in a related side view.
  • the rare views form a set of on-demand views to be invoked only when the end users specifically request them.
  • the rare view bitstreams can be encoded at a lower spatial resolution and with a high intra-frame frequency (e.g., smaller groups of pictures, GOPs).
  • the end user can download and decode the corresponding rare view bitstream as fast as possible.
  • the main view will be the view i.
  • For the main view i at time t, one needs to determine the corresponding side view set S_{i,t} and the corresponding rare view set R_{i,t} = Ψ \ (S_{i,t} ∪ {i}).
  • The maximum number of side views is K (limited by the number of prefetch views the system can deliver through the channel and get decoded with the playback device).
  • Example determination operations may include the following (see the sketch after the categorization example below): • Group all views in Ψ into (K + 1) clusters according to the locations p_{i,t} using the K-means clustering method.
  • For each cluster other than the cluster containing view i, determine the view having the most coverage using the information (p_{j,t}, θ_{j,t}, c_{j,t}) of the views j in that cluster; these K views form the side view set S_{i,t}, and the remaining views form the rare view set R_{i,t}.
  • K-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into K clusters in which each observation belongs to the cluster with the nearest mean (cluster center or cluster centroid), serving as a prototype of the cluster. This method results in a partitioning of the data space into Voronoi cells.
  • K-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances. The corresponding problem is relatively computationally difficult (NP-hard). However, efficient heuristic algorithms converge quickly to a local optimum.
  • FIGS. 9-10 pictorially illustrate some of the above-described determination operations directed at sorting the camera views of the configuration (800) into main, side, and rare views according to one example. More specifically, FIG. 9 pictorially illustrates the grouping operation in which the camera views (810 0 -810 9 ) are grouped into three clusters (902, 904, 906). After the clustering, for each view i at time t, the system operates to build the sets S_{i,t} and R_{i,t} as described above. FIG. 10 pictorially illustrates the resulting categorization of the camera views (810 0 -810 9 ) into main, side, and rare views. In the example shown, the tenth camera view (810 9 ) is categorized as the main view.
  • the third and seventh camera views (810 2 , 810 6 ) are categorized as the side views.
  • the remaining camera views are categorized as the rare views.
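A minimal Python sketch of the grouping and categorization just described follows, assuming known 2D camera locations and a scalar coverage score per view. The plain Lloyd-iteration K-means and all variable names (including S for the side set and R for the rare set) are illustrative assumptions, not the exact procedure of the disclosure.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    # Plain Lloyd iteration on 2D points; returns a cluster index per point.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return assign

def classify_views(locations, coverage, main_view, k_side):
    # Group all views into (K + 1) clusters, then, for each cluster other than the
    # main view's cluster, keep the view with the largest coverage as a side view.
    assign = kmeans(locations, k_side + 1)
    side = []
    for c in range(k_side + 1):
        if c == assign[main_view]:
            continue
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            side.append(max(members, key=lambda i: coverage[i]))
    rare = [i for i in range(len(locations)) if i != main_view and i not in side]
    return side, rare

if __name__ == "__main__":
    locs = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6), (10, 0), (11, 0), (10, 1), (6, 6)]
    cov = [0.5, 0.7, 0.6, 0.9, 0.4, 0.5, 0.8, 0.3, 0.6, 0.7]
    print(classify_views(locs, cov, main_view=9, k_side=2))   # (side views, rare views)
```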
  • this information can be stored in the initial MPD when the content is initially published, e.g., to enable the end users to choose bitstreams for prefetch and last-fetch.
  • the term “perspective” is not limited to a camera view. Under the broader interpretation of the term “perspective,” the main/side/rare view sets can still be generated, but such generation will utilize other suitable features for clustering and categorization. Several nonlimiting examples of the broader interpretation of the term “perspective” are described in more detail below.
  • FIG. 11 pictorially illustrates a first nonlimiting example of the broader interpretation of the term “perspective.” More specifically, the example of FIG. 11 corresponds to a band including three musical instruments and a vocalist.
  • the “guitar” view shown in FIG. 11 is the main view.
  • the corresponding side views are the bass view and the drum views indicated by the thumbnail icons.
  • the rare view is the vocalist view (not explicitly shown).
  • FIG. 12 pictorially illustrates a second nonlimiting example of the broader interpretation of the term “perspective.” More specifically, the example of FIG. 12 corresponds to a chess match.
  • the “referee” view of the players shown in FIG. 12 is the main view.
  • the chessboard view shown in the corner inset is the side view.
  • FIGS. 13A-13B pictorially illustrate a third nonlimiting example of the broader interpretation of the term “perspective.”
  • different “perspectives” can be different products.
  • the three perspectives indicated in the right panel represent different colors of the lipstick product demonstrated by the influencer in the main view shown in the left panel.
  • the two perspectives indicated in the bottom row of the right panel represent different makeup products that can be used in the makeup techniques demonstrated by the influencer in the main view shown in the left panel.
  • the view switching history can be recorded, and one can use this information to update the side and rare views.
  • For example, at certain time periods (e.g., days or weeks), for each segment t, we maintain a 2D histogram table h_t(a, b) with dimension N × N to indicate the view switching from view a to view b.
  • This view switching information is obtained when each end user makes a request for a new main view from the current main view.
  • the side view set S_{i,t} of each main view i can be updated periodically using the K largest elements of h_t(i, ·), for i ∈ {0, …, N − 1}.
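A hedged sketch of this histogram-based side-view update follows; the class and method names are assumptions introduced only for illustration.

```python
class SwitchHistogram:
    # h[a][b] counts switches from view a to view b over some observation window.
    def __init__(self, num_views):
        self.h = [[0] * num_views for _ in range(num_views)]

    def record_switch(self, from_view, to_view):
        # Called whenever an end user requests a new main view from the current one.
        self.h[from_view][to_view] += 1

    def top_k_targets(self, main_view, k):
        # The K most frequent switch targets of a main view become its side view set.
        row = self.h[main_view]
        candidates = [b for b in range(len(row)) if b != main_view]
        return sorted(candidates, key=lambda b: row[b], reverse=True)[:k]

if __name__ == "__main__":
    hist = SwitchHistogram(num_views=10)
    for a, b in [(9, 2), (9, 2), (9, 6), (9, 6), (9, 3)]:
        hist.record_switch(a, b)
    print(hist.top_k_targets(main_view=9, k=2))   # [2, 6]
```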
  • MPAV Bitstream Preparation and Interactive Streaming
  • Some solutions are implemented at the playback side to request different types of bitstreams depending on the viewing mode and current playback time. For example, a user interface is typically provided to indicate the desired playback behavior.
  • Some examples provide four different types of bitstreams. These types are distinguished using the following properties: (1) spatial resolution and (2) GOP size. • Spatial resolution: the higher spatial resolution a bitstream has, the higher bit rate the bitstream needs, with the associated benefit of providing sharper video quality. • GOP size: the GOP size impacts both the video compression efficiency and the random-access Instantaneous Decoder Refresh (IDR) capability.
  • IDR: Instantaneous Decoder Refresh
  • decoding a non-IDR frame involves decoding an IDR frame and possibly some other non-IDR frame(s) first, which introduces playback delay.
  • the following four types of bitstreams can be constructed: • High resolution and long GOP (HRLG). This type of bitstream provides the highest video quality, with the highest bit rate.
  • the distance between two consecutive IDR frames is large, e.g., up to 2 seconds, to enable better picture quality (at the same bit rate).
  • Mid resolution and long GOP (MRLG).
  • This type of bitstream provides a medium video quality, with a medium bit rate.
  • the distance between two consecutive IDR frames is large, e.g., up to 2 seconds, to enable better picture quality (at the same bit rate).
  • Low resolution and long GOP (LRLG).
  • This type of bitstream provides a low video quality, with a low bit rate.
  • the distance between two consecutive IDR frames is large, e.g., up to 2 seconds.
  • Low resolution and short GOP (LRSG). This type of bitstream provides a low video quality, with a medium bit rate.
  • the distance between two consecutive IDR frames is small, e.g., less than 1 second.
  • a main objective is to provide low latency for download and decoding.
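For illustration only, the four bitstream types could be summarized as a configuration table such as the following. The concrete resolutions and GOP durations are example values consistent with the description above, not normative values from the disclosure.

```python
# Illustrative mapping of the four bitstream types to their distinguishing properties.
BITSTREAM_TYPES = {
    "HRLG": {"resolution": (3840, 2160), "gop_seconds": 2.0},   # high res, long GOP
    "MRLG": {"resolution": (1920, 1080), "gop_seconds": 2.0},   # mid res, long GOP
    "LRLG": {"resolution": (960, 540),   "gop_seconds": 2.0},   # low res, long GOP
    "LRSG": {"resolution": (960, 540),   "gop_seconds": 0.5},   # low res, short GOP (fast random access)
}

if __name__ == "__main__":
    for name, props in BITSTREAM_TYPES.items():
        w, h = props["resolution"]
        print(f"{name}: {w}x{h}, GOP ≈ {props['gop_seconds']} s")
```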
  • different types of views adopt different types of bitstream.
  • Main view: use HRLG or MRLG, depending on the number of main views the end user is watching simultaneously (and the current network condition).
  • Side view: use LRLG.
  • Rare view: use LRSG. Denote the GOP size as G.
  • FIG. 14 is a block diagram illustrating four types of bitstream that can be used for delivering main, side, and rare views in the delivery pipeline (100) according to some examples.
  • the four types are the HRLG, MRLG, LRLG, and LRSG bitstreams, respectively.
  • the resolution is indicated by the geometric size of the frame, with a larger geometric size representing a higher resolution.
  • the HRLG, MRLG, and LRLG bitstreams have progressively lower resolution.
  • the LRLG and LRSG bitstreams have the same resolution.
  • the relative GOP size is indicated by the number of non-IDR frames between two consecutive IDR frames.
  • each main view may contain multiple tiles.
  • a main view tile is used for the originally captured videos.
  • Several side view tiles are used for the later commented videos.
  • the side view tiles can be moved/overlapped to/with different areas on top of the main view tile according to users’ preferences. Tile information for each view is also entered into the MPD.
  • the following different playback modes are supported: (1) a single main view; (2) two or more simultaneous main views; and (3) loop viewing. These three playback modes are described in more detail below in reference to FIGS. 15-18.
  • In the single main view playback mode, a normal configuration is to display only one selected main view.
  • the main view bitstream is requested for download and decoded.
  • the resolution and quality for the main view is maintained at a highest possible level according to the network condition.
  • Several side view bitstreams with a smaller resolution are also downloaded together with the main view bitstream, which provides the user(s) with a choice of those side views to watch.
  • FIGS. 15A-15B are block diagrams illustrating a single main view playback mode used in the delivery pipeline (100) according to some examples. More specifically, FIG. 15A illustrates a configuration in which four bitstreams are transmitted.
  • a first bitstream (1502) is an HRLG bitstream that carries a main view.
  • Each of three second bitstreams (1504 1 -1504 3 ) is an LRLG bitstream that carries a different respective side view.
  • the relative view resolutions indicated in FIG. 15A correspond to those indicated in FIG. 14.
  • FIG. 15B also illustrates a configuration in which four bitstreams are transmitted.
  • the first bitstream (1502) is an HRLG bitstream that carries a main view.
  • Each of two second bitstreams (1504 1 , 1504 2 ) is an LRLG bitstream that carries a different respective side view.
  • a third bitstream (1506) is an LRSG bitstream that carries a rare view.
  • FIG. 16 is a block diagram illustrating the switching from the main view (1502) to the side view (1504 3 ) in the bitstream configuration of FIG. 15A according to one example.
  • the icon (1602) indicates the time at which the view switching occurs, and the side view (1504 3 ) is specified by the end user.
  • the side view (1504 3 ) is pre-fetched/downloaded together with the main view (1502) so that the switching can be substantially instantaneous.
  • the lower resolution of the side view is upscaled to the display's resolution, as indicated by the display view sequence shown in the bottom part of FIG. 16.
  • FIG. 17 is a block diagram illustrating the switching from the main view (1502) to a rare view in the bitstream configuration of FIG. 15A according to one example.
  • the icon (1702) indicates the time at which the view switching occurs, and the rare view is specified by the end user.
  • the rare view is NOT prefetched/downloaded together with main view (1502).
  • the playback side needs to request the rare view which is encoded using a lower resolution with a smaller GOP, e.g., using the LRSG format (also see FIG. 14). For this reason, the switching will have a larger delay than that in the scenario illustrated in FIG. 16.
  • the delay can be reduced (e.g., minimized) by requesting the bitstream with the IDR frame closest to the current playback time (e.g., frame d in this case) so that it does not need to be downloaded from frame a.
  • Denote the bit rate for the main view as B^(m)_{i,t}.
  • Denote the bit rate for the currently selected side view as B^(s)_{j,t}.
  • The total estimated available bandwidth is C_t.
  • The remaining bandwidth available to transmit the rare view (with bit rate B^(r)_{k,t}) is expressed as C_t − B^(m)_{i,t} − Σ_j B^(s)_{j,t}.
  • The distance (in frames) between the current frame and the previous IDR frame at the main view is d_IDR.
  • Denote the overhead time (including the time for requesting a rare view, the response time from the server side, and the pre-decoding time at the client side) as τ^(r)_o.
  • Denote the expected number of frames for the main view and rare view to sync up from the current playback frame index as F.
  • The time duration (in terms of the number of frames) in the main view is d_IDR + F.
  • The actual time duration available to download the rare view (in terms of the number of frames) is F − τ^(r)_o, and the number of rare view frames that can be carried using the remaining bandwidth is ((C_t − B^(m)_{i,t} − Σ_j B^(s)_{j,t}) / B^(r)_{k,t}) · (F − τ^(r)_o).
  • The sync-up point is therefore chosen such that ((C_t − B^(m)_{i,t} − Σ_j B^(s)_{j,t}) / B^(r)_{k,t}) · (F − τ^(r)_o) ≥ d_IDR + F. (Eq. 1)
  • The player is configured to use the bit rate information {C_t, B^(m)_{i,t}, B^(s)_{j,t}, B^(r)_{k,t}} and the delays {τ^(r)_o, d_IDR}, together with Eq. (1), to determine which rare view segment to request to minimize the latency.
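A minimal sketch of this rare-view sync computation follows, assuming the bandwidth/IDR relationship reconstructed above as Eq. (1); the exact form in the original filing may differ. All symbols, the use of frames as the time unit, and the numbers in the example are illustrative.

```python
def min_sync_frames(C_t, B_main, B_sides, B_rare, d_idr, tau_o_frames, max_f=10_000):
    """Smallest F such that the rare-view frames downloadable with the remaining
    bandwidth during (F - tau_o) cover the d_IDR + F frames needed to sync up."""
    remaining = C_t - B_main - sum(B_sides)          # bandwidth left for the rare view
    if remaining <= 0:
        return None                                  # no headroom: cannot fetch the rare view in time
    rate_ratio = remaining / B_rare                  # rare-view frames deliverable per played frame
    for F in range(int(tau_o_frames) + 1, max_f):
        if rate_ratio * (F - tau_o_frames) >= d_idr + F:
            return F
    return None

if __name__ == "__main__":
    # 6 Mbps channel, 3 Mbps main view, two 0.5 Mbps side views, 1 Mbps rare view,
    # 12 frames back to the previous IDR frame, 6 frames of request/decode overhead.
    print(min_sync_frames(6e6, 3e6, [0.5e6, 0.5e6], 1e6, d_idr=12, tau_o_frames=6))   # 24
```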
  • the end user wants to simultaneously watch multiple main views on the same screen. In this case, multiple main views are requested and downloaded at the same time.
  • In this mode, the resolution and bit rate for all views may need to be reduced (e.g., to a lower spatial resolution). Due to the bandwidth limitations, the side views are not transmitted. View switching for each main view is updated when the playback enters the next segment.
  • FIG. 18 is a block diagram illustrating the loop viewing playback mode according to one example.
  • the loop viewing playback mode enables the user to watch a portion of the MPAV content over and over again for a set of views.
  • the loop mode can be implemented to support the above-described single main view and/or two or more simultaneous main views.
  • the playback device can gradually download higher resolution bitstreams (sequentially or in parallel) for each view and eventually all main views become available at the highest possible resolution under the bandwidth constraint.
  • In loop 1, none of the views is available in the highest resolution.
  • In loop 2, one of the views is available in the highest resolution.
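A toy sketch of this gradual-upgrade strategy follows (one possible interpretation of the loop-viewing mode; the per-loop bandwidth budget and per-view costs are made-up values).

```python
def plan_loop_downloads(views, upgraded, highres_cost_bits, budget_bits_per_loop):
    """Pick which not-yet-upgraded views to fetch in high resolution during the next loop."""
    plan, spent = [], 0.0
    for v in views:
        if v in upgraded:
            continue
        if spent + highres_cost_bits <= budget_bits_per_loop:
            plan.append(v)
            spent += highres_cost_bits
    return plan

if __name__ == "__main__":
    upgraded = set()                       # loop 1: no view is available at the highest resolution
    for loop in range(2, 5):
        fetched = plan_loop_downloads([0, 1, 2], upgraded, highres_cost_bits=4e6,
                                      budget_bits_per_loop=5e6)
        upgraded |= set(fetched)
        print(f"loop {loop}: views upgraded so far -> {sorted(upgraded)}")
```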
  • FIGS. 19A-19C pictorially illustrate a GUI (1900) that can be used at the user side according to one example.
  • the GUI (1900) represents the configuration (800) (also see FIG. 8).
  • the GUI (1900) is configured to dynamically show the main, side, and rare views when the user moves the focus to a different main view location.
  • in FIG. 19A, the tenth camera view (8109) is selected by the user as the main view.
  • FIG. 20 is a block diagram illustrating a configuration (2000) of the GUI (1900) showing the editing history corresponding to a primary layer according to one example.
  • a primary layer (2002) from one view is provided with four comments (C0, C1, C2, and C3) uploaded from four other users. The comments may be related to the scene/event in general and/or to this view in particular.
  • FIG. 22 pictorially illustrates a configuration of the GUI (1900) according to another example.
  • an information icon (2202) pops up showing the layout and/or presentation options (2102, 2104, 2106) corresponding to that camera view.
  • the comment videos can be encoded/processed using one of at least two different methods:
    • Rendered (burned in) together with the primary layer as a new video version. This method can save download bandwidth but may lack flexibility for the end user to have a selected level of interactivity.
    • Stored as a side tile and downloaded together with the primary layer, so that the end user can move the comment video to a different location overlapping the primary video and play the audio for different comments.
  • the MPD is configured to describe different levels of information:
    • Level 1 - Main view/side view/rare view: For each view, we specify the URL to download the main view and to prefetch the side view(s). The URL for downloading the instantaneous-fetch rare view is also provided. A tag in the MPD XML is introduced to list the side views per main view i.
    • Level 2 - Version history: For each view, we also specify the provided versions and some icons linking to the URLs of the different versions.
    • Level 3 - Personalization: Information configured to enable filtering of the MPAV content based on user preferences.
  • each bitstream stored in the server is encoded as a tile.
  • the playback side will request different bitstreams containing different tiles, use a high-level syntax transcoder to merge them into a single video frame, and push the merged frame into a single video decoder for decoding.
  • the user-side player has a set of buffer features, wherein the downloading queue is governed by the (interactive) content player.
  • FIG. 23 is a block diagram illustrating different priorities assigned to different MPAV bitstreams according to one example.
  • different views are assigned to different priority queues as follows:
    • Main view bitstream in high priority queue: A main view has the content that the user is watching at the current moment. It is assigned the highest priority to ensure smooth playback.
  • the corresponding bitstream can be an HRLG bitstream or an MRLG bitstream.
    • Side view bitstream in low priority queue: A side view has the backup content that the user might want to watch with relatively high probability. However, the user is not watching the side view at the current moment. The corresponding stream is therefore downloaded at a low priority.
    • Rare view bitstream in mid priority queue: A rare view has the content that the user wants to watch at the current moment, but it is not yet available. To ensure smooth view switching, the main view streaming should still remain at the high priority. The priority of the rare view streaming is lower than that of the main view but higher than that of the side view. By assigning different priorities to different types of views, the corresponding embodiments tend to provide better low-latency interactivity.
  • FIG. 24 is a flowchart illustrating an MPAV workflow (2400) implemented at the server side according to some examples.
  • the workflow (2400) is triggered when the view switching is enabled or when new content (either new perspective or comment video) is added.
  • Main operations of the workflow (2400) include the following: • With new content (2404), the server operates to analyze the perspective of the entire content (including existing content and the new content (2404)) to get the camera location(s), viewing angle(s), and content coverage. Then, an algorithm (2410) is run to cluster all views.
  • FIG. 25 is a flowchart illustrating an MPAV workflow (2500) implemented with a server side (2502) and a playback side (2504) during playback according to some examples.
  • Main operations of the workflow (2500) include the following:
    • At the beginning of each event, the server (2502) operates (2510) to send a corresponding MPD (2511) to the client (2504). The client (2504) operates (2512) to receive the sent MPD (2511).
    • The client (2504) operates (2514) to build the corresponding GUI (1900) using the MPD three-level tags to show the view clusters. When the user moves the cursor (1902) to a selected camera view (treated as the main view), the GUI (1900) will show its own side view and rare view.
    • The client (2504) operates (2516) to receive the user input specifying the playback mode to show a desired number of main views, and whether to play them in a loop mode.
  • the playback system of the client (2504) performs a set of operations (2520). Therein, the playback system operates to determine the side view from the MPD (2511). The playback system also detects whether a rare view is to be requested.
  • the client (2504) operates (2528) to send a request (2529) specifying bitstreams (2535) to be sent from the server (2502).
  • the server (2502) operates (2530) to receive the request (2529).
  • the server (2502) operates (2532) to update the statistics of view switching.
  • the server (2502) further operates (2534) to send the requested bitstreams (2535) to the client (2504).
MPEG-DASH MPD
  • [00105] This section discloses embodiments directed at adapting MPEG-DASH for MPAV purposes and enabling seamless perspective switching for an interactive video streaming experience. Such embodiments build upon the existing DASH technologies/infrastructure, to which they add new tools configured to fill the corresponding gaps related to MPAV. This approach enables a relatively speedy deployment of MPAV over the existing infrastructure to the benefit of end users.
  • [00106] In the interactive video streaming scenario, when the player/client joins the session to start the playback, an MPD is sent to the player/client.
  • the MPD contains sufficient information for the player/client to identify which perspective is to be prefetched and to select which segment(s) to download according to a change of the perspective.
  • the metadata transmitted through the channel may include: • Information about the structure of the whole audio/video representation, e.g., indicating whether it is a single- or multi-perspective representation. • When it is a multi-perspective, o The information configured to enable the player/client to identify perspectives to be prefetched. o The information about the view categorization, e.g., indicating which views are main, side, and rare views, among the different perspectives. • Version information to specify the provided versions and to link to the URLs of different versions.
  • each preselection represents a combination of Adaptation Sets that can be selected and prefetched for joint provision within the given time period.
  • each Adaptation Set represents one or multiple perspectives. Each perspective is associated with perspective information, such as particular extrinsic camera information or a particular label. When each Adaptation Set represents one perspective, four perspectives result in four corresponding adaptation sets. Alternatively, a single Adaptation Set can represent multiple perspectives.
  • each representation is associated with one particular bitrate version or GOP length of the encoded bitstream. The particular representation is chosen according to the current network condition between the server and client and/or according to the selected perspective.
  • FIG. 26 is a flowchart (2600) illustrating communications between a server (2620) and a client (2630) and the corresponding processing operations performed at the client (2630) according to various examples.
  • the client (2630) includes an MPAV playback system.
  • a final block (2611) of operations in the flowchart (2600) includes the client generating the output view and rendering the generated view according to the user’s selected-view input.
  • an MPD (2601) is sent from the server (2620) to the client (2630).
  • the client (2630) parses the MPD (2601) in a first block (2602) of operations and determines a preselection based on one or more user inputs in a second block (2603) of operations.
  • the client (2630) selects a collection of adaptation sets, sends a request (2605) for the corresponding initialization segment, and receives the requested initialization segment (2606).
  • the client (2630) determines one or more representations and sends one or more corresponding requests (2608) to the server (2620).
  • the client (2630) receives the requested bitstreams (2609) and decodes the received bitstreams.
  • the client (2630) uses the decoded bitstreams to generate the view(s) that fit(s) to the selected view and renders the generated view for the user.
  • when the Adaptation Set contains multiple perspective representations multiplexed at the file-container level, each media component is mapped to a Content Component.
  • a Representation contains multiple tracks, and each track is mapped to a Content Component, e.g., as defined in sub-clause 5.3.4 of ISO/IEC 23009-1.
  • each perspective video has a different respective set of properties, such as the bitrates and GOP lengths
  • the Sub-Representation element describes the properties of video components.
  • in this example, each representation contains four video tracks, each track corresponding to a perspective, and is provided in three different bitrates.
  • This Representation conforms to a Sub-Indexed Media Segment as defined in sub-clause 6.3.4.4 of ISO/IEC 23009-1.
  • the Initialization Segment contains the Level Assignment ('leva') box.
  • the SubRepresentation@level specifies the level to which the described Sub-Representation is associated in the Subsegment Index.
  • the data in Representation, Sub-Representation, and in the Level Assignment ('leva') box contain information on the assignment of media data to levels.
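  • As a minimal illustrative sketch (not a normative example from the disclosure) of the structure described above, the following MPD fragment shows one Adaptation Set whose Representation carries four perspective tracks exposed as Sub-Representations mapped to Content Components; all identifiers, bandwidth values, and URLs are assumptions introduced here for illustration, and the Level Assignment ('leva') box itself resides in the Initialization Segment rather than in the MPD:

    <AdaptationSet id="1" contentType="video" mimeType="video/mp4">
      <!-- one Content Component per perspective -->
      <ContentComponent id="11" contentType="video"/>
      <ContentComponent id="12" contentType="video"/>
      <ContentComponent id="13" contentType="video"/>
      <ContentComponent id="14" contentType="video"/>
      <!-- one of the (e.g., three) bitrate versions of the multiplexed representation -->
      <Representation id="rep-low" bandwidth="4000000">
        <SubRepresentation level="0" contentComponent="11" bandwidth="1000000"/>
        <SubRepresentation level="1" contentComponent="12" bandwidth="1000000"/>
        <SubRepresentation level="2" contentComponent="13" bandwidth="1000000"/>
        <SubRepresentation level="3" contentComponent="14" bandwidth="1000000"/>
        <SegmentTemplate media="mux_low_$Number$.m4s" initialization="mux_low_init.mp4"/>
      </Representation>
    </AdaptationSet>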
  • the PreSelection is used to enable a combination of different Adaptation Sets into a single user experience.
  • the MPAV preselection, which includes multiple perspectives for a single user experience, may be signaled either in the MPD using a PreSelection element within the Period element or in a Preselection descriptor.
  • when the MPAV preselection is present in the DASH MPD, it indicates that the player/client needs to prefetch multiple adaptation sets to provide a single user experience.
  • when a single Adaptation Set contains multiple perspectives, each perspective can be referenced by the @id of a Content Component.
  • the Preselection references Content Components
  • the @id of Adaptation Sets and Content Components shall be unique within the scope of a Period.
  • the first id of @preselectionComponents defines the Main Adaptation Set which carries one of the main view perspective videos.
  • An MPAV descriptor is an EssentialProperty or a SupplementalProperty element with a @schemeIdUri set to "urn:dolby:dash:mpav".
  • the MPAV descriptor specifies which members of the preselection are main views (which need to be provided in a higher quality version) and, optionally, which are side views or rare views. At most one MPAV descriptor shall be present at the Preselection level.
  • the @value attribute of the MPAV descriptor is not present.
  • the MPAV descriptor includes attributes as specified in Table 1. Table 1: Attributes for the MPAV descriptor.
  • mpav:@main_ids M Specifies a white space separated list of ids of Main view Adaptation Sets or Content Components that belong to this Preselection.
  • mpav:@side_ids O Specifies a white space separated list of ids of the side view Adaptation Sets or Content Components that belong to this Preselection.
  • mpav:@rare_ids O Specifies a white space separated list of ids of the rare view Adaptation Sets or Content Components that belong to this Preselection.
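  • A minimal sketch of how the MPAV descriptor might appear inside a Preselection, assuming that the mpav:@main_ids, mpav:@side_ids, and mpav:@rare_ids attributes of Table 1 are serialized as namespace-qualified XML attributes on the descriptor element; the namespace binding and all id values are assumptions used only for illustration:

    <Period xmlns:mpav="urn:dolby:dash:mpav">
      <!-- Adaptation Sets 1..4 carry the individual camera-view videos (omitted here) -->
      <Preselection id="P1" preselectionComponents="1 2 3 4">
        <!-- the first id ("1") is the Main Adaptation Set, per the convention above -->
        <EssentialProperty schemeIdUri="urn:dolby:dash:mpav"
                           mpav:main_ids="1" mpav:side_ids="2" mpav:rare_ids="3 4"/>
      </Preselection>
    </Period>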
  • FIG. 27 is a block diagram illustrating a DASH configuration (2700) for grouping four perspective components belonging to a single representation within an MPD according to one example. Each perspective is presented as a respective one of Adaptation Sets (2704, 2706, 2708, 2710). Using a Preselection (2702), the configuration (2700) groups the four Adaptation Sets (2704, 2706, 2708, 2710) that need to be prefetched to provide a single experience.
  • FIG. 28 is a block diagram illustrating a DASH configuration (2800) for defining two groups of multi-perspective video experiences according to one example.
  • a first Preselection (2802) groups four Adaptation Sets (2806, 2808, 2810, 2812) to provide a first multi-perspective video experience.
  • a second Preselection (2804) groups three Adaptation Sets (2812, 2814, 2816) to provide a second multi-perspective video experience.
  • the configuration (2800) also indicates which of the adaptation sets carries the main view, side view, and rare view in each experience.
  • One main view video component is delivered in a single Adaptation Set.
  • the other three video components are delivered in a separate Adaptation Set and described by ContentComponent elements.
  • the first experience is a combination of the main view and one side view video.
  • the second experience is a combination of the main view and three side views.
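  • The two-experience configuration described above might be sketched in an MPD as follows; the element ids, the grouping of the three side-view components as Content Components of a second Adaptation Set, and the namespace binding are assumptions introduced for illustration only:

    <Period xmlns:mpav="urn:dolby:dash:mpav">
      <AdaptationSet id="1" contentType="video">
        <!-- main view video component; Representations omitted -->
      </AdaptationSet>
      <AdaptationSet id="2" contentType="video">
        <!-- the other three video components, described by Content Components -->
        <ContentComponent id="21" contentType="video"/>
        <ContentComponent id="22" contentType="video"/>
        <ContentComponent id="23" contentType="video"/>
      </AdaptationSet>
      <!-- first experience: main view plus one side view -->
      <Preselection id="P1" preselectionComponents="1 21">
        <SupplementalProperty schemeIdUri="urn:dolby:dash:mpav"
                              mpav:main_ids="1" mpav:side_ids="21"/>
      </Preselection>
      <!-- second experience: main view plus three side views -->
      <Preselection id="P2" preselectionComponents="1 21 22 23">
        <SupplementalProperty schemeIdUri="urn:dolby:dash:mpav"
                              mpav:main_ids="1" mpav:side_ids="21 22 23"/>
      </Preselection>
    </Period>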
  • the MPAV Perspective (MPER) descriptor is used.
  • the MPER descriptor is a SupplementalProperty or EssentialProperty element with a @schemeIdUri set to "urn:dolby:dash:mpav:pers".
  • zero or more MPER descriptors may be present at the Adaptation Set level, and at most one MPER descriptor may be present at the Sub-Representation level, exclusively.
  • Zero or more MPER descriptors are present at Preselection level.
  • the @value attribute of the MPER descriptor is not present.
  • the MPER descriptor includes elements and attributes as specified in Table 2. Table 2: Elements and attributes for the MPAVPerspective (MPER) descriptor.
  • mpav:@id O Specifies the identifier of this perspective.
  • mpav:@label O Specifies the textual description to annotate this perspective.
  • mpav:@coord O Indicates the x-, y- and z-coordinates of the position of the perspective in the global reference coordinate system. The values in the array are in said order and the length of array is three.
  • when present at the Preselection level, the MPER descriptor indicates either the 3D (x, y, and z) coordinates of the position or the textual label of the main perspective, and the identifier of the Adaptation Set carrying the corresponding main view perspective representation.
  • the information in this MPER descriptor at the Preselection level enables the player/client to select the proper Preselection that fits to the current selected perspective/view.
  • the main view Adaptation Set or main view Content Component indicated by @dsId starts to play.
  • when the MPER descriptor is present at the Adaptation Set level, it indicates either the 3D (x, y, and z) coordinates of the position or the textual label of the perspective that is carried in the adaptation set.
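  • A minimal sketch of an MPER descriptor at the Preselection level, assuming the Table 2 attributes are serialized as namespace-qualified attributes and assuming a @dsId attribute that identifies the main view Adaptation Set; the coordinate values, label, and ids are illustrative assumptions:

    <Preselection id="P1" preselectionComponents="1 2 3 4" xmlns:mpav="urn:dolby:dash:mpav">
      <!-- the main-view perspective of this experience is labeled "front"
           and is carried in Adaptation Set 1 -->
      <SupplementalProperty schemeIdUri="urn:dolby:dash:mpav:pers"
                            mpav:label="front" mpav:coord="0.0 1.5 -3.0" mpav:dsId="1"/>
    </Preselection>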
  • FIG. 29 is a block diagram illustrating a DASH configuration (2900) for signaling the hierarchy of data structure in the MPD according to one example.
  • a Preselection (2902) of configuration (2900) includes an MPER descriptor (2903) that indicates the x, y, and z coordinates of the position or the textual label of the main view perspective.
  • the DASH configuration (2900) includes Adaptation Sets (2904, 2906, 2908, 2910).
  • the MPER descriptor (2903) also indicates the identifier of the Adaptation Set carrying the main view perspective representation. The information in the MPER descriptor (2903) at the Preselection level enables the player/client to select a proper Preselection that fits the current or selected perspective/view.
  • another DASH configuration (3000) includes a Preselection Set (3002) and Adaptation Sets (3004, 3006, 3008, 3010).
  • Each of the Adaptation Sets (3004, 3006, 3008, 3010) includes a respective one of Adaptation set level MPER descriptors (3005, 3007, 3009, 3011).
  • the information in the MPER descriptor at the Adaptation set level provides the respective information of each perspective.
  • the following MPD example illustrates the DASH configuration (3000) for providing a multi-perspective video experience with “front,” “vocal,” “drum,” and “guitar” perspectives/views.
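  • The MPD example referenced above is not reproduced in this text; the fragment below is a minimal sketch of such a configuration, in which the Adaptation Sets carry the “front,” “vocal,” “drum,” and “guitar” perspectives and each carries an Adaptation-Set-level MPER descriptor (all ids, bandwidths, and URLs are illustrative assumptions):

    <Period xmlns:mpav="urn:dolby:dash:mpav">
      <Preselection id="P1" preselectionComponents="1 2 3 4"/>
      <AdaptationSet id="1" contentType="video">
        <SupplementalProperty schemeIdUri="urn:dolby:dash:mpav:pers" mpav:id="1" mpav:label="front"/>
        <Representation id="front-hi" bandwidth="8000000">
          <BaseURL>front/</BaseURL>
        </Representation>
      </AdaptationSet>
      <AdaptationSet id="2" contentType="video">
        <SupplementalProperty schemeIdUri="urn:dolby:dash:mpav:pers" mpav:id="2" mpav:label="vocal"/>
        <Representation id="vocal-hi" bandwidth="4000000">
          <BaseURL>vocal/</BaseURL>
        </Representation>
      </AdaptationSet>
      <!-- the "drum" (id=3) and "guitar" (id=4) Adaptation Sets follow the same pattern -->
    </Period>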
  • each of the MPER descriptors (3005, 3007, 3009, 3011) present at the adaptation set level provides the respective “label” information.
  • the player/client can provide “label” information of each perspective to the user through the GUI based on the information contained in the MPER descriptors (3005, 3007, 3009, 3011).
  • this information helps the player/client to filter the perspectives, e.g., to only show the particular label perspective to match the user preference, among the perspectives of the Preselection Set (3002).
  • the Adaptation Set or Sub-Representation associated with @tag containing a lower integer value indicates that it was edited/added earlier than one associated with @tag containing a higher value.
  • the following provides an MPD example containing three adaptation sets that were added earlier than another adaptation set, which is indicated by using @tag.
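  • The referenced MPD example is likewise not reproduced here; a minimal sketch of how @tag might convey the editing order is shown below, with Representations omitted and all id and tag values assumed for illustration (a lower @tag value marks content that was added or edited earlier):

    <Period>
      <AdaptationSet id="1" contentType="video" tag="1"/>
      <AdaptationSet id="2" contentType="video" tag="1"/>
      <AdaptationSet id="3" contentType="video" tag="1"/>
      <!-- this Adaptation Set was added in a later edit, hence the higher tag value -->
      <AdaptationSet id="4" contentType="video" tag="2"/>
    </Period>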
  • the personalization information (PI) descriptor is a SupplementalProperty element with a @schemeIdUri set to "urn:dolby:dash:mpav:pi".
  • zero or more PI descriptors may be present at the Adaptation Set or the Sub-Representation level.
  • the @value attribute of the PI descriptor is not present.
  • the PI descriptor includes elements and attributes as specified in Table 3. Table 3: Elements and attributes for the PI descriptor.
    • PersonalizationInfo (Use: 0..N): Specifies the personalization information to be used for filtering out the content.
    • PersonalizationInfo@cond (Use: M): Indicates the condition associated with the media. For example, the value “source” indicates the source of the media.
    • PersonalizationInfo@property (Use: M): Indicates the property of the condition for the carried media. For example, when @cond is equal to “source”, this value can be “original”, “comment”, or “supplemental”.
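  • A minimal sketch of a PI descriptor attached to an Adaptation Set that carries a user-contributed comment video, assuming that the PersonalizationInfo element of Table 3 is serialized as a namespace-qualified child element of the descriptor; the id and namespace binding are illustrative assumptions:

    <AdaptationSet id="5" contentType="video" xmlns:mpav="urn:dolby:dash:mpav">
      <SupplementalProperty schemeIdUri="urn:dolby:dash:mpav:pi">
        <!-- this media is a comment video rather than original source content -->
        <mpav:PersonalizationInfo cond="source" property="comment"/>
      </SupplementalProperty>
      <!-- Representations omitted -->
    </AdaptationSet>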
  • each Representation or Sub-Representation is associated with a different respective GOP length.
  • @maximumSAPPeriod is present at the Representation or Sub-Representation level to specify the maximum SAP interval in seconds of all contained media streams, where the SAP (Stream Access Point) interval is the maximum time interval between the TSAP of any two successive SAPs of one media stream in the associated Representations.
  • one of the Representations includes a video stream with a maximum SAP interval of 3 seconds, and each of the other two Representations includes a respective video stream with a maximum SAP interval of 1 second.
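  • The corresponding MPD fragment might be sketched as follows, with @maximumSAPPeriod distinguishing the long-GOP version (random access at most every 3 seconds) from the two short-GOP versions (random access at most every 1 second); the Representation ids and bandwidth values are illustrative assumptions:

    <AdaptationSet id="1" contentType="video">
      <Representation id="r-long-gop" bandwidth="6000000" maximumSAPPeriod="3.0"/>
      <Representation id="r-short-gop-hi" bandwidth="6500000" maximumSAPPeriod="1.0"/>
      <Representation id="r-short-gop-lo" bandwidth="3000000" maximumSAPPeriod="1.0"/>
    </AdaptationSet>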
  • the MPTS descriptor is a SupplementalProperty or EssentialProperty element with a @schemeIdUri set to "urn:dolby:dash:mpav:mpts". Zero or more MPTS descriptors may be present at the Adaptation Set, and at most one MPTS descriptor is present at the Sub-Representation level, exclusively. The @value attribute of the MPTS descriptor is not present.
  • the MPTS descriptor may include elements and attributes as specified in Table 4. Table 4: Elements and attributes for the MPTS descriptor.
  • when the MPTS descriptor is present at the Sub-Representation level, it indicates allowed or non-allowed perspective transitions when the perspective carried in the Sub-Representation is selected.
  • the following MPD example provides that each perspective has a different respective transition permission; for example, the “front” perspective allows transitions to the “vocal” and “guitar” perspectives and disallows a transition to the “back” perspective.
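  • Because Table 4 is not reproduced in this text, the attribute names in the following sketch (mpav:allowed_labels and mpav:disallowed_labels) are purely hypothetical placeholders; the fragment only illustrates where an MPTS descriptor expressing the “front” perspective's transition permissions could sit:

    <AdaptationSet id="1" contentType="video" xmlns:mpav="urn:dolby:dash:mpav">
      <SupplementalProperty schemeIdUri="urn:dolby:dash:mpav:pers" mpav:id="1" mpav:label="front"/>
      <!-- hypothetical attribute names; Table 4 defines the actual elements/attributes -->
      <SupplementalProperty schemeIdUri="urn:dolby:dash:mpav:mpts"
                            mpav:allowed_labels="vocal guitar" mpav:disallowed_labels="back"/>
    </AdaptationSet>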
  • the @codecs of this timed metadata Representation is set to ‘dypt’.
  • the MPAV Preselection may contain the timed metadata Adaptation Set.
  • the timed metadata Adaptation Set is present under the MPAV Preselection.
  • this timed metadata Representation is associated with a Representation in the Adaptation Set using the @associationId attribute and an @associationType value that includes the 4CC ‘cdsc’.
  • timed metadata indicate the director’s predefined perspective trajectory that is recommended to be followed by the player in the absence of appropriate user input, e.g., metadata that can control the layout as well as specify aspects on how the interaction happens in the presentation timeline.
  • the timed metadata track has a type entry ‘dypt’ defined as follows.
  • class DynamicPerspectiveSampleEntry extends MetaDataSampleEntry(‘dypt’) {
        PerspectiveInfoConfigurationBox();
    }

    aligned(8) class PerspectiveInfoConfigurationBox extends FullBox(‘mvpC’, version, 0) {
        unsigned int(8) perspective_type;
        string perspective_description;
    }
  • perspective_type specifies the type of the perspective trajectory as indicated in Table 5: Table 5: Values for all samples referring to perspective_type.
  • perspective_description is a null-terminated UTF-8 string that provides a textual description of the perspective trajectory.
  • interaction_info_present indicates whether the interaction information is signalled in the sample.
  • perspective_id is an identifier that is used to identify the perspective.
  • perspective_cancel_flag is equal to 1
  • perspective_cancel_flag is equal to 0
  • perspective_label specifies the textual description to annotate this perspective.
  • layout_option specifies the layout of presentation of the signaled perspectives in the sample, in the absence of user input, as follows:
    • 0: Only one main view is selected and presented
    • 1: Simultaneous multiple views
    • 2..239: Reserved
    • 240..255: Unspecified (for use by applications or external specifications)
  • interaction_option specifies aspects of how the interaction happens, with the meaning of different values indicated as follows:
    • 0: Bar with alternatives
    • 1: Split-view click
    • 2: Swipe
    • 3..239: Reserved
    • 240..255: Unspecified (for use by applications or external specifications)
  • [00145] the following MPD example offers multiple perspective video streams with a timed metadata Representation carrying the director’s default trajectory of moving perspectives.
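  • The referenced MPD example is not reproduced here; the fragment below is a minimal sketch of a timed metadata Adaptation Set carrying the director's perspective trajectory, associated with a main-view Representation via @associationId and @associationType='cdsc' and using @codecs='dypt' (the ids, bandwidth, and URL are illustrative assumptions):

    <AdaptationSet id="9" contentType="application" mimeType="application/mp4">
      <Representation id="director-trajectory" codecs="dypt" bandwidth="2000"
                      associationId="front-hi" associationType="cdsc">
        <BaseURL>trajectory/</BaseURL>
      </Representation>
    </AdaptationSet>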
  • Each AdaptationSet in each PreSelection is one of (1) Main view, (2) Side view, and (3) Rare view.
  • Table 6: Relationship between PreSelection and AdaptationSet.
    A1 A2 A3 A4 A5 A6 A7 A8 A9
    P1: Main Side Side Rare Rare Rare Rare Rare Rare Rare Rare
    P2: Side Rare Rare Main Rare Side Rare Rare Rare Rare Rare
    P3: Rare Rare Side Rare Rare Rare Rare Main Side Rare

Bitstream Editing
  • [00148] As mentioned in the previous section, at least some example embodiments provide re-shareable and re-editable functionality for MPAV content.
  • a user may contribute the primary video to the MPAV content.
  • one or more users may contribute the comment video(s) to the MPAV content.
  • FIG. 31 is a block diagram illustrating a process (3100) of adding a user-contributed primary video to MPAV content according to one example.
  • the process (3100) includes analyzing (3102, 3106) the user-contributed video to assess its location, angle, and coverage relative to the videos corresponding to other perspectives.
  • the process (3100) also includes synchronizing (3104) the user-contributed video and corresponding audio with the video/audio pairs corresponding to other perspectives.
  • the process (3100) also includes postprocessing (3108) of the updated MPAV content and generating (3110) the corresponding metadata.
  • FIGS. 32A-32C are block diagrams illustrating a process of adding a user-contributed comment video to MPAV content according to one example. More specifically, FIG. 32A is a block diagram illustrating an initial configuration (3200) of the GUI (1900) that the user considers editing by adding a new video comment. The initial configuration (3200) already has four uploaded comments (C0, C1, C2, and C3).
  • FIG. 32B is a block diagram illustrating a new configuration (3202) after the user edits. Therein, the comments (C1, C2) are deleted, and a new comment (C4) is added and uploaded to the server side. The server operates to transcode the new comment video (C4) into a tile-based encoded bitstream.
  • FIG. 32C is a block diagram illustrating a main view (3204) corresponding to the configuration (3202).
  • the main view (3204) has multiple tiles to allow comments from different users to be overlapped to different corners/areas dynamically according to the end users’ preferences.
  • FIG. 33 is a block diagram of an example computing device (3300) according to various examples.
  • the computing device (3300) is configured to perform at least some methods and processes described above.
  • the computing device (3300) of FIG. 33 is illustrated as having a number of components, but any one or more of these components may be omitted or duplicated, as suitable for the application and setting.
  • some or all of the components included in the computing device (3300) may be attached to one or more motherboards and enclosed in a housing.
  • some of those components may be fabricated onto a single system-on-a-chip (SoC) (e.g., the SoC may include one or more electronic processing devices (3302) and one or more storage devices (3304)).
  • the computing device (3300) may not include one or more of the components illustrated in FIG. 33, but may include interface circuitry for coupling to the one or more components using any suitable interface (e.g., a Universal Serial Bus (USB) interface, a High-Definition Multimedia Interface (HDMI) interface, a Controller Area Network (CAN) interface, a Serial Peripheral Interface (SPI) interface, an Ethernet interface, a wireless interface, or any other appropriate interface).
  • the computing device (3300) may not include a display device (3310), but may include display device interface circuitry (e.g., a connector and driver circuitry) to which an external display device (3310) may be coupled.
  • the computing device (3300) includes a processing device (3302) (e.g., one or more processing devices).
  • the terms “electronic processor device” and “processing device” interchangeably refer to any device or portion of a device that processes electronic data from registers and/or memory to transform that electronic data into other electronic data that may be stored in registers and/or memory.
  • the processing device (3302) may include one or more digital signal processors (DSPs), application-specific integrated circuits (ASICs), central processing units (CPUs), graphics processing units (GPUs), server processors, or any other suitable processing devices.
  • the computing device (3300) also includes a storage device (3304) (e.g., one or more storage devices).
  • the storage device (3304) may include one or more memory devices, such as random-access memory (RAM) devices (e.g., static RAM (SRAM) devices, magnetic RAM (MRAM) devices, dynamic RAM (DRAM) devices, resistive RAM (RRAM) devices, or conductive-bridging RAM (CBRAM) devices), hard drive-based memory devices, solid-state memory devices, networked drives, cloud drives, or any combination of memory devices.
  • the storage device (3304) may include memory that shares a die with the processing device (3302).
  • the memory may be used as cache memory and include embedded dynamic random-access memory (eDRAM) or spin transfer torque magnetic random-access memory (STT-MRAM), for example.
  • the storage device (3304) may include non-transitory computer readable media having instructions thereon that, when executed by one or more processing devices (e.g., the processing device (3302)), cause the computing device (3300) to perform any appropriate ones of the methods disclosed herein below or portions of such methods.
  • the computing device (3300) further includes an interface device (3306) (e.g., one or more interface devices (3306)).
  • the interface device (3306) may include one or more communication chips, connectors, and/or other hardware and software to govern communications between the computing device (3300) and other computing devices.
  • the interface device (3306) may include circuitry for managing wireless communications for the transfer of data to and from the computing device (3300).
  • wireless and its derivatives may be used to describe circuits, devices, systems, methods, techniques, communications channels, etc., that may communicate data via modulated electromagnetic radiation through a nonsolid medium.
  • the term does not imply that the associated devices do not contain any wires, although in some embodiments they might not.
  • Circuitry included in the interface device (3306) for managing wireless communications may implement any of a number of wireless standards or protocols, including but not limited to Institute of Electrical and Electronics Engineers (IEEE) standards including Wi-Fi (IEEE 802.11 family), IEEE 802.16 standards, Long-Term Evolution (LTE) project along with any amendments, updates, and/or revisions (e.g., advanced LTE project, ultramobile broadband (UMB) project (also referred to as “3GPP2”), etc.).
  • circuitry included in the interface device (3306) for managing wireless communications may operate in accordance with a Global System for Mobile Communication (GSM), General Packet Radio Service (GPRS), Universal Mobile Telecommunications System (UMTS), High Speed Packet Access (HSPA), Evolved HSPA (E-HSPA), or LTE network.
  • circuitry included in the interface device (3306) for managing wireless communications may operate in accordance with Enhanced Data for GSM Evolution (EDGE), GSM EDGE Radio Access Network (GERAN), Universal Terrestrial Radio Access Network (UTRAN), or Evolved UTRAN (E-UTRAN).
  • circuitry included in the interface device (3306) for managing wireless communications may operate in accordance with Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Digital Enhanced Cordless Telecommunications (DECT), Evolution-Data Optimized (EV-DO), and derivatives thereof, as well as any other wireless protocols that are designated as 3G, 4G, 5G, and beyond.
  • the interface device (3306) may include one or more antennas (e.g., one or more antenna arrays) configured to receive and/or transmit wireless signals.
  • the interface device (3306) may include circuitry for managing wired communications, such as electrical, optical, or any other suitable communication protocols.
  • the interface device (3306) may include circuitry to support communications in accordance with Ethernet technologies.
  • the interface device (3306) may support both wireless and wired communication, and/or may support multiple wired communication protocols and/or multiple wireless communication protocols.
  • a first set of circuitry of the interface device (3306) may be dedicated to shorter-range wireless communications such as Wi-Fi or Bluetooth.
  • a second set of circuitry of the interface device (3306) may be dedicated to longer-range wireless communications such as global positioning system (GPS), EDGE, GPRS, CDMA, WiMAX, LTE, EV-DO, or others.
  • the computing device (3300) also includes battery/power circuitry (3308).
  • the battery/power circuitry (3308) may include one or more energy storage devices (e.g., batteries or capacitors) and/or circuitry for coupling components of the computing device (3300) to an energy source separate from the computing device (3300) (e.g., to AC line power).
  • the computing device (3300) also includes a display device (3310) (e.g., one or multiple individual display devices).
  • the display device (3310) may include any visual indicators, such as a heads-up display, a computer monitor, a projector, a touchscreen display, a liquid crystal display (LCD), a light-emitting diode display, or a flat panel display.
  • the computing device (3300) also includes additional input/output (I/O) devices (3312).
  • the I/O devices (3312) may include one or more data/signal transfer interfaces, audio I/O devices (e.g., microphones or microphone arrays, speakers, headsets, earbuds, alarms, etc.), audio codecs, video codecs, printers, sensors (e.g., thermocouples or other temperature sensors, humidity sensors, pressure sensors, vibration sensors, etc.), image capture devices (e.g., one or more cameras), human interface devices (e.g., keyboards, cursor control devices, such as a mouse, a stylus, a trackball, or a touchpad), etc.
  • various components of the interface devices (3306) and/or I/O devices (3312) can be configured to output suitable control signals, receive suitable control/telemetry signals, and receive and transmit data streams.
  • the interface devices (3306) and/or I/O devices (3312) include one or more analog-to-digital converters (ADCs) for transforming received analog signals into a digital form suitable for operations performed by the processing device (3302) and/or the storage device (3304).
  • the interface devices (3306) and/or I/O devices (3312) include one or more digital-to-analog converters (DACs) for transforming digital signals provided by the processing device (3302) and/or the storage device (3304) into an analog form suitable for being transmitted through a communication channel.
  • an apparatus for streaming multiple-perspective audio and video (MPAV) comprising: at least one processor; and at least one memory including program code, wherein the at least one memory and the program code are configured to, with the at least one processor, cause the apparatus at least to: group a plurality of camera views representing the MPAV into a predetermined number of clusters; for a selected camera view of the plurality of camera views, classify other camera views of the plurality of camera views into one of a respective first class and a respective second class, each of the other camera views classified into the respective first class being from a different respective cluster than a cluster into which the selected camera view has been grouped, the second respective class including remaining ones of the other camera views; and when requested by a player device, stream a camera view classified into the respective first class using a first type of bitstream; stream a camera view classified into the respective second class using a second type of bitstream; and stream the selected camera view using a third type of bitstream, wherein the first, second, and third types of bitstream differ from one another in at least one of a first characteristic and a second characteristic.
  • a method for streaming multiple-perspective audio and video (MPAV) from a server device comprising: grouping a plurality of camera views representing the MPAV into a predetermined number of clusters; for a selected camera view of the plurality of camera views, classifying other camera views of the plurality of camera views into one of a respective first class and a respective second class, each of the other camera views classified into the respective first class being from a different respective cluster than a cluster into which the selected camera view has been grouped, the second respective class including remaining ones of the other camera views; and when requested by a player device, streaming a camera view classified into the respective first class using a first type of bitstream; streaming a camera view classified into the respective second class using a second type of bitstream; and streaming the selected camera view using a third type of bitstream, wherein the first, second, and third types of bitstream differ from one another in at least one of a first characteristic and a second characteristic.
  • the first characteristic is spatial resolution; and wherein the second characteristic is group-of-pictures (GOP) size.
  • the method further comprises selecting the third type of bitstream from a first alternative and a second alternative, the first and second alternatives being characterized by different respective spatial resolutions.
  • the selecting is performed based on one or both of: a number of selected views of the MPAV being streamed to the player device; and a present condition of a network link used to stream the MPAV from the server device to the player device.
  • the method further comprises selecting the predetermined number prior to performing the grouping.
  • a total number of camera views in the respective first class is exactly equal to the predetermined number minus one.
  • classifying the other camera views into the respective first class comprises: for each of the other camera views, determining a respective overlap with the selected camera view; for each of the clusters not including the selected view, ranking a corresponding set of the other camera views based on the respective overlaps; and collecting camera views into the first respective class based on the ranking.
  • the method further comprises performing the streaming using a Quick UDP Internet Connection.
  • the method further comprises: recording view-switching history from the selected camera view performed by a plurality of player devices that requested the selected camera view for streaming; and changing compositions of the respective first and second classes based on the recorded view-switching history.
  • the method further comprises recording view-switching history from the selected camera view performed by a plurality of player devices that requested the selected camera view for streaming, wherein the classifying is based at least in part on the recorded view-switching history.
  • the method further comprises: for each camera view of the plurality of camera views, preparing a corresponding set of bitstreams including a respective bitstream of the first type, a respective bitstream of the second type, and a respective bitstream of the third type.
  • [00174] In some embodiments of any of the above methods, the method further comprises: storing the corresponding sets of bitstreams in a storage container accessible via the server device; and streaming the MPAV using a selected subset of the bitstreams stored in the storage container.
  • the method further comprises: providing to the player device a media presentation description (MPD) of the MPAV stored in the storage container; for a period, providing to the player device a respective initialization segment from the storage container, the respective initialization segment being configured to inform a selection, at the player device, of a subset of the plurality of camera views for which to request media segments for playback; receiving, from the player device, a request identifying the selection and indicating which one or more camera views in the selection are the selected camera views and which one or more camera views in the selection are in the respective first class; and transmitting to the player device the selected subset retrieved from the storage container based on the identified selection.
  • the storage container has a plurality of media segments logically organized in accordance with different camera views and further logically organized in accordance with one or more of different bit rates, different spatial resolutions, different codec types, and different GOP sizes.
  • the selected subset retrieved from the storage container includes media segments corresponding to at least two different representations.
  • the selected subset retrieved from the storage container includes media segments corresponding to at least two different adaptation sets.
  • the storage container has a respective sequence of media segments corresponding to different respective video segment times.
  • the method further comprises changing the selected subset from a first subset to a different second subset when the request indicates a change in the identified selection.
  • the respective initialization segment has a box format compatible with an MPEG DASH specification.
  • the MPD is configured to describe first, second, and third levels of information; wherein the first level includes download information for the selected camera view, the camera views of the respective first class, and the camera views of the respective second class; wherein the second level includes information about version history corresponding to at least some of the camera views; and wherein the third level includes information configured to enable filtering of the MPAV based on user preferences.
  • the MPD includes an MPAV descriptor having a first attribute specifying a list of identifiers of first-class-view adaptation sets or content components that belong to a preselection.
  • the MPAV descriptor further has one or both of: a second attribute specifying a list of identifiers of second-class-view adaptation sets or content components that belong to the preselection; and a third attribute specifying a list of identifiers of third-class-view adaptation sets or content components that belong to the preselection.
  • the MPD includes an MPAV perspective (MPER) descriptor having a set of attributes selected from the group consisting of: a first attribute specifying an identifier of a perspective; a second attribute specifying a textual description configured to annotate the perspective; a third attribute specifying x-, y- and z-coordinates of a position of the perspective in a reference coordinate system; and a fourth attribute specifying an identifier of an adaptation set or content component having a representation associated with the perspective.
  • the server device is configured to support a plurality of playback modes for the player device including: a playback mode having a single selected camera view and at least one camera view of the respective first class; a playback mode having two or more selected camera views; and a loop-viewing playback mode having two or more selected camera views.
  • for the loop-viewing playback mode, the server device is configured to enable the playback device to download, as a loop count increases, progressively higher resolution bitstreams for a set of camera views selected for the loop-viewing playback mode.
  • the method further comprises queueing the first, second, and third types of bitstreams in different respective queues having different respective priorities.
  • the method further comprises: receiving from the player device a request to add to the MPAV a user-contributed video representing an additional camera view; and updating the MPAV to include into a corresponding updated plurality of camera views the additional camera view and the plurality of camera views.
  • the updating comprises: grouping the corresponding updated plurality of camera views into the predetermined number of clusters; for a selected camera view of the corresponding updated plurality of camera views, classifying other camera views of the corresponding updated plurality of camera views into one of a corresponding first class and a corresponding second class, each of the other camera views classified into the corresponding first class being from a different respective cluster than a cluster into which the selected camera view has been grouped, the second corresponding class including remaining ones of the other camera views; and when requested by a player device, streaming a camera view classified into the corresponding first class using the first type of bitstream; and streaming a camera view classified into the corresponding second class using the second type of bitstream.
  • the method further comprises: receiving from the player device a request to add to the MPAV a user-contributed comment video corresponding to the selected camera view; and updating the MPAV to include the user-contributed comment video.
  • the updating comprises: converting the user-contributed comment video into a tile-based bitstream; and when requested by the player device, merging the tile-based bitstream with bitstreams that are being streamed to the player device.
  • a non-transitory computer-readable medium storing instructions that, when executed by an electronic processor, cause the electronic processor to perform operations comprising the method of any one of the above methods.
  • an apparatus for streaming multiple-perspective audio and video comprising: at least one processor; and at least one memory including program code, wherein the at least one memory and the program code are configured to, with the at least one processor, cause the apparatus at least to: receive a media presentation description (MPD) of the MPAV stored in a storage container accessible via a server device; build a graphical user interface (GUI) based on the MPD, the GUI being configured to: enable a user to select a camera view from a plurality of camera views representing the MPAV and further select a playback mode from a plurality of MPAV playback modes; and present a predetermined number of clusters into which different camera views of the plurality of camera views are grouped; upon receiving an indication of the selected camera view through the GUI, determine classification of other camera views of the plurality of camera views into
  • a method for streaming multiple-perspective audio and video (MPAV) with a playback device comprising: receiving a media presentation description (MPD) of the MPAV stored in a storage container accessible via a server device; building a graphical user interface (GUI) based on the MPD, the GUI being configured to: enable a user to select a camera view from a plurality of camera views representing the MPAV and further select a playback mode from a plurality of MPAV playback modes; and present a predetermined number of clusters into which different camera views of the plurality of camera views are grouped; upon receiving an indication of the selected camera view through the GUI, determining classification of other camera views of the plurality of camera views into one of a respective first class and a respective second class, each of the other camera views classified into the respective first class being from a different respective cluster than
  • the playback mode is selected from the group of playback modes consisting of a playback mode having a single selected camera view and at least one camera view of the respective first class; a playback mode having two or more selected camera views; and a loop-viewing playback mode having two or more selected camera views.
  • in the loop-viewing playback mode, the playback device is configured to download, as a loop count increases, progressively higher resolution bitstreams for a set of camera views selected for the loop-viewing playback mode.
  • a camera view classified into the respective first class is streamed using a first type of bitstream; wherein a camera view classified into the respective second class is streamed using a second type of bitstream; wherein the selected camera view is streamed using a third type of bitstream; and wherein the first, second, and third types of bitstream differ from one another in at least one of a first characteristic and a second characteristic.
  • the first characteristic is spatial resolution; and wherein the second characteristic is group-of-pictures (GOP) size.
  • the third type of bitstream is selected from a first alternative and a second alternative, the first and second alternatives being characterized by different respective spatial resolutions.
  • the first or second alternative is selected based on one or both of: a number of selected views of the MPAV being streamed to the player device; and a present condition of a network link used to stream the MPAV from the server device to the player device.
  • the plurality of bitstreams is received via a Quick UDP Internet Connection.
  • the method further comprises: for a period, receiving a respective initialization segment from the storage container, the respective initialization segment being configured to inform a selection, at the player device, of a subset of the plurality of camera views for which to request media segments for playback; sending to the server device a request identifying the selection and indicating which one or more camera views in the selection are the selected camera views and which one or more camera views in the selection are in the respective first class; and receiving via the server device the selected subset retrieved from the storage container based on the identified selection.
  • the storage container has a plurality of media segments logically organized in accordance with different camera views and further logically organized in accordance with one or more of different bit rates, different spatial resolutions, different codec types, and different GOP sizes.
  • the storage container has a respective sequence of media segments corresponding to different respective video segment times.
  • the method further comprises changing the selected subset from a first subset to a different second subset based on a user input received through the GUI.
  • the respective initialization segment has a box format compatible with an MPEG DASH specification.
  • the MPD is configured to describe first, second, and third levels of information; wherein the first level includes download information for the selected camera view, the camera views of the respective first class, and the camera views of the respective second class; wherein the second level includes information about version history corresponding to at least some of the camera views; and wherein the third level includes information configured to enable filtering of the MPAV based on user preferences.
  • the method further comprises: sending to the server device a request to add to the MPAV a user-contributed video representing an additional camera view; and receiving from the server device an updated MPD corresponding to an updated plurality of camera views including the additional camera view.
  • the method further comprises: sending to the server device a request to add to the MPAV a user-contributed comment video corresponding to the selected camera view; and receiving from the server device an updated MPD corresponding to the updated MPAV that includes the user-contributed comment video.
  • a non-transitory computer-readable medium storing instructions that, when executed by an electronic processor, cause the electronic processor to perform operations comprising any one of the above methods.
  • Some embodiments can be embodied in the form of methods and apparatuses for practicing those methods. Some embodiments can also be embodied in the form of program code recorded in tangible media, such as magnetic recording media, optical recording media, solid state memory, floppy diskettes, CD-ROMs, hard drives, or any other non-transitory machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the patented invention(s).
  • Some embodiments can also be embodied in the form of program code, for example, stored in a non-transitory machine-readable storage medium, including being loaded into and/or executed by a machine, wherein, when the program code is loaded into and executed by a machine, such as a computer or a processor, the machine becomes an apparatus for practicing the patented invention(s).
  • when implemented on a general-purpose processor, the program code segments combine with the processor to provide a unique device that operates analogously to specific logic circuits.
  • references herein to “one embodiment” or “an embodiment” mean that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the disclosure.
  • the appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments necessarily mutually exclusive of other embodiments.
  • the phrase “if it is determined” or “if [a stated condition] is detected” may be construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event].”
  • the terms “couple,” “coupling,” “coupled,” “connect,” “connecting,” or “connected” refer to any manner known in the art or later developed in which energy is allowed to be transferred between two or more elements, and the interposition of one or more additional elements is contemplated, although not required. Conversely, the terms “directly coupled,” “directly connected,” etc., imply the absence of such additional elements.
  • the term compatible means that the element communicates with other elements in a manner wholly or partially specified by the standard and would be recognized by other elements as sufficiently capable of communicating with the other elements in the manner specified by the standard.
  • the compatible element does not need to operate internally in a manner specified by the standard.
  • the functions of the various elements shown in the figures, including any functional blocks labeled as “processors” and/or “controllers,” may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared.
  • the terms “processor” and “controller” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read only memory (ROM) for storing software, random access memory (RAM), and nonvolatile storage. Other hardware, conventional and/or custom, may also be included.
  • any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
  • the term “circuitry” may refer to one or more or all of the following: (a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry); (b) combinations of hardware circuits and software, such as (as applicable): (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions; and (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation.
  • This definition of circuitry applies to all uses of this term in this application, including in any claims.
  • circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware.
  • circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Methods and apparatus for streaming MPAV content. Perspective analysis and camera view grouping and categorization are used to enable the player device to build a graphical user interface suited to navigating among various camera view selections, and are further used to assign different coding parameters to different MPAV bitstreams and to implement different download strategies in different playback modes. Some examples also provide an editing mode which can advantageously be used to grow the MPAV content and give it re-sharing and re-editing capabilities.
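To make the mechanism summarized in the abstract concrete, the short Python sketch below shows how camera views grouped into classes could drive per-bitstream coding parameters and per-playback-mode download strategies. It is a minimal illustration only: the class names, bitrate and GOP values, strategy labels, and playback modes are hypothetical and are not taken from the publication.

# Illustrative sketch only: all names and values below are hypothetical.
from dataclasses import dataclass
from enum import Enum


class ViewClass(Enum):
    PRIMARY = "primary"          # e.g., the main broadcast perspective
    NEARBY = "nearby"            # views close to the currently selected one
    BACKGROUND = "background"    # remaining views, fetched opportunistically


@dataclass
class CameraView:
    view_id: str
    view_class: ViewClass


# Hypothetical per-class coding parameters (bitrate in kbit/s, GOP length in frames);
# views a user is likely to switch to get a shorter GOP for faster random access.
CODING_PARAMS = {
    ViewClass.PRIMARY: {"bitrate_kbps": 8000, "gop_len": 32},
    ViewClass.NEARBY: {"bitrate_kbps": 3000, "gop_len": 16},
    ViewClass.BACKGROUND: {"bitrate_kbps": 800, "gop_len": 64},
}

# Hypothetical download strategies per playback mode.
DOWNLOAD_STRATEGY = {
    "live": {
        ViewClass.PRIMARY: "stream_full_quality",
        ViewClass.NEARBY: "prefetch_low_bitrate",
        ViewClass.BACKGROUND: "thumbnails_only",
    },
    "editing": {
        ViewClass.PRIMARY: "stream_full_quality",
        ViewClass.NEARBY: "stream_full_quality",
        ViewClass.BACKGROUND: "prefetch_low_bitrate",
    },
}


def plan_downloads(views, playback_mode):
    """Return (view_id, coding parameters, download strategy) for each camera view."""
    strategy_by_class = DOWNLOAD_STRATEGY[playback_mode]
    return [
        (v.view_id, CODING_PARAMS[v.view_class], strategy_by_class[v.view_class])
        for v in views
    ]


if __name__ == "__main__":
    views = [
        CameraView("cam_main", ViewClass.PRIMARY),
        CameraView("cam_left_wing", ViewClass.NEARBY),
        CameraView("cam_crowd", ViewClass.BACKGROUND),
    ]
    for row in plan_downloads(views, "live"):
        print(row)

Under these assumptions, switching the mode argument from "live" to "editing" simply swaps in a different strategy table, which is one way the per-mode behavior described in the abstract could be realized.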
PCT/US2025/024727 2024-04-30 2025-04-15 Streaming of multi-perspective audio and video data Pending WO2025230719A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202463640663P 2024-04-30 2024-04-30
US63/640,663 2024-04-30

Publications (1)

Publication Number Publication Date
WO2025230719A1 (fr) 2025-11-06

Family

ID=95784167

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2025/024727 Pending WO2025230719A1 (fr) 2024-04-30 2025-04-15 Streaming of multi-perspective audio and video data

Country Status (1)

Country Link
WO (1) WO2025230719A1 (fr)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180241988A1 (en) * 2015-10-26 2018-08-23 Huawei Technologies Co., Ltd. Multi-View Video Transmission Method and Apparatus
US20200389640A1 (en) * 2018-04-11 2020-12-10 Lg Electronics Inc. Method and device for transmitting 360-degree video by using metadata related to hotspot and roi
US20230308639A1 (en) * 2019-01-02 2023-09-28 Nokia Technologies Oy Apparatus, a Method and a Computer Program for Video Coding and Decoding

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
INDUSTRY FORUM: "GUIDELINES", USA, 28 January 2021 (2021-01-28), XP052182113, retrieved from the Internet: <URL:https://ftp.3gpp.org/tsg_sa/WG4_CODEC/TSGS4_112-e/Docs/S4-210009.zip> [retrieved on 2021-01-28] *

Similar Documents

Publication Publication Date Title
US11330311B2 (en) Transmission device, transmission method, receiving device, and receiving method for rendering a multi-image-arrangement distribution service
RU2728904C1 Method and device for controlled selection of the observation point and orientation of audiovisual content
CN110431850B Signaling important video information in network video streaming using MIME type parameters
CN108476324B Method, computer, and medium for enhancing a region of interest in video frames of a video stream
CN109155873B Method, apparatus, and computer program for improving the streaming of virtual reality media content
KR101777348B1 Data transmission method and apparatus, and data reception method and apparatus
CN102132562B Method and apparatus for grouping of tracks and track subsets
US10863211B1 (en) Manifest data for server-side media fragment insertion
US11665219B2 (en) Processing media data using a generic descriptor for file format boxes
US20150124048A1 (en) Switchable multiple video track platform
US11805303B2 (en) Method and apparatus for storage and signaling of media segment sizes and priority ranks
JP2020526982A Region-wise packing, content coverage, and signaling frame packing for media content
JP7035088B2 High-level signaling for fisheye video data
US20190014350A1 (en) Enhanced high-level signaling for fisheye virtual reality video in dash
US12074934B2 (en) Method and apparatus for grouping entities in media content
CN114930869B Method, apparatus, and computer program product for video encoding and video decoding
CN117296317A Media file processing method and device therefor
US20180227504A1 (en) Switchable multiple video track platform
KR101944601B1 Method for identifying objects across periods, and corresponding device
WO2025230719A1 (fr) Streaming of multi-perspective audio and video data
EP3777219B1 Method and apparatus for signaling and storage of multiple viewpoints for omnidirectional audiovisual content
CN118633292A Signaling for picture-in-picture in media container files and streaming manifests
CN117256135A Addressable resource index events for CMAF and DASH multimedia streaming