WO2025034759A1

WO2025034759A1 - Audio-visual-coding optimization of joint pipeline design using mv-hevc and sound object for head-mounted displays

Info

Publication number: WO2025034759A1
Application number: PCT/US2024/041130
Authority: WO
Inventors: Guan-Ming Su; Peng Yin; Robin Atkins
Original assignee: Dolby Laboratories Licensing Corp
Current assignee: Dolby Laboratories Licensing Corp
Priority date: 2023-08-09
Filing date: 2024-08-06
Publication date: 2025-02-13
Anticipated expiration: 2026-02-09

Abstract

Methods and systems are described for optimizing a coding for multiview video stream are described. In one embodiment, an encoder receives a video source with multiple views. The encoder computes disparity statistics for a plurality of stereo pairs of the multiple views of the video source, wherein the number of multiple views is greater than or equal to three. The encoder further encodes the video stream using the disparity statistics, wherein the encoded video stream includes the plurality of stereo pairs. The encoder transmits the encoded video stream.

Description

AUDIO-VISUAL-CODING OPTIMIZATION OF JOINT PIPELINE DESIGN USING MV-HEVC AND SOUND OBJECT FOR HEAD-MOUNTED DISPLAYS

[0001] The present application claims the benefit of priority from European Patent Application No.23196284.6, filed on 8 September 2023 and U.S. Provisional Patent Application Ser. No.63/518,525, filed on 09 Aug.2023, each of which is incorporated by reference herein in its entirety. TECHNOLOGY [0002] The present invention relates generally to encoding video. More particularly, an embodiment of the present invention relates to optimizing an audio- visual-coding of a joint Pipeline Design using MV-HEVC and Sound Object for HMD. BACKGROUND [0003] Multi-View HEVC (MV-HEVC) defines a 3D-extention from existing HEVC codec. MV-HEVC allows encoding multiple views in a bit stream. For example, assuming that there are M views, there will be a primary view using intra-view coding method (I view). The rest of views will be encoded via inter- view method, namely, prediction from other existing decoded views stored in a decoded picture buffer (DPB). There are two different inter-view prediction methods: prediction view (P-view), which allows one disparity vector to predict from any views in the DPB for each block; and bidirectional predicted view (B- view), which allows two disparity vectors to predict from any views in the DPB for each block. The coding efficiency depends on the similarity among views. [0004] MV-HEVC can be used to encode stereoscopic content, where there are different views for both left and right eyes separated by a disparity d. Due to the similarity between the two views, the inter-view prediction of MV-HEVC is able to achieve better compression ratios than separately encoding two views. BRIEF DESCRIPTION OF THE DRAWINGS [0005] The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings in which like references indicate similar elements. [0006] Figure 1 shows an example of a system that can be used in one or more embodiments the invention. [0007] Figure 2 shows an example of a set of views for multi-view video. [0008] Figure 3 shows an example of an optical setting of a head-mounted display that can be used with one or more embodiments of the invention. [0009] Figures 4A-B shows examples of multiple views with different user ( _^^^^, ^_^^^^) that can be used with one or more embodiments of the invention. [0010] Figure 5 shows, in a flow diagram, an example of a visual optimization that can be used with one or more embodiments of the invention. [0011] Figure 6 shows an example of a disparities with different user ( _^^^^, ^_^^^^) that can be used with one or more embodiments of the invention. [0012] Figure 7 shows an example of user disparities bounds that can be used with one or more embodiments of the invention. [0013] Figure 8 shows, in a flow diagram, an example of optimizing Multiview video coding that can be used with one or more embodiments of the invention. [0014] Figures 9A-B shows an example of P-views and B-views that are generated for multiview video that can be used with one or more embodiments of the invention. [0015] Figure 10 shows an example of a spatial audio object coordinate system that can be used with one or more embodiments of the invention. [0016] Figure 11 shows, in a flow diagram, an example of a spatial audio object adjustment method that can be used with one or more embodiments of the invention. [0017] Figure 12 shows an example of a data processing system that can be used to perform or implement one or more embodiments of the invention. DETAILED DESCRIPTION [0018] Various embodiments and aspects will be described with reference to details discussed below, and the accompanying drawings will illustrate the various embodiments. The following description and drawings are illustrative and are not to be construed as limiting. Numerous specific details are described to provide a thorough understanding of various embodiments. However, in certain instances, well- known or conventional details are not described in order to provide a concise discussion of embodiments. [0019] Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in conjunction with the embodiment can be included in at least one embodiment. The appearances of the phrase "in one embodiment" in various places in the specification do not necessarily all refer to the same embodiment. The processes depicted in the figures that follow are performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), software, or a combination of both. Although the processes are described below in terms of some sequential operations, it should be appreciated that some of the operations described may be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially. [0020] The embodiments described herein can be used to optimize coding for a multiview video stream for transmission to another device (e.g., a device with a decoder, such as a head mounted display (HMD)). In one embodiment, a method is described that can provide an efficient way of coding multiview video source for transmission using disparity statistics between stereo pairs of the multiview video. [0021] In one embodiment, an HMD end-to-end pipeline is described that (1) preserves director intent, (2) provides wide range of personalized stereo video viewing comfort, (3) achieves better coding efficiency using more than 2 views of MV-HEVC codec, and (4) shares sound objects with re-location for bit rate saving. The proposed framework jointly considers the volumetric audio-visual content creation process, audio-visual coding efficiency in distribution process, and audio- visual experience (viewing comfort and correct sounding position) in viewing consumption process. [0022] In this embodiment, MV-HEVC is extended to encode more than two views separated by different levels of disparity d’. This can allow for improved viewing comfort, since different individuals may find different levels of disparity more comfortable due to a combination of physiology and preference. In addition, a system and method for encoding more than two views for improved comfort of stereoscopic viewing, while imposing practical constraints on the predictions of secondary views from primary views to ensure that the content can be deployed to a wide range of playback devices with minimal memory and computations resources. [0023] One way to address the viewing comfort via scaling and shifting the received 2-view MV-HEVC bitstreams with metadata as described in “HEAD- MOUNTED DISPLAY ADJUSTMENT METHODS AND SYSTEMS”, U.S. Provisional Patent Application No.63/507,726, filed on 12 June 2023. The actual viewed content can be scaled and cropped with padded black areas, which might not be the creators’ intent. To avoid modifying the original creators’ intent, one naïve solution is to use 2+ stereo pair MV-HEVC (e.g., 2m view, where m is an integer), and each stereo view is designed to provide viewing comfort to directly fit for (1) users having certain range of IPD and (2) HMD devices having certain range of optical/screen configurations. However, to provide satisfactory viewing comfort to a wide distribution of human eyes and HMD parameters, the required number of views (e.g., m) coded in MV-HEVC could be very large. This would result in a very high bit rate for the encoded MV-HEVC bitstream and require a very large storage at the server side. The naïve solution is not practical in real deployment. [0024] Instead, a MV-HEVC system based on the content disparity is described. For an M view MV-HEVC, an end user can select any two views as a stereo pair to watch. For example, and in one embodiment, if a MV-HEVC video source has 4 views, there are a possible of 6 stereo pairs. In this example, the different stereo pairs are (left view ID, right view ID): (0,1); (0,2); (0,3); (1,2); (1,3); and (2,3). To meet the viewing comfort requirement, metadata describing each stereo pair is provided. At the decoder side, the end user will select the optimal stereo pair which (1) maintains the viewing comfort and (2) maximizes the perceived depth range according to the metadata. At the encoder side, an integrated system provides the disparity design guidance to content creators during multi novel view generation. The content disparity information is measured for each stereo pair after the content creator renders one scene of multi-view content. In addition, the integrated system collects the statistics of human perceptual systems among a large group of subjects and optical/screen HMD parameters from the potential markets to generate the distribution of user profiles. Those user profiles can help the content creators to understand the distribution of each stereo pair viewing hit rate and enable the iterative content creation to meet the targeted distribution. With a set of multi-view frame and corresponding content disparity between stereo pair, the integrated system determines the coding structure in MV-HEVC to optimize the coding efficiency. In this embodiment, the content is not changed at all at the decoder side (namely, no scaling, no shifting). Instead, the content creator renders the view and approve the final look. This preserves the content creators’ intent, at the cost of storing and sending more than 2 views MV-HEVC. While in one embodiment, MV-HEVC is the multiview video source, in alternate embodiments, other types of multiview video source can be used (e.g., multiview coding (MVC), multilayer VVC, or another multiview type of coding). [0025] In a further embodiment, because different stereo pairs (along with a user and viewing environment) provides different perceived depth, and the sound objects will be different from the original design. Additional metadata and algorithm to re- locate the sound object are proposed as well. By addressing both visual and sound, the proposed method can provide much more immersive and/or comfortable experience. [0026] The embodiments described herein can be used in apparatuses which include one or more processors in a processing system, and which include memory and which are configured to perform any one of the methods described herein. Moreover, the embodiments described herein can be implemented using non-- transitory machine-readable storage media storing executable computer program instructions which when executed by a machine cause the machine to perform any one of the methods described herein. [0027] Figure 1 shows an example of a system 100 that can be used in one or more embodiments the invention. In Figure 1, the system 100 includes a multiview video source 102 that is fed to an encoder 104. In one embodiment, multiview video source 102 is a video source with multiple views of the same subject. In one embodiment, the multiview source has two or more video streams for the same subject that are provided by the content creator. In another embodiment, the multiview video source 102 is another type of multiview video source. In a further embodiment, the encoder 104 is a device that can encode the multiview video source 102 by converting an analog or digital video to another digital video format that can be used to deliver to the encoded video to a decoder 112. In this embodiment, the encoder 104 can be a server, personal computer, laptop, camera, smartphone, or another device that can encode a stereo source 102. In one embodiment, the stereo source 102 is a video source that can produce a three-dimensional image in a moving form. While in one embodiment, the encoder 104 encodes the multiview source 102 to a Multiview High Efficiency Video Coding (MV-HEVC) standard, in alternate embodiment, the encoder 104 can encode the multiview source 102 to another type encoding standard. [0028] In a further embodiment, the encoder 104 includes a video optimizer 106 that optimizes the coding of the multiview video source 102 by using disparities statistics between stereo pairs of the multiview video source 102. Using the disparities statistics to optimizes the coding of the multiview video source 102 is further described in Figure 5 below. The encoder 104 further incudes an audio processor 108 that adjusts the spatial audio object for stereo pairs of the multiview video source 102 and is further described in Figure 11 below. [0029] In one embodiment, for an M view MV-HEVC, a user can take any two views as a stereo pair to watch. Figure 2 shows an example of a set of views for multi-view video. In Figure 2 there are 4 views 200A-D, and the combination to take any two views out to form a stereo pair is 6, namely, (left view ID, right view ID) as (0,1), (0,2), (0,3), (1,2), (1,3), and (2,3). To meet the viewing comfort requirement, metadata describing each stereo pair is provided. At the decoder side, the end user will select the optimal stereo pair which (1) maintains the viewing comfort and (2) maximizes the perceived depth range according to the metadata. At the encoder side, an integrated system provides the disparity design guidance to content creators during multi novel view generation. The content disparity information is measured for each stereo pair after the content creator renders one scene of multi-view content. In addition, the integrated system collects the statistics of human perceptual systems among a large group of subjects and optical/screen HMD parameters from the potential markets to generate the distribution of user profiles. Those user profiles can help the content creators to understand the distribution of each stereo pair viewing hit rate and enable the iterative content creation to meet the targeted distribution. With a set of multi-view frame and corresponding content disparity between stereo pair, the integrated system determines the coding structure in MV-HEVC to optimize the coding efficiency. In this embodiment, the content is not changed at all at the decoder side (namely, no scaling, no shifting). Instead, the content creator renders the view and approve the final look. This preserves the content creators’ intent, at the cost of storing and sending more than 2 views MV-HEVC. [0030] In one embodiment, it is also possible to adapt the optimized stereo pair given a known viewer head rotation^^*LYHQ^D^NQRZQ^YLHZHU^,3'^DQG^NQRZQ^KHDG^URWDWLRQ^ڧ^ (measured via camera or other sensor system known in the art), both a horizontal and vertical view separation can be computed. The horizontal view separation becomes IPDh = ,'3^^^FRV^ڧ^^^DanG^WKH^YHUWLFDO view separation becomes IPDv ^,3'^^^VLQ^ڧ^^^I,f^WKH^ content only contains horizontal views, then the distance IPDh can be used in lieu of the viewer IPD. If the content contains both horizontally and vertically separated views then both the IPDh and IPDv can be used to select the optimum view. [0031] In a further embodiment, the content viewpoints may not align exactly with the desired IPD. In this embodiment, the playback device (e.g., the viewing device as described in Figure 1 above) may select a viewpoint nearest or close to the desired IPD. Other more capable playback devices may perform an interpolation from two or more near viewpoints to interpolate a viewpoint aligned precisely with the desired IPD. Such viewpoint interpolation algorithms are known in the art, based on depth-based novel view interpolation or other algorithms. [0032] In one embodiment, the encoder 104 sends the encoded multiview video to the decoder 110. In this embodiment, the decoder 112 decodes the encoded multiview video output by the encoder 104 that can be used to output to a viewing device 110. The decoder 112 can be a server, personal computer, laptop, camera, smartphone, or another device that can decode the encoder stereo video. In one embodiment, the decoder 112 is part of a viewing device 110 that includes a screen for outputting the decoded stereo video. In one embodiment, the viewing device 110 is a head mounted display (HMD), where an HMD is a display device worn on the head with a display 114 in front of one or both eyes of a user. In this embodiment, the viewing device 110 includes metadata for screen size, screen distance, pixels per screen, IPD, and or other metadata regarding the HMD. Furthermore, this metadata can be used by the viewing device to determine a zone of comfort for the user of the HMD. In one embodiment, the viewing device 110 can apply a post-processing factor to the decoded video so that the resulting stereo video is more comfortable or more exciting with increased sense of depth to the HMD users. While in one embodiment, the viewing device 110 includes a display, in alternate embodiments, the viewing device 110 can output decoded video to a separate display (e.g., a 3D display). Furthermore, while in one embodiment, the viewing device is an HMD, in alternate embodiments, the viewing device be an alternate type of device with a display (e.g., a 3D laptop). [0033] In one embodiment, it is believed that the viewing discomfort in stereo video experience comes from the vergence and accommodation conflict (VAC). Vergence is the rotation of the eyes toward or away from one another. The eyes’ lines of sight rotate toward one another—e.g., converge—when shifting gaze from a far to a near object, and rotate away from one another—diverge—when shifting from near to far. Vergence is quantified by the vergence distance which is the distance from the eyes to the intersection of the lines of sight. Accommodation is the adjustment of the eye’s optics to bring an object into focus on the retina. It is achieved by adjusting the focal length of the eye’s crystalline lens. When shifting gaze from a near to a far object, focal length is increased. When shifting from far to near, focal length is decreased. Accommodation is quantified by the accommodative distance which is the distance from the eye to the focal plane. The accommodation distance is also called focal distance, screen distance, or viewing distance. A natural viewing occurs when the accommodation distance is equal or nearly equal to the vergence. [0034] In a further embodiment, experimental results draw upper bound and lower bound to satisfy the viewing comfort between accommodation distance and vergence distance. Denote ^_^,^^^ and ^_^,^^^^ as the accommodation distance for upper and lower curves in meters, respectively. The vergence distance in meters is denoted as ^_௩

where ^_^^^= 1.129, _^^^^= 0.442, ^_^^^^= 1.035, _^^^^^= -0.626. [0035] Figure 3 shows an example of an optical setting 300 of an HMD that can be used with one or more embodiments of the invention. In Figure 3, the ^ (314) refers to the virtual screen size. The ^_௩ (310) and ^_^(312) each refer to the vergence distance and the distance to the virtual screen, respectively. The ^_^(306) and ^_ோ(308) each refer to the coordinate of the matching objects on the left and right view, respectively. In addition, the IPD (302) is the interpupillary distance and is illustrated between eyes 304A-B. [0036] In one embodiment, the disparity, s, as the pixel coordinate difference between one point in the left view and the corresponding point in the right view. The disparity statistics are a set of disparities between corresponding views in a stereo pair. ^^{= ^} _^ ^{െ ^} _ோ ⁽³⁾ According to the similar triangle property, Rearrange the equation, which leaves ^ ^{^^^^^} ௩₌ _{^ .}

The final disparity bounds to HMD case as follows:

[0037] The above disparity is represented in meters. For more practical usage, the disparity can often express in terms of pixels. Denote the horizontal screen size in meters as _^^ and the horizontal pixel resolution per each screen as

[0038] The disparity of an image should stay inside bound [^_^^^[^^^] ^_^^^^[^^^]] to maintain the viewing comfort. As the equations suggested, there are four parameters to determine this bound

[0039] In one embodiment, with M views of the multiview video stream, any two of the views can be selected for a 3D stereo view for an end user to watch. For example, view a and view b. In fact, the combinations, N, to have different two views can be expressed as:

[0040] For M=4 views, there will be N=6 pairs. In each stereo pair, an end user will perceive different depth and experience different viewing comfort. From the above discussion, the viewing comfort can be measured in terms of disparity. Following the method described (e.g., fusing Semi Global Matching (SGM) and SURF matching), the content disparity (in terms of pixel unit) between view a and view b as ^{(^,^)} ^_^^^

For N pairs, ^ =

^^{(^,^)} ^_^^^)}. This set of disparity information ^ will be stored as metadata in the MV-HEVC bitstream. [0041] In MV-HEVC, the multiple views can be arranged horizontally from most left view to the rightest view with index from 0 to M-1. The disparity range (e.g.,

െ i_{ncreases when the view distance} ^| _{^ െ ^} ^| _{increases. When} ^| _{^ െ ^} ^| _{= 1, this stereo} pair is the neighboring views [0042] At the playback side, according to user’s IPD, viewing distance, and display information, the personalized bound _^^^^ and ^_^^^^ can be computed using equations (7) and (8) by assigning

= ^_^^^[^^^] and ^_^^^^ = ^_^^^^[^^^]. The optimal selected stereo pair from ^ by given an

is to first find the pair whose disparity is inside users’ bound, e.g.

[0043] Then, among pairs inside ^^{^}, choose the pair having the largest disparity span, ^{i.e., |^(^,^) ( )} ^{^^^^ െ ^,^} ^{^^^^ |:}

[0044] In other words, the optimal stereo pair is the one having the largest disparity span which do not violate the users’ viewing comfort. Figures 4A-B shows examples of multiple views with different user

that can be used with one or more embodiments of the invention. In Figure 4A, one example in the following figure where there are 3 views with a given user

There are three stereo possible stereo pairs (0,1), (0,2), (1,2) with corresponding disparity spans (404A-C and 406A-C). In Figure 4A, only one of the disparity spans is within the constraints (408A-B), namely (0,1). As illustrated, there is one stereo pair satisfying the constraint, the qualified set ^^{^}is _{{(0,1)}. Thus, the optimal stereo pair will be} ⁽ _^ ^כ _{, ^} ^כ) _{= (0,1).} [0045] In contrast, in Figure 4B. with a different user ( _^^^^, ^_^^^^). The qualified set ^^{^}is {(0,1), (1,2)}. In these two pairs, (1,2) provides larger disparity range. Thus, we choose (^^כ, ^^כ) = (1,2). There are three stereo possible stereo pairs (0,1), (0,2), (1,2) with corresponding disparity spans (454A-C and 456A-C). In Figure 4B, there are two possible disparity spans within the constraints (458A-B), (0,1) and (1,2). The stereo pair chosen here is the one with the largest disparity span, namely (1,2). [0046] With this design, and in one embodiment, the system just needs to provide multiple views with information ^

At the decoder side, the decoder can compute ( _^^^^, ^_^^^^) and use equation (11) to get the best perceived depth with viewing comfort. [0047] Figure 5 shows, in a flow diagram, an example of a visual optimization process 500 that can be used with one or more embodiments of the invention. In Figure 5, process 500 begins by creating the M views. In one embodiment, a content creator creates the M views. For example, and in one embodiment, the content creator first creates M views and uses disparity measurement module to compute the content disparity for each stereo pair. At block 504, process 500 computes the disparity for each stereo pair. In one embodiment, the disparity is computed (e.g., fusing Semi Global Matching (SGM) and SURF matching as described in “HEAD-MOUNTED DISPLAY ADJUSTMENT METHODS AND SYSTEMS,” U.S. Patent Application No. 63/507,726 filed on 12 June 2023). With the user profile and content disparity, process 500 computes the estimated viewing hit distribution for each stereo pair at block 506. If the distribution does not meet the content creator’s target or marketing target (508), process 500 moves to block 502, where the content creator needs to go back to content creation process to re-generate M views and goes through above process again. The loop stops when the final distribution is satisfied. [0048] Process 500 passes the stereo hit distribution to determine the MV-HEVC coding structure, namely, I/P/B-view and coding dependency at block 510. In one embodiment, process 500 uses the stereo hit distribution to determine an optimized coding way to encode the MV-HEVC coding structure. This is further described in Figure 8 below. Process 500 optionally determines if the bit rate is within a target bitrate (while maintaining the minimal quality requirement) at block 512. If the final coding structure with required picture quality does not meet the bit rate target, execution proceeds to block 502 above. If the MV-HEVC coding structure meets the bit rate requirement with satisfactory stereo quality, process 500 outputs the MV-HEVC bitstream and metadata. [0049] In one embodiment, at block 516, process 500 (as shown on top-right corner of above figure) which takes distribution of user and HMD device information is used to compute the user profile. In one embodiment, this will be just computed once and stored/used in the system. [0050] In another embodiment, the distribution of ( _^^^^, ^_^^^^) among a large group of users is acquired since this can help understand the viewing comfort distribution, and thus to determine how to render the views at the content creation side. [0051] As discussed above, there are four parameters, { ^^^,

_^^, _^^}., to determine this viewing comfort bound, where each parameter has its own distribution, namely, _^^^(), _^^(), _^^^(), and _^^^(). In one embodiment, the distribution of IPD can be modelled as normal distribution with mean 63.36 mm and standard deviation 3.832 mm.

[0052] Some examples of distributions are provided here, but the actual distribution can be obtained via market research. The eye to virtual screen distance can be modelled as uniform distribution between 1 m to 3 m. ^_^()~^(1, 3) (13) [0053] Furthermore, one can assume that the horizontal screen size _^^ and the horizontal pixel resolution per each screen _^^ as uniform distribution. ^_^^()~^(2.16, 3.16) (14) ^_^^()~^(1664, 2176) (15) [0054] From equations (5)-(8), with R randomly sampling from distribution _^^^(), ^_^^(), and _^^^(), the user portfolio is as follows:

random user profiles can be sorted by first sorting the

. Denote the sorted set of ^ = ^ _^^^^(^), ^_^^^^(^)^. Figure 6 shows an example of a disparities with different user ^_^^^^) that can be used with one or more embodiments of the invention. The upper curve is the ^_^^^^(602A) and the lower curve _^^^^(602B). In one embodiment, Figure 6 illustrates that the first sorted profile index (as 1) has the largest disparity range with value

= (-82.5520,115.5048) and that the last sorted profile index (as 10,000) has the smallest disparity range with va lu e ^_^ ^{^} ^^{^} ^^{^} ^ ൯ = (-17.1617,17.4108). Here, ^ _^ ^{^} ^_^ ^{^} ^^{^}, ^_^ ^{^} ^^{^} ^^{^} ^ ൧ represents the minimum viewing comfort range that the content disparity should stay inside the range to avoid bringing discomfort. In other words, among all N stereo pairs, we should at least provide one stereo pair which satisfies this requirement. All users who watch this stereo pair should experience viewing comfort. In addition, ^ _^ ^{^} ^_^ ^{^} ^^௫, ^_^ ^{^} ^^{^} ^^௫ ^൧ is the maximum viewing comfort range. It provides the maximal content disparity can reach. Beyond this range, all users will experience discomfort. [0056] Figure 7 shows an example of user disparities bounds that can be used with one or more embodiments of the invention. In one embodiment, Figure 7 illustrates _{the distribution of}

_{The left group of bars represent the histogram of} _^^^^(702A) and the right group of bars represent the histogram of ^_^^^^ (702B). [0057] With the given M views with disparity ^ =

the probability of that each pair is most watched can be estimated under the portfolio ^ . In one embodiment, the stereo pair index {(a, b)} is flattened to a 1-D index, k, where k = 0, …, N-1. For the i^th entry in ^, the optimal pair as

is determined using equation (10). The occurrence of each pair can be normalized as

where,{^ഥ^ _^ ^כ} represents the probability of watching under the distribution of user portfolio by given the current content disparity. R is mentioned earlier as the number of random samples. [0058] Reusing the user profile distribution shown in the previous section, the following table is an example for 4-view system (with 6 stereo pairs) with corresponding occurrence of each pair. ^{For example, and in one embodiment, Table 1.}

Table 1. Example Disparity Spans and Occurrence Probabilities [0059] The content creators can adjust the distribution by re-creating the content to modify the disparity. The content creators can also decide to add more views and reduce the views to meet different granularity of targeted users. [0060] Figure 8 shows, in a flow diagram, an example of optimizing Multiview video coding process 800 that can be used with one or more embodiments of the invention. In Figure 8, process 800 begins by receiving input for the optimizing the multiview video at block 802 (e.g., user profile distribution, multiple view metadata (e.g., ^^^ ^^{(^,^)} ^_^^^) and/or any other type of data for optimizing the multiview video. At block 804, process 800 determines the most watched view. In one embodiment, _{process 800 determines which view is the most watched view from the distribution {} ^ഥ _{^ ^} ^כ _}. Note each ^ҧ_^ ^כ represents a stereo pair (two views). For each k, process 800 finds the corresponding (a, b). For each view, process 800 sets a counter to accumulate all probabilities which cover that view. For example, and in one embodiment, for (k = 0; k < N; k++) find corresponding (a, b) from k ^_^ = ^_^ + ^ҧ_^ ^כ ^_^ = ^_^ + ^ҧ_^ ^כ end ^{[0061] In the previous example}

Table 2. Distributions for stereo pairs [0062] Utilizing the above, gives {^_^}

Table 3. Weights for the Different Views [0063] In one embodiment, process 800 finds the view with the highest weight ^^ூ = ^^^max {^ } ^ _^ (17) [0064] Process 800 assigns the highest weighted view as the I-view at block 806. The view ^^ூ with the highest weight will be selected as the I-view. Because this view will be watched most often, it makes sense to have this view with the best video quality. Without inter-view decoding dependency, this view can be quickly decoded and rendered. Following the previous example, we choose view 2 as I-view since it has highest weight. [0065] At block 808, process 800 selects one or more P-views. In one embodiment, process 800 computes one or more P-views. In one embodiment, assume the range of disparity affects the coding efficiency. A smaller range of disparity implies higher similarity of two views, thus the bit rate to encode the residual will be smaller. Although a P-view can have disparity vector from several decoded views in DPB, we assume a stereo pair having smaller disparity range has better coding efficiency since the residual is smaller. The disparity range is defined as the difference between the extreme disparity values:

[0066] Note that in one embodiment, other statistics can be used for

. For example, the standard deviation of the disparity, instead of the entire range can be used to compute the weights. [0067] Since the I-view is determined as view ^^ூ, process 800 next determines the main prediction direction for the rest of M-1 views. After finding the main prediction between views, a refinement to find the best reference views inside the DPB for each block can be done. To simplify the discussion, we only describe the main prediction part here. ^{[0068] Denote the already selected set as and to- ்}

^{be-selected set as ^ =} _{{0,1, … ,^ െ 1}\^ௌ. In one embodiment, process 800 collects a set of view in view a in} ^^ௌ and view b in ^^் .

Process 800 further assigns prediction path, by selecting the pair a and b having the smallest disparity range. Process 800 additionally assigns view ^^ௌ as P-view with prediction from view ^^ௌ. Process 800 updates the set, ^^ௌ = ^^ௌ ^ ^^ௌ (20) ^^் = ^^்\^^ௌ (21) [0069] Following the same example, the above algorithm results in:

Table 4. Disparity Ranges for Different Stereo Pairs Iter#1: x ^^ௌ = {2} and ^^் = {0,1,3} x ^^{(^,ଶ) (^,ଶ) (ଶ,ଷ) (} ^_^^^^ = 90, ^_^^^^^ = 31, ^_^^^^^ = 68. ^ ^{^,ଶ)} ^_^^^^ has the lowest value.

x View 1 is selected as P-view and predicted from view 2. Iter#2 ^^(ଶ,ଷ) ^_^^^^ = 68. ^^{(^,^)} ^_^^^^ has the lowest value.

x View 0 is selected as P-view and predicted from view 1. Iter#3 x_^ ^ௌ ₌ ^{ _0,1,2 ^} _{and ^} ^் ₌ ^{ ₃ ^} x ^^{(^,ଷ)} ^_^^^^ = 160,

= 115, ^^(ଶ,ଷ) ^_^^^^ = 68. ^^(ଶ,ଷ) ^_^^^^ has the lowest value. x (^^ௌ, ^^ௌ) = (2,3) x View 3 is selected as P-view and predicted from view 2. [0070] Figures 9A-B shows an example of P-views and B-views that are generated for multiview video that can be used with one or more embodiments of the invention. In Figure 9A, view 2 (902C) is the I-view, and view 0 (902A), view 1 (902B), and view 3 (902D) are the P-views. [0071] Process 800 can optionally generate B-views at block 810. In Figure 9B, view 2 (902C) is the I-view, view 0 (902A) and view 3 (902D) are the P-views, and view 1 (902B) is the B-view. [0072] As discussed above, the different stereo pairs affect where the sound is perceived. In Dolby AC-4, each spatial audio object has its own 3D coordinate (pos3D_x, pos3D_y, pos3D_z). The coordinates are defined in relation to a normalized room. The room consists of two adjacent normalized unit cubes to describe the playback room boundaries as shown in the following figure. The origin is defined to be the front left corner of the room at the height of the main screen. Location (0,5; 0; 0) corresponds to the middle of the screen. Figure 10 shows an example of a spatial audio object coordinate system 1000 that can be used with one or more embodiments of the invention. • x-axis: describes latitude, or left/right position: - x = 0 corresponds to left wall (1010); - x = 1 corresponds to right wall (1012). • y-axis: describes longitude, or front/back position: - y = 0 corresponds to front wall (1014); - y = 1 corresponds to back wall (1016). • z-axis: describes elevation, or up/down position: - z = 0 corresponds to a horizontal plane at the height of the main screen, surround, and rear surround loudspeakers (1004); - z = 1 corresponds to the ceiling (1002); - z = -1 corresponds to the floor (1006). [0073] The intuitive solution to utilize the sound object for immersive auditory experience is to let each stereo pair own its own sound object. However, it is not coding efficient since the content in each stereo pair do not differ too much, only the location changes. To provide higher coding efficiency (save bit rate), there should be one copy of sound object in the entire MV-HEVC bit stream and move the sound object location according to the selected stereo pair. [0074] Figure 11 shows, in a flow diagram, an example of a spatial audio object adjustment process 1100 that can be used with one or more embodiments of the invention. In Figure 11, process 1100 begins by receiving the input for process 1100 at block 1102. (e.g., stereo pairs, disparities, u se r input {^^^^{^}, ^_^ ^{^} , _^ ^{^} ^, _^^ ^{^}}). [0075] Process 1100 performs a loop over the different stereo pairs to adjust the audio location (blocks 1104-1116). In one embodiment, there are different stereo pairs to maximize the perceived depth while maintaining the viewing comfort, the viewing direction and center of camera vary according to the selected stereo pair. Besides, each user has different IPD and each device has different optical and device parameters, which causes the perceived depth differently. In this embodiment, process1100 will modify the perceived audio to correctly place them along with the position and perceived depth. [0076] For each selected stereo pair, process 1100 can specify the camera center coordinate and the new coordinate unit axis

at block 1106. In one embodiment, the 3D coordinate of each sound object with location ^ can be recomputed in the new 3D coordinate based on its original 3D location and new camera unit axis for the stereo pair, together with the perceived depth adjustment. [0077] At block 1108, process 1100 moves the coordinate center (0.5,0.5,0) to camera center ^^{(^,^)} and computes the difference vector between new coordinate centered at

the sound object: ^^ᇱ = ^ െ ^^{(^,^)} (22) [0078] Process 1100 uses inner product <,> between difference vector, ^^ᇱ, and new coordinate axis vector, ^ ^{(^,^)} ௫ , ^^{(^,^)} ௬ , ^^{(^,^)} ௭ , to find the new coordinate in new coordinate system at block 1110:

[0079] At block 1112, process 1100 adjusts sound object location according to the perceived depth. With the reference { ^^^^{^}, ^_^ ^{^} , _^ ^{^} ^, _^^ ^{^}} and disparity ^^{^}, which is used to create the sound object location, and the current { ^^^^{^}, ^^{^} , ^{^}

} and disparity ^^{^}, process 1100 can compute the ratio between the perceived depth and the reference depth. Then, process 1100 uses the ratio to scale the coordinate (^^_௫ , ^^_௬ , ^^_௭). [0080] With the knowledge of disparity in reference system and current sound object, the perceived depth can be represented as follows:

[0081] This results in the perceived depth ratio as

[0082] Process 1100 further applies this ratio to scale (^^_௫ , ^^_௬ , ^^_௭)as

[0083] Process 1100 shifts back the origin of the final coordinate to (0.5, 0.5, 0) such that process 1100 can utilize the following audio rendering pipeline at block 1114. (^^_௫ + 0.5, ^^_௬ + 0.5, ^^_௭) (30) [0084] The processing loop ends and block 1116. [0085] In one embodiment, from above discussion, a summary of the metadata that can be used is in the list below: x Number of stereo pairs x Reference { ^^^, ^_^, _^^, _^^} that is used to create the location of sound object. x For each valid stereo pair (a, b), scene-based metadata o Specify which view is left view (a), and which view is right view (b). o Content disparity

o Camera center coordinate ^ and the new coordinate unit axis ^^{(^,^)}

௫ , ^_௬ , ^^{(^,^)} ௭ . o For each sound object ^ Disparity ^^{^} in reference environment [0086] An example of the metadata HMD_AV_info SEI is shown below. The SEI can be specified on scene basis.

Table 5. HMD_AV_info metadata.

Table 6. fp_rep_info_element metadata. [0087] In one embodiment, the fp_rep_info_element(OutSign, OutExp, OutMantissa, OutManLen) syntax structure sets the values of OutSign, OutExp, OutMantissa and OutManLen variables that represent a floating-point value. When the syntax structure is included in another syntax structure, the variable names OutSign, OutExp, OutMantissa and OutManLen are to be interpreted as being replaced by the variable names used when the syntax structure is included.

equal to 0 indicates that the sign of the floating-point value is positive. fp_sign_flag equal to 1 indicates that the sign is negative. The variable OutSign is set equal to fp_sign_flag. fp_exponent specifies the exponent of the floating-point value. The value of fp_exponent shall be in the range of 0 to 2⁷ í 2, inclusive. The value 2⁷ í 1 is reserved for future use by ITU-T | ISO/IEC. Decoders shall treat the value 2⁷ í 1 as indicating an unspecified value. The variable OutExp is set equal to fp_exponent. fp_mantissa_len_minus1 plus 1 specifies the number of bits in the fp_mantissa syntax element. The variable OutManLen is set equal to fp_mantissa_len_minus1 + 1. fp_mantissa specifies the mantissa of the floating-point value. The variable OutMantissa is set equal to fp_mantissa. For HMD_AV__info SEI, the floating points are specified by fp_rep_info_element(OutSign, OutExp, OutMantissa, OutManLen) syntax structure. num_views_minus1 plus 1 specifies number of views in the Multiview system. num_stereo_pairs_minus1 plus 1 specifies number of valid stereo pairs in the Multiview system. left_view_id[ i ] specifies the view id for the left view in the ith stereo pair. right_view_id[ i ] specifies the view id for the right view in the ith stereo pair. num_sound_obj_minus1[ i ] plus 1 specifies the number of objects in the reference enviroment. [0088] Figure 12 shows an example of a data processing system 1200 that can be used by or in a camera or other device to provide one or more embodiments described herein. The systems and methods described herein can be implemented in a variety of different data processing systems and devices, including general-purpose computer systems, special purpose computer systems, or a hybrid of general purpose and special purpose computer systems. Data processing systems that can use any one of the methods described herein include a camera, a smartphone, a set top box, a computer, such as a laptop or tablet computer, embedded devices, game systems, and consumer electronic devices, etc., or other electronic devices. [0089] Figure 12 is a block diagram of data processing system 1200 hardware according to an embodiment. Note that while Figure 12 illustrates the various components of a data processing system, it is not intended to represent any particular architecture or manner of interconnecting the components as such details are not germane to the present invention. It will also be appreciated that other types of data processing systems that have fewer components than shown or more components than shown in Figure 12 can also be used with one or more embodiments of the present invention. [0090] As shown in Figure 12, the data processing system 1200 includes one or more buses 1209 that serve to interconnect the various components of the system. The system in Figure 12 can include a camera or be coupled to a camera. One or more processing devices 1203 are coupled to the one or more buses 1209 as is known in the art. Memory 1205 may be DRAM or non-volatile RAM or may be flash memory or other types of memory or a combination of such memory devices. This memory is coupled to the one or more buses 1209 using techniques known in the art. The data processing system can also include non-volatile memory 1207, which may be a hard disk drive or a flash memory or a magnetic optical drive or magnetic memory or an optical drive or other types of memory systems that maintain data even after power is removed from the system. The non-volatile memory 1207 and the memory 1205 are both coupled to the one or more buses 1209 using known interfaces and connection techniques. A display controller 1221 is coupled to the one or more buses 1209 in order to receive display data to be displayed on a display device which can be one of displays. The data processing system 1200 can also include one or more input/output (I/0) controllers 1215 which provide interfaces for one or more I/0 devices, such as one or more cameras, touch screens, ambient light sensors, and other input devices including those known in the art and output devices (e.g., speakers). The input/output devices 1217 are coupled through one or more I/0 controllers 1215 as is known in the art. The ambient light sensors can be integrated into the system in Figure 12. [0091] While Figure 12 shows that the non-volatile memory 1207 and the memory 1205 are coupled to the one or more buses directly rather than through a network interface, it will be appreciated that the present invention can utilize non-volatile memory that is remote from the system, such as a network storage device which is coupled to the data processing system through a network interface such as a modem or Ethernet interface. The buses 1209 can be connected to each other through various bridges, controllers and/or adapters as is well known in the art. In one embodiment the I/0 controller 1215 includes one or more of a USB (Universal Serial Bus) adapter for controlling USB peripherals, an IEEE 1394 controller for IEEE 1394 compliant peripherals, or a Thunderbolt controller for controlling Thunderbolt peripherals. In one embodiment, one or more network device(s) 1225 can be coupled to the bus(es) 1209. The network device(s) 1225 can be wired network devices (e.g., Ethernet) or wireless network devices (e.g., Wi-Fi, Bluetooth) that receive images from a camera, etc. [0092] Although separate embodiments are enumerated below, it will be appreciated that these embodiments can be combined or modified, in whole or in part, into various different combinations. The combinations of these embodiments can be any one of all possible combinations of the separate embodiments. [0093] Embodiment 1 is a method comprising: receiving a video source with multiple views; computing disparity statistics for a plurality of stereo pairs of the multiple views of the video source, wherein the number of multiple views is greater than or equal to three; encoding the video source into an encoded video stream using the disparity statistics, wherein the encoded video stream includes the plurality of stereo pairs; and transmitting the encoded video stream. [0094] Embodiment 2 is a method of embodiment 1 further comprising computing metadata associated with the encoded video stream. [0095] Embodiment 3 is a method of embodiment 1 or 2, wherein the disparity statistics are a set of disparities between corresponding views in a stereo pair and a disparity is pixel coordinate difference between one point in the left view and the corresponding point in the right view. [0096] Embodiment 4 is a method of any one of embodiments 1 to 3, wherein the computing of the disparity statistics includes computing disparity statistics for each stereo pair of the plurality of stereo pairs. [0097] Embodiment 5 is a method of any one of embodiments 1 to 4, further comprising: computing a viewing distribution for each of the plurality of stereo pairs. [0098] Embodiment 6 is a method of embodiment 5, wherein the computing of the viewing distribution includes computing a viewing distribution for each stereo pair of the multiple views. [0099] Embodiment 7 is a method of any one of embodiments 1 to 6, further comprising: determining if a viewing distribution meets a target distribution; and receiving a new set of multiple views for the video source when the viewing distribution does not meet the target distribution [0100] Embodiment 8 is a method of embodiment 5, wherein the computing of the viewing distribution for the stereo pair comprises: computing the viewing distribution using a user profile according to a distribution of at least one of, interpulmonary distance, viewing distance, and a display parameter [0101] Embodiment 9 is a method of any one of embodiments 1 to 8, wherein the video stream is a Multiview High Efficiency Vi de o Coding video stream. [0102] Embodiment 10 is a method of any one of embodiments 1 to 9, wherein the encoding of the video stream comprises: determining a most watched view of the multiple view using the viewing distribution; and selecting an I-view from the viewing distribution by, computing a weight for each view from the viewing distribution, and selecting a view from the multiple views with the highest weight. [0103] Embodiment 11 is a method of embodiment 10, further comprising: selecting a P-view by determining a subset of the multiple views that to be selected views, computing a disparity range in the subset using the viewing distribution., and selecting the view in the subset with the lowest disparity range. [0104] Embodiment 12 is a method of embodiment 10, further comprising: computing a plurality of P-views from the viewing distribution; and computing a B-view from the plurality of P-views. [0105] Embodiment 13 is a method of any one of embodiments 1 to 12, further comprising: adjusting an audio object for the stereo pair using the disparity statistics. [0106] Embodiment 14 is a method of embodiment 13, wherein the adjusting of the audio object further comprises: compute location of the audio location in a new coordinate system; compute a difference vector between a new coordinate and the audio location; finding new coordinates for the audio object; and adjust audio object location. [0107] Embodiment 15 is a method of any one of embodiments 1 to 14, further comprising: determining if the encoded video stream meets a target bit rate; and receiving a new set of multiple views for the video source when the encoded video stream does not meet the target bit rate. [0108] Embodiment 16 is a method of any one of embodiments 1 to 15, further comprising: measuring a viewer head rotation theta; and modifying an interpulmonary distance (IPD) by computing the horizontal component of a head rotation IPD * cos(theta).

[0109] Embodiment 17 is a method of any one of embodiments 1 to 16, further comprising: selecting two or more viewpoints nearest to a desired mterpulmonary' distance (IPD); and interpolating a novel view using two or more viewpoints.

[0110] Embodiment 18 is an apparatus comprising a processing system and memory' and configured to perform any one of the methods in embodiments 1-17.

[0111] Embodiment 19 is a non-transitory machine-readable storage storing executable program instructions which when executed by a machine cause the machine to perform any one of the methods of embodiments 1-17.

[0112] It will be apparent from this description that one or more embodiments of the present invention may be embodied, at least in part, in software. That is, the techniques may be carried out in a data processing system in response to its one or more processors executing a sequence of instructions contained in a storage medium, such as a non-transitory' machine- readable storage medium (e.g., DRAM or flash memory), i various embodiments, hardwired circuitry may be used in combination with software instructions to implement the present invention. Thus, the techniques are not limited to any specific combination of hardware circuitry and software, or to any particular source for the instructions executed by the data processing system.

[0113] In the foregoing specification, specific exemplary embodiments have been described. It will be evident that various modifications may be made to those embodiments without departing from the broader spirit and scope set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.

Claims

modifying an interpulmonary distance (IPD) by computing the horizontal component of a head rotation IPD * cos(theta). [0109] Embodiment 17 is a method of any one of embodiments 1 to 16, further comprising: selecting two or more viewpoints nearest to a desired interpulmonary distance (IPD); and interpolating a novel view using two or more viewpoints. [0110] Embodiment 18 is an apparatus comprising a processing system and memory and configured to perform any one of the methods in embodiments 1-17. [0111] Embodiment 19 is a non-transitory machine-readable storage storing executable program instructions which when executed by a machine cause the machine to perform any one of the methods of embodiments 1-17. [0112] It will be apparent from this description that one or more embodiments of the present invention may be embodied, at least in part, in software. That is, the techniques may be carried out in a data processing system in response to its one or more processors executing a sequence of instructions contained in a storage medium, such as a non-transitory machine- readable storage medium (e.g., DRAM or flash memory). In various embodiments, hardwired circuitry may be used in combination with software instructions to implement the present invention. Thus, the techniques are not limited to any specific combination of hardware circuitry and software, or to any particular source for the instructions executed by the data processing system. [0113] In the foregoing specification, specific exemplary embodiments have been described. It will be evident that various modifications may be made to those embodiments without departing from the broader spirit and scope set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.

CLAIMS 1. A method comprising: receiving a video source with multiple views; computing disparity statistics for each stereo pair of a plurality of stereo pairs of the multiple views of the video source, wherein the number of multiple views is greater than or equal to three, wherein the disparity statistics comprise a set of disparities between corresponding views in a stereo pair comprising a left view and a right view, wherein a disparity comprises a pixel coordinate difference between one point in the left view and the corresponding point in the right view; encoding the video source into an encoded video stream using a coding structure, wherein the coding structure is determined using the disparity statistics, wherein the encoded video stream includes the plurality of stereo pairs and metadata describing each stereo pair; and transmitting the encoded video stream. 2. The method of claim 1, further comprising: computing the metadata associated with the encoded video stream. 3. The method of claims 1 or 2, further comprising: computing a viewing distribution for each of the plurality of stereo pairs. 4. The method of claim 3, wherein the computing of the viewing distribution includes computing a viewing distribution for each stereo pair of the multiple views. 5. The method of any one of claims 1 to 4, further comprising: determining if a viewing distribution meets a target distribution; and receiving a new set of multiple views for the video source when the viewing distribution does not meet the target distribution. 6. The method of claim 3, wherein the computing of the viewing distribution for the each stereo pair comprises: computing the viewing distribution using a user profile according to a distribution of at least one of, interpupillary distance, viewing distance, and a display parameter. 7. The method of any one of claims 1 to 6, wherein the encoded video stream is a Multiview High Efficiency Video Coding video stream. 8. The method of any one of claims 1 to 7, wherein the encoding of the video stream comprises: determining a most watched view of the multiple view using a viewing distribution; and selecting an intra-view (I-view) from the viewing distribution by, computing a weight for each view from the viewing distribution, and selecting a view from the multiple views with the highest weight. 9. The method of claim 8, further comprising: selecting a predicted view (P-view) by determining a subset of the multiple views that to be selected views, computing a disparity range in the subset using the viewing distribution, and selecting the view in the subset with the lowest disparity range. 10. The method of claim 8, further comprising: computing a plurality of P-views from the viewing distribution; and computing a bidirectional predicted view (B-view) from the plurality of P-views. 11. The method of any one of claims 1 to 10, further comprising: adjusting an audio object for the stereo pair using the disparity statistics. 12. The method of claim 11, wherein the adjusting of the audio object further comprises: computing audio location in a new coordinate system;