HK1261365A1 - System and method for optimizing loudness and dynamic range across different playback devices
- Publication number
- HK1261365A1 (application No. HK19121333.9A)
- Authority
- HK
- Hong Kong
- Prior art keywords
- audio
- loudness
- metadata
- bitstream
- data
Description
The present application is a divisional of the patent application having application number 201480005314.9, filed January 15, 2014, entitled "System and method for optimizing loudness and dynamic range between different playback devices".
Cross Reference to Related Applications
This application claims priority from: U.S. Provisional Application No. 61/754,882, filed January 21, 2013; U.S. Provisional Application No. 61/809,250, filed April 5, 2013; and U.S. Provisional Application No. 61/824,010, filed May 16, 2013, all of which are incorporated herein by reference in their entirety.
Technical Field
One or more embodiments relate generally to audio signal processing and, more particularly, to processing audio data bitstreams with metadata that indicates loudness and dynamic range characteristics of audio content based on playback environment and device.
Background
The subject matter discussed in the background section should not be assumed to be prior art merely because it is referred to in this section. Similarly, the problems mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been recognized in any prior art. The subject matter in the background section merely represents different approaches that may themselves be inventions.
The dynamic range of an audio signal is typically the ratio between the largest and smallest possible values of the sound embodied in the signal, and is usually measured in decibels (a base-10 logarithmic measure). In many audio processing systems, dynamic range control (or dynamic range compression, DRC) is used to reduce loud sound levels and/or amplify quiet sound levels, so that wide dynamic range source content fits within the narrower recorded dynamic range that can be more easily stored and reproduced using electronic devices. For audiovisual (AV) content, the dialog reference level may be used to define the "zero" point for compression by the DRC mechanism: DRC boosts content below the dialog reference level and cuts content above that reference level.
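As a hedged illustration of this boost/cut behavior, the following Python sketch computes a static DRC gain around the dialog reference level; the ratio values and function names are assumptions for the example, not values from any standardized DRC profile.

```python
def drc_gain_db(level_db: float, dialog_ref_db: float = -31.0,
                boost_ratio: float = 0.5, cut_ratio: float = 0.5) -> float:
    """Static DRC curve with the dialog reference level as the 'zero' point:
    content below the reference is boosted toward it, content above is cut
    toward it. The ratios are illustrative, not from any standard profile."""
    offset_db = level_db - dialog_ref_db
    if offset_db < 0:
        return -offset_db * boost_ratio   # quiet content: positive gain (boost)
    return -offset_db * cut_ratio         # loud content: negative gain (cut)

print(drc_gain_db(-41.0))  # 10 dB below reference -> +5.0 dB boost
print(drc_gain_db(-21.0))  # 10 dB above reference -> -5.0 dB cut
```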
In known audio coding systems, metadata associated with an audio signal is used to set the DRC level based on the type of content and its intended use. A DRC mode sets the amount of compression applied to the audio signal and defines the output reference level of the decoder. Such systems may be limited to two DRC level settings programmed into the encoder and selected by the user. For example, conventionally, a dialog normalization value of -31 dB is used for content played back on AVR or full dynamic range devices, while a dialog normalization value of -20 dB is used for content played back on television sets or similar devices. This type of system allows a single audio bitstream to serve two common, but vastly different, playback scenarios by using two different sets of DRC metadata. However, such systems are limited to the pre-set dialog normalization values, and are not optimized for playback in the wide variety of playback devices and listening environments made possible by digital media and internet-based streaming technologies.
In current metadata-based audio coding systems, an audio data stream may include audio content (e.g., one or more channels of audio content) and metadata indicative of at least one characteristic of the audio content. For example, in an AC-3 bitstream, there are several audio metadata parameters that are specifically expected to change the sound of a program delivered to a listening environment. One of the metadata parameters is a dialog normalization parameter that indicates the average loudness level of dialogs occurring in the audio program (or the average loudness of the content) and is used to determine the audio playback signal level.
During playback of a bitstream containing a sequence of different audio program segments, each having a different dialog normalization parameter, an AC-3 decoder uses the dialog normalization parameter of each segment to perform loudness processing that modifies the playback level or loudness of the segment so that the perceived loudness of its dialog is at a consistent level. Each encoded audio item in a sequence of encoded audio segments (items) will typically have a different dialog normalization parameter, and the decoder will scale the level of each item so that the playback level or loudness of the dialog for each item is the same or very similar, though this may require different amounts of gain to be applied to the different items during playback.
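A minimal sketch of this leveling arithmetic follows; the target value and function name are illustrative assumptions, not normative.

```python
def dialnorm_attenuation_db(dialnorm_db: float, target_db: float = -31.0) -> float:
    """Gain that moves a segment's indicated dialog loudness to the target
    playback level. Values are illustrative."""
    return target_db - dialnorm_db

# Consecutive program items with different dialnorm values receive different
# gains so that dialog plays back at a consistent level:
for dialnorm in (-31.0, -24.0, -27.0):
    print(dialnorm, "->", dialnorm_attenuation_db(dialnorm), "dB")
# -31.0 -> 0.0 dB, -24.0 -> -7.0 dB, -27.0 -> -4.0 dB
```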
In some embodiments, the dialog normalization parameter is set by the user rather than being generated automatically, although a default dialog normalization value exists if the user does not set one. For example, the content creator may make a loudness measurement with a device external to the AC-3 encoder and then pass the result (indicating the loudness of the spoken dialog of the audio program) to the encoder to set the dialog normalization value. Correct setting of the dialog normalization parameter therefore depends on the content creator.
There are several reasons why the dialog normalization parameter in an AC-3 bitstream may be incorrect. First, each AC-3 encoder has a default dialog normalization value that is used during bitstream generation if no dialog normalization value is set by the content creator. This default value may be substantially different from the actual dialog loudness level of the audio. Second, even if the content creator measures loudness and sets the dialog normalization value accordingly, a loudness measurement algorithm or meter that does not conform to the recommended loudness measurement method may have been used, resulting in an incorrect dialog normalization value. Third, even if an AC-3 bitstream has been created with a dialog normalization value correctly measured and set by the content creator, the value may have been changed to an incorrect one by an intermediate module during transmission and/or storage of the bitstream. For example, in television broadcast applications, it is common for an AC-3 bitstream to be decoded, modified, and then re-encoded using incorrect dialog normalization metadata. Thus, the dialog normalization value included in an AC-3 bitstream may be incorrect or inaccurate, and may therefore adversely affect the quality of the listening experience.
Furthermore, the dialog normalization parameter does not indicate the loudness processing state of the corresponding audio data (e.g., the types of loudness processing that have been performed on the audio data). Additionally, currently deployed loudness and DRC systems, such as those in Dolby Digital (DD) and Dolby Digital Plus (DD+) systems, were designed to present AV content in a consumer's living room or in a theater. To make such content suitable for playback in other environments and on other listening devices (e.g., mobile devices), post-processing must be applied "blindly" in the playback device to fit the AV content to that listening environment. In other words, the post-processor (or decoder) assumes that the loudness level of the received content is at a particular level (e.g., -31 dB or -20 dB), and sets that level to a predetermined fixed target level appropriate for the particular device. If the assumed loudness level or the predetermined target level is incorrect, the post-processing may have the opposite of its intended effect; i.e., it may make the output audio less desirable than the user expects.
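The following worked example, with purely illustrative numbers, shows how a wrong assumption turns such a "blind" correction into a degradation:

```python
# Illustrative numbers only: a "blind" post-processor assumes the incoming
# content sits at -31 dB and applies a fixed gain toward its device target.
assumed_level_db = -31.0
device_target_db = -16.0
applied_gain_db = device_target_db - assumed_level_db   # +15 dB

actual_level_db = -20.0                                 # true content loudness
resulting_level_db = actual_level_db + applied_gain_db  # -5 dB
# The output lands 11 dB above the intended -16 dB target, so the
# "correction" makes the audio worse rather than better.
```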
The disclosed embodiments are not limited to use with AC-3 bitstreams, E-AC-3 bitstreams, or Dolby E bitstreams, but for convenience such bitstreams will be discussed in connection with systems that include loudness processing state metadata. Dolby, Dolby Digital, Dolby Digital Plus, and Dolby E are trademarks of Dolby Laboratories Licensing Corporation; Dolby Laboratories provides proprietary implementations of AC-3 and E-AC-3 known as Dolby Digital and Dolby Digital Plus, respectively.
Disclosure of Invention
Embodiments relate to a method for decoding audio data that receives a bitstream containing metadata associated with the audio data and analyzes the metadata in the bitstream to determine whether loudness parameters for a first group of audio playback devices are available in the bitstream. In response to determining that the parameters exist for the first group, a processing component renders audio using the parameters and the audio data. In response to determining that the parameters do not exist for the first group, the processing component analyzes one or more characteristics of the first group and determines the parameters based on the one or more characteristics. The method may render the audio using the parameters and the audio data by transmitting them to a downstream module that renders the audio for playback, or by rendering the audio data directly based on the parameters.
In one embodiment, the method further includes determining an output device that will present the received audio stream, and determining whether the output device belongs to the first group of audio playback devices; the step of analyzing the metadata in the stream to determine whether the loudness parameters for the first group of audio playback devices are available is performed after the step of determining that the output device belongs to the first group. In one embodiment, the step of determining that the output device belongs to the first group of audio playback devices comprises: receiving, from a module connected to the output device, an indication of the identity of the output device or of a group of devices including the output device, and determining whether the output device belongs to the first group of audio playback devices based on the received indication.
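The following Python sketch illustrates this parameter-resolution flow under stated assumptions; none of the field names, device groups, or derivation formulas come from the specification.

```python
def resolve_loudness_params(bitstream_metadata: dict, device_group: str) -> dict:
    """Hedged sketch of the decision flow described above: prefer loudness
    parameters carried in the bitstream for the device group, otherwise
    derive them from the group's characteristics. All names, fields, and
    values are illustrative assumptions, not normative."""
    params = bitstream_metadata.get(device_group)
    if params is not None:
        return params                        # parameters present: render with them

    # Parameters absent: analyze characteristics of the device group instead.
    traits = {
        "mobile":  {"max_spl_db": 95,  "noise_floor_db": 45},
        "theater": {"max_spl_db": 120, "noise_floor_db": 25},
    }[device_group]
    headroom_db = traits["max_spl_db"] - traits["noise_floor_db"]
    return {
        "target_loudness_db": -31.0 + max(0.0, 60.0 - headroom_db),
        "dynamic_range_db": min(headroom_db, 60.0),
    }

metadata = {"theater": {"target_loudness_db": -31.0, "dynamic_range_db": 60.0}}
print(resolve_loudness_params(metadata, "theater"))   # uses transmitted parameters
print(resolve_loudness_params(metadata, "mobile"))    # derives from characteristics
```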
Embodiments further relate to an apparatus or system comprising a processing component that performs the actions described in the method embodiments above.
Embodiments further relate to an audio data decoding method that receives audio data and metadata associated with the audio data, analyzes the metadata in a bitstream to determine whether loudness information associated with loudness parameters of a first group of audio devices is available in the bitstream, and in response to determining that the loudness information is present for the first group, determines loudness information from the bitstream and transmits the audio data and loudness information for rendering audio, or if the loudness information is not present for the first group, determines loudness information associated with an output profile and transmits the determined loudness information for the output profile for rendering audio. In one embodiment, the step of determining loudness information associated with the output profile may further comprise analyzing characteristics of the output profile, determining parameters based on the characteristics, and transmitting the determined loudness information comprises transmitting the determined parameters. The loudness information may include loudness parameters of the output profile or characteristics of the output profile. In one embodiment, the method may further include determining a low bit rate encoded stream to be transmitted, wherein the loudness information includes characteristics of one or more output profiles.
Embodiments further relate to an apparatus or system comprising a processing component to perform the actions described above in the decoding method embodiments.
Drawings
In the following drawings like reference numerals are used to indicate like elements. Although the following figures depict various examples, the implementations described herein are not limited to the examples depicted in the figures.
Fig. 1 is a block diagram of an embodiment of an audio processing system configured to perform optimization of loudness and dynamic range, in accordance with some embodiments;
Fig. 2 is a block diagram of an encoder used in the system of Fig. 1, in accordance with some embodiments.
Fig. 3 is a block diagram of a decoder used in the system of fig. 1, in accordance with some embodiments.
Fig. 4 is an illustration of an AC-3 frame, including the segments into which it is divided.
Fig. 5 is an illustration of the Synchronization Information (SI) segment of an AC-3 frame, including the segments into which it is divided.
Fig. 6 is an illustration of the Bitstream Information (BSI) segment of an AC-3 frame, including the segments into which it is divided.
Fig. 7 is an illustration of an E-AC-3 frame, including the segments into which it is divided.
Fig. 8 is a table illustrating the format of certain frames and metadata of an encoded bitstream according to some embodiments.
Fig. 9 is a table illustrating a format of loudness processing state metadata according to some embodiments.
Fig. 10 is a more detailed block diagram of the audio processing system of fig. 1 that may be configured to perform loudness and dynamic range optimization in accordance with some embodiments.
Fig. 11 is a table showing different dynamic range requirements for various playback devices and background listening environments in an exemplary use case.
Fig. 12 is a block diagram of a dynamic range optimization system, according to an embodiment.
Fig. 13 is a block diagram of an interface between different profiles for various different playback device categories, according to some embodiments.
Fig. 14 is a table illustrating the correlation between long-term loudness and short-term dynamic range of various defined profiles, according to an embodiment.
Fig. 15 shows an example of loudness profiles for different types of audio content according to an embodiment.
Fig. 16 is a flow diagram illustrating a method of optimizing loudness and dynamic range between a playback device and an application, according to an embodiment.
Detailed Description
Definitions and nomenclature
In the context of the present disclosure, including in the claims, the expression "performing an operation on a signal or data" (e.g., filtering, scaling, transforming, or applying a gain to the signal or data) is used in a broad sense to denote performing the operation directly on the signal or data, or on a processed version of the signal or data (e.g., a version of the signal that has undergone preliminary filtering or pre-processing prior to the operation). The expression "system" is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem implementing a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X-M inputs are received from an external source) may also be referred to as a decoder system. The term "processor" is used in a broad sense to denote a system or device that is programmable or otherwise configurable (e.g., with software or firmware) to perform operations on data (e.g., audio, video, or other image data). Examples of processors include field-programmable gate arrays (or other programmable integrated circuits or chip sets), digital signal processors programmed and/or otherwise configured to perform pipelined processing on audio or other sound data, programmable general-purpose processors or computers, and programmable microprocessor chips or chip sets.
The expressions "audio processor" and "audio processing unit" are used interchangeably and broadly indicate a system configured to process audio data. Examples of audio processing units include, but are not limited to, encoders (e.g., transcoders), decoders, codecs, pre-processing systems, post-processing systems, and bitstream processing systems (sometimes referred to as bitstream processing tools). The expression "processing state metadata" (e.g. in the expression "loudness processing state metadata") refers to separate and distinct data from corresponding audio data (audio content of an audio data stream also including processing state metadata). The processing state metadata is associated with the audio data, indicates a loudness processing state of the corresponding audio data (e.g., what type of processing has been performed on the audio data), and optionally also indicates at least one feature or characteristic of the audio data. In some embodiments, the association of the processing state metadata with the audio data is time synchronized. Thus, the current (newly received or updated) processing state metadata indicates that the corresponding audio data simultaneously includes the results of the indicated type of audio data processing. In some cases, the processing state metadata may include some or all of the parameters used in and/or derived from the processing history and/or indicated type of processing. Additionally, the processing state metadata may include at least one feature or characteristic of the corresponding audio data that has been calculated or extracted from the audio data. The processing state metadata may also include other metadata unrelated to or not derived from any processing of the corresponding audio data. For example, third party data, tracking information, identifiers, proprietary or standard information, user annotation data, user preference data, and the like may be added by a particular audio processing unit for communication to other audio processing units.
The expression "loudness processing state metadata" (or "LPSM") indicates processing state metadata that indicates a loudness processing state of the corresponding audio data (e.g., what type of processing has been performed on the audio data), and optionally also indicates at least one feature or characteristic (e.g., loudness) of the corresponding audio data. The loudness processing state metadata may include data that is not loudness processing state metadata (e.g., when considered separately). The terms "coupled" or "coupled" are used to indicate either a direct or indirect connection.
Systems and methods are described for an audio encoder/decoder that non-destructively normalizes the loudness and dynamic range of audio between various devices that require or use different target loudness values and have different dynamic range capabilities. Methods and functional components according to some embodiments send information about audio content from an encoder to a decoder for one or more device profiles. The device profile specifies a desired target loudness and dynamic range for one or more devices. The system is extensible so that new device profiles with different "nominal" loudness targets can be supported.
In one embodiment, the system generates the appropriate gains either in the encoder, based on loudness control and dynamic range requirements, or in the decoder under control of the encoder, with the original gains parameterized to reduce the data rate. The dynamic range system includes two mechanisms for implementing loudness control: an artistic dynamic range profile, which gives the content creator control over how the audio will be played back; and a separate protection mechanism, which ensures that overload does not occur for the various playback profiles. The system is also configured to allow the loudness and dynamic range gains and/or profiles to be correctly controlled using other (internal or external) metadata parameters. The decoder is configured to support n-channel auxiliary inputs that affect decoder-side loudness and dynamic range.
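As a hedged sketch of what such a device profile might carry, the structure below pairs a target loudness with dynamic range limits and a protection flag; all names and values are assumptions for illustration, not identifiers from the specification.

```python
from dataclasses import dataclass

@dataclass
class DeviceProfile:
    """Illustrative device profile pairing a target loudness with dynamic
    range limits, as described above. Field names and values are assumptions
    for this sketch."""
    name: str
    target_loudness_db: float   # desired long-term loudness for the device class
    max_boost_db: float         # artistic DRC: how far quiet content may rise
    max_cut_db: float           # artistic DRC: how far loud content may fall
    clip_protect: bool          # separate mechanism guarding against overload

PROFILES = [
    DeviceProfile("full_range_avr", -31.0, 6.0, 12.0, False),
    DeviceProfile("television",     -20.0, 9.0, 12.0, True),
    DeviceProfile("mobile_speaker", -11.0, 12.0, 18.0, True),
]
# The system is extensible: supporting a new "nominal" loudness target is a
# matter of appending another profile rather than changing the bitstream.
```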
In some embodiments, Loudness Processing State Metadata (LPSM) is embedded in one or more reserved fields (or slots) of a metadata segment of an audio bitstream that also includes audio data in other segments (audio data segments). For example, at least one segment of each frame of the bitstream includes LPSM, and at least one other segment of the frame includes corresponding audio data (i.e., audio data whose loudness processing state and loudness are indicated by LPSM). In some embodiments, the data volume of the LPSM may be small enough to be conveyed without affecting the bit rate allocated for conveying the audio data.
Delivering loudness processing state metadata in an audio data processing chain is particularly useful in situations where two or more audio processing units need to work in series with each other in the processing chain (or in the content lifecycle). Media processing problems such as quality, level and spatial degradation may occur where loudness processing state metadata is not included in the audio bitstream, such as may occur when two or more audio codecs are applied in the chain and single-ended volume adjustments are applied more than once during transmission of the bitstream to a media consumption device (or a presentation point of the audio content of the bitstream).
Loudness and dynamic range metadata processing system
Fig. 1 is a block diagram of one embodiment of an audio processing system that may be configured to perform optimization of loudness and dynamic range using certain metadata processing (e.g., pre-processing and post-processing) components. Fig. 1 illustrates an exemplary audio processing chain (audio data processing system) in which one or more of the elements of the system may be configured in accordance with an embodiment of the present invention. The system 10 of Fig. 1 includes the following elements, coupled together as shown: a pre-processing unit 12, an encoder 14, a signal analysis and metadata correction unit 16, a transcoder 18, a decoder 20, and a post-processing unit 22. In variants of the system shown, one or more of the elements are omitted, or additional audio data processing units are included. For example, in one embodiment, the post-processing unit 22 is part of the decoder 20 rather than a separate unit.
In some implementations, the pre-processing unit of fig. 1 is configured to accept PCM (time domain) samples containing audio content as input, and output processed PCM samples. Encoder 14 may be configured to accept PCM samples as input and output an encoded (e.g., compressed) audio bitstream indicative of audio content. Data indicative of a bitstream of audio content is sometimes referred to herein as "audio data". In some embodiments, the audio bitstream output from the encoder includes loudness processing state metadata (and optionally other metadata) and audio data.
The signal analysis and metadata correction unit 16 may accept one or more encoded audio bitstreams as input and determine (e.g., verify) whether the processing state metadata in each encoded audio bitstream is correct by performing signal analysis. In some embodiments, verification may be performed by a state verifier component (such as element 102 shown in FIG. 2), and one such verification technique is described below in the context of state verifier 102. In some embodiments, unit 16 is included in the encoder, and the verification is performed by unit 16 or verifier 102. If the signal analysis and metadata correction unit finds that the contained metadata is invalid, the metadata correction unit 16 performs a signal analysis to determine the correct value and replaces the incorrect value with the determined correct value. Thus, each encoded audio bitstream output from the signal analysis and metadata correction unit may include corrected processing state metadata as well as encoded audio data. The signal analysis and metadata correction unit 16 may be part of the pre-processing unit 12, the encoder 14, the transcoder 18, the decoder 20, or the post-processing unit 22. Alternatively, the signal analysis and metadata correction unit 16 may be a separate unit or part of another unit in the audio processing chain.
Transcoder 18 may accept an encoded audio bitstream as input and, in response, output a modified (or differently encoded) audio bitstream (e.g., by decoding the input stream and re-encoding the decoded stream in a different encoding format). The audio bitstream output from the transcoder includes loudness processing state metadata (and optionally other metadata) and encoded audio data; this metadata was already included in the input bitstream.
The decoder 20 of fig. 1 may accept as input an encoded (e.g., compressed) audio bitstream and output (in response) a stream of decoded PCM audio samples. In one embodiment, the output of the decoder is or includes any of: a stream of audio samples and a corresponding stream of loudness processing state metadata (and optionally other metadata) extracted from an input encoded bitstream; a stream of audio samples and a corresponding stream of control bits determined by loudness processing state metadata (and optionally other metadata) extracted from an input encoded bitstream; or a stream of audio samples without a corresponding stream of processing state metadata or control bits determined by the processing state metadata. In this last case, the decoder may extract loudness processing state metadata (and/or other metadata) from the input encoded bitstream and perform at least one operation (e.g., validation) on the extracted metadata, but it does not output the extracted metadata or the control bits determined therefrom.
In accordance with an embodiment of the present invention, the post-processing unit 22 of Fig. 1 is configured to accept a stream of decoded PCM audio samples and perform post-processing on it (e.g., volume adjustment of the audio content) using loudness processing state metadata (and optionally other metadata) received with the samples, or control bits received with the samples (determined by the decoder from the loudness processing state metadata and optionally other metadata). The post-processing unit 22 may optionally also be configured to render the post-processed audio content for playback by one or more speakers. These speakers may be embodied in a variety of different listening or playback devices, such as a computer, television, stereo system (home or cinema), mobile phone, or other portable playback device. The speakers may be of any suitable size and power rating, and may be provided in the form of separate drivers, speaker boxes, surround sound systems, soundbars, headphones, earbuds, and the like.
Some embodiments provide an enhanced audio processing chain in which audio processing units (e.g., encoders, decoders, transcoders, and pre- and post-processing units) adapt their respective processing to be applied to the audio data according to the contemporaneous state of the media data indicated by the loudness processing state metadata each unit receives. The audio data input 11 of any audio processing unit of the system 10 (e.g., the encoder or transcoder of Fig. 1) may include loudness processing state metadata (and optionally other metadata) as well as audio data (e.g., encoded audio data). According to some embodiments, this metadata may have been included in the input audio by another element or another source. The processing unit receiving the input audio (with metadata) may be configured to perform at least one operation on the metadata (e.g., verification) or in response to the metadata (e.g., adaptive processing of the input audio), and optionally also to include in its output audio the metadata, a processed version of the metadata, or control bits determined from the metadata.
Embodiments of an audio processing unit (or audio processor) are configured to perform adaptive processing of audio data based on the state of the audio data indicated by the loudness processing state metadata corresponding to the audio data. In some embodiments, the adaptive processing is (or includes) loudness processing (if the metadata indicates that loudness processing, or processing similar to it, has not already been performed on the audio data), and is not (or does not include) loudness processing (if the metadata indicates that such loudness processing, or processing similar to it, has already been performed). In some embodiments, the adaptive processing is or includes metadata validation (e.g., performed in a metadata validation subunit) to ensure that the audio processing unit performs other adaptive processing of the audio data based on the state of the audio data indicated by the loudness processing state metadata. In some embodiments, the validation determines the reliability of the loudness processing state metadata associated with the audio data (e.g., contained in the bitstream). For example, if the metadata is validated as reliable, the results of previously performed audio processing may be reused, and additional performance of the same type of audio processing may be avoided. On the other hand, if the metadata is found to have been tampered with (or to be otherwise unreliable), the media processing purportedly performed previously (as indicated by the unreliable metadata) may be repeated by the audio processing unit, and/or the audio processing unit may perform other processing on the metadata and/or the audio data. If the audio processing unit determines that the loudness processing state metadata is valid (e.g., based on a match between an extracted cryptographic value and a reference cryptographic value), it may also signal to other audio processing units downstream in the enhanced media processing chain that the loudness processing state metadata (e.g., present in the media bitstream) is valid.
For the embodiment of Fig. 1, the pre-processing component 12 may be part of the encoder 14, and the post-processing component 22 may be part of the decoder 20. Alternatively, the pre-processing component 12 may represent a functional component separate from the encoder 14. Similarly, the post-processing component 22 may be embodied as a functional component separate from the decoder 20.
Fig. 2 is a block diagram of an encoder 100 that may be used in conjunction with the system 10 of Fig. 1. Any of the components of the encoder 100 may be implemented as one or more processes and/or one or more circuits (e.g., ASICs, FPGAs, or other integrated circuits) in hardware, software, or a combination of hardware and software. The encoder 100 comprises a frame buffer 110, a parser 111, a decoder 101, an audio state verifier 102, a loudness processing stage 103, an audio stream selection stage 104, an encoder 105, a stuffer/formatter stage 107, a metadata generation stage 106, a dialog loudness measurement subsystem 108, and a frame buffer 109, connected as shown. Optionally, the encoder 100 includes other processing elements (not shown). The encoder 100 (which is a transcoder) is configured to convert an input audio bitstream (which may be, for example, one of an AC-3 bitstream, an E-AC-3 bitstream, or a Dolby E bitstream) into an encoded output audio bitstream (which may be, for example, another of the AC-3, E-AC-3, or Dolby E formats), including by performing adaptive and automated loudness processing using loudness processing state metadata contained in the input bitstream. For example, the encoder 100 may be configured to convert an input Dolby E bitstream (a format typically used in production and broadcast facilities, but not in consumer devices that receive the audio programs broadcast to them) into an encoded output audio bitstream in AC-3 or E-AC-3 format (suitable for broadcast to consumer devices).
The system of fig. 2 also includes an encoded audio delivery system 150 (which stores and/or delivers the encoded bitstream output from the encoder 100) and a decoder 152. The encoded audio bitstream output from the encoder 100 may be stored by the sub-system 150 (e.g., in the form of a DVD or BluRay disc), or transmitted by the sub-system 150 (which may implement a transmission link or network), or both stored and transmitted by the sub-system 150. The decoder 152 is configured to decode the encoded bitstream (generated by the encoder 100) it receives via the subsystem 150, including extracting Loudness Processing State Metadata (LPSM) from each frame of the bitstream, and generating decoded audio data. In one embodiment, the decoder 152 is configured to perform adaptive loudness processing on the decoded audio data using LPSM and/or to forward the decoded audio data and LPSM to a post-processor configured to perform adaptive loudness processing on the decoded audio data using LPSM. Optionally, the decoder 152 comprises a buffer that stores (e.g., in a non-transitory manner) the encoded audio bitstream received from the subsystem 150.
Various implementations of the encoder 100 and decoder 152 are configured to perform the various embodiments described herein. The frame buffer 110 is a buffer memory coupled to receive the encoded input audio bitstream. In operation, the buffer 110 stores (e.g., in a non-transitory manner) at least one frame of the encoded audio bitstream, and a sequence of frames of the encoded audio bitstream is asserted from the buffer 110 to the parser 111. The parser 111 is coupled and configured to extract loudness processing state metadata (LPSM) and other metadata from each frame of the encoded input audio, to assert at least the LPSM to the audio state verifier 102, the loudness processing stage 103, the stage 106, and the subsystem 108, to extract audio data from the encoded input audio, and to assert the audio data to the decoder 101. The decoder 101 of the encoder 100 is configured to decode the audio data to generate decoded audio data, and to assert the decoded audio data to the loudness processing stage 103, the audio stream selection stage 104, the subsystem 108, and optionally also the state verifier 102.
State verifier 102 is configured to authenticate and verify LPSMs (and optionally other metadata) asserted to state verifier 102. In some embodiments, the LPSM is (or is included in) a data block already included in the input bitstream (e.g., in accordance with an embodiment of the present invention). This block may include a cryptographic hash (hashed message authentication code or "HMAC") used to process the LPSM (and optionally other metadata) and/or base layer audio data (provided from the decoder 101 to the verifier 102). In these embodiments, the data block may be digitally signed so that the downstream audio processing unit can authenticate and verify the processing state metadata with relative ease.
For example, HMAC is used to generate a digest, and the protection values included in the bitstream of the present invention may include the digest. The digest may be generated for an AC-3 frame as follows: (1) After the AC-3 data and LPSM are encoded, the frame data bytes (concatenated frame data #1 and frame data #2) and the LPSM data bytes are used as inputs to the hashing function HMAC. Other data that may be present within the auxiliary data (auxdata) field is not considered for the digest calculation. Such other data may be bytes that belong to neither the AC-3 data nor the LPSM data. The protection bits contained in the LPSM may not be considered for the HMAC digest computation. (2) After the digest is computed, it is written into the field in the bitstream reserved for the protection bits. (3) The final step in generating a complete AC-3 frame is to compute the CRC check bits. These are written at the very end of the frame, and all data belonging to the frame (including the LPSM bits) is taken into account.
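The following Python sketch mirrors these steps using the standard hmac module; the choice of SHA-256, the key handling, and the exact byte layout are assumptions of the sketch, since the text above does not fix them.

```python
import hmac
import hashlib

def compute_lpsm_digest(frame_data: bytes, lpsm_data: bytes, key: bytes) -> bytes:
    """Hedged sketch of the digest step described above: the frame data
    bytes and LPSM data bytes feed an HMAC. Bytes in the auxdata field that
    belong to neither, and the LPSM protection bits themselves, are assumed
    to have been excluded before this function is called."""
    return hmac.new(key, frame_data + lpsm_data, hashlib.sha256).digest()

def verify_lpsm_digest(frame_data: bytes, lpsm_data: bytes,
                       key: bytes, carried_digest: bytes) -> bool:
    """A downstream unit recomputes the digest and compares it (in constant
    time) with the value carried in the protection bits."""
    expected = compute_lpsm_digest(frame_data, lpsm_data, key)
    return hmac.compare_digest(expected, carried_digest)
```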
Other encryption methods, including but not limited to one or more non-HMAC encryption methods, may be used for validation of the LPSM (e.g., in the verifier 102) to ensure secure transmission and reception of the LPSM and/or the base layer audio data. For example, validation (using such an encryption method) may be performed in each audio processing unit that receives an embodiment of the inventive audio bitstream, to determine whether the loudness processing state metadata and corresponding audio data contained in the bitstream have undergone (and/or resulted from) the specific loudness processing indicated by the metadata, and have not been modified after such specific loudness processing was performed.
The state verifier 102 asserts control data to the audio stream selection stage 104, the metadata generator 106, and the dialog loudness measurement subsystem 108 to indicate the result of the verification operation. In response to the control data, stage 104 may select (and pass to the encoder 105) either: (1) the adaptively processed output of the loudness processing stage 103 (e.g., when the LPSM indicates that the audio data output by the decoder 101 has not undergone a particular type of loudness processing, and the control bits from the verifier 102 indicate that the LPSM is valid); or (2) the audio data output from the decoder 101 (e.g., when the LPSM indicates that the audio data output by the decoder 101 has already undergone the particular type of loudness processing that would be performed by stage 103, and the control bits from the verifier 102 indicate that the LPSM is valid). In one embodiment, the loudness processing stage 103 corrects the loudness to a specified target and loudness range.
The stage 103 of the encoder 100 is configured to perform adaptive loudness processing on the decoded audio data output from the decoder 101 based on one or more audio data characteristics indicated by the LPSM extracted by the decoder 101. Stage 103 may be an adaptive transform domain real-time loudness and dynamic range control processor. Stage 103 may receive user input (e.g., a user target loudness/dynamic range value or a dialog normalization value), or other metadata input (e.g., one or more third party data, tracking information, identifiers, proprietary or standard information, usage annotation data, user preference data, etc.), and/or other input (e.g., from a fingerprinting process), and use such input to process decoded audio data output from decoder 101.
When a control bit from the verifier 102 indicates that the LPSM is invalid, the dialog loudness measurement subsystem 108 is operable to determine, for example using the LPSM (and/or other metadata) extracted by the decoder 101, the loudness of segments of the decoded audio (from the decoder 101) that are indicative of dialog (or other speech). When the LPSM indicates a previously determined loudness of the dialog (or other speech) segments of the decoded audio (from the decoder 101), and the control bits from the verifier 102 indicate that the LPSM is valid, the operation of the dialog loudness measurement subsystem 108 may be disabled.
There are useful tools for conveniently and easily measuring the level of dialog in audio content (e.g., the Dolby LM100 loudness meter). Some embodiments of an audio processing unit (APU) (e.g., stage 108 of the encoder 100) are implemented to include such a tool (or to perform its function) to measure the average dialog loudness of the audio content of an audio bitstream (e.g., a decoded AC-3 bitstream asserted to stage 108 from the decoder 101 of the encoder 100). If stage 108 is implemented to measure the true average dialog loudness of the audio data, the measurement may include a step of isolating segments of the audio content that predominantly contain speech. The predominantly speech segments are then processed according to a loudness measurement algorithm. For audio data decoded from an AC-3 bitstream, this algorithm may be a standard K-weighted loudness measure (in accordance with international standard ITU-R BS.1770). Alternatively, other loudness measures may be used (e.g., measures based on psychoacoustic models of loudness).
Isolation of speech segments is not necessary to measure the average dialog loudness of the audio data. However, it improves the accuracy of the measurement and provides more satisfactory results from the point of view of the listener. Since not all audio content contains dialog (speech), a loudness measurement of the entire audio content may provide a sufficient approximation of the dialog level of the audio (in the presence of speech).
The metadata generator 106 generates the metadata to be included by stage 107 in the encoded bitstream to be output from the encoder 100. The metadata generator 106 may pass the LPSM (and/or other metadata) extracted by the decoder 101 through to stage 107 (e.g., when the control bits from the verifier 102 indicate that the LPSM and/or other metadata are valid), or generate new LPSM (and/or other metadata) and assert the new metadata to stage 107 (e.g., when the control bits from the verifier 102 indicate that the LPSM and/or other metadata extracted by the decoder 101 are invalid), or it may assert a combination of metadata extracted by the decoder 101 and newly generated metadata to stage 107. The metadata generator 106 may include, in the LPSM it asserts to stage 107 for inclusion in the encoded bitstream to be output from the encoder 100, the loudness data generated by the subsystem 108 and at least one value indicating the type of loudness processing performed by the subsystem 108. The metadata generator 106 may generate protection bits (which may consist of or include a hashed message authentication code or "HMAC") useful for at least one of decryption, authentication, or verification of the LPSM (and/or other metadata) to be included in the encoded bitstream and/or of the base layer audio data to be included in the encoded bitstream. The metadata generator 106 may provide such protection bits to stage 107 for inclusion in the encoded bitstream.
In one embodiment, the dialog loudness measurement subsystem 108 processes the audio data output from the decoder 101 to generate, in response, loudness values (e.g., gated or ungated dialog loudness values) and dynamic range values. In response to these values, the metadata generator 106 may generate loudness processing state metadata (LPSM) for inclusion (by the stuffer/formatter 107) in the encoded bitstream to be output from the encoder 100. In one embodiment, the loudness may be calculated based on techniques specified by the ITU-R BS.1770-1 and ITU-R BS.1770-2 standards, or other similar loudness measurement standards. The gated loudness may be dialog-gated loudness or relative-gated loudness, or a combination of these gating types, and the system may employ appropriate gating blocks depending on application requirements and system constraints.
Additionally, optionally or alternatively, the subsystems 106 and/or 108 of the encoder 100 may perform additional analysis of the audio data to generate metadata indicative of at least one characteristic of the audio data for inclusion in the encoded bitstream to be output from the stage 107. The encoder 105 encodes (e.g., by compressing) the audio data output from the selection stage 104 and asserts the encoded audio to the stage 107 for inclusion in an encoded bitstream to be output from the stage 107.
The stage 107 multiplexes the encoded audio from the encoder 105 and the metadata (including the LPSM) from the generator 106 to generate an encoded bitstream to be output from the stage 107, such that the encoded bitstream has a format as specified by the embodiments. The frame buffer 109 is a buffer memory that stores (e.g., in a non-transitory manner) at least one frame of the encoded audio bitstream output from the stage 107, and then a sequence of frames of the encoded audio bitstream is asserted from the buffer 109 as output of the encoder 100 to the transport system 150.
The LPSM generated by the metadata generator 106 and included in the encoded bitstream by stage 107 indicates the loudness processing state of the corresponding audio data (e.g., the type of loudness processing that has been performed on the audio data) and the loudness of the corresponding audio data (e.g., measured dialog loudness, gated and/or ungated loudness, and/or dynamic range). Here, "gating" of loudness and/or level measurements performed on audio data refers to a specific level or loudness threshold: computed values exceeding the threshold are included in the final measurement (e.g., short-term loudness values below -60 dBFS are ignored in the final measurement). Gating on an absolute value refers to a fixed level or loudness, whereas gating on a relative value refers to a threshold that depends on the current "ungated" measurement value.
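A minimal sketch of such two-stage gating follows; the thresholds use the -60 dBFS example above plus an assumed relative gate, and the K-weighted energy-domain averaging defined by BS.1770-2 is deliberately simplified to plain dB averaging.

```python
def gated_mean_loudness(short_term_db, abs_gate_db=-60.0, rel_gate_db=-10.0):
    """Illustrative two-stage gated measurement. Absolute gating drops
    blocks below a fixed threshold; relative gating then drops blocks more
    than rel_gate_db below the ungated mean of the survivors. Threshold
    values are examples, not the normative BS.1770-2 gate values."""
    kept = [v for v in short_term_db if v >= abs_gate_db]        # absolute gate
    if not kept:
        return float("-inf")
    ungated_mean = sum(kept) / len(kept)
    kept = [v for v in kept if v >= ungated_mean + rel_gate_db]  # relative gate
    return sum(kept) / len(kept)

# The -70 dB block is dropped by the absolute gate; the -48 dB block by the
# relative gate; the result is the mean of the remaining blocks (~ -23.3 dB).
print(gated_mean_loudness([-22.0, -25.0, -70.0, -48.0, -23.0]))
```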
In some implementations of the encoder 100, the encoded bitstream buffered in the memory 109 (and output to the delivery system 150) is an AC-3 bitstream or an E-AC-3 bitstream, and includes audio data segments (e.g., AB0-AB5 segments of the frame shown in fig. 4) and metadata segments, wherein the audio data segments indicate audio data, and each of at least some of the metadata segments includes Loudness Processing State Metadata (LPSM). Stage 107 inserts the LPSM into the bitstream in the following format. Each of the metadata segments including the LPSM is included in an "addbsi" field of a bitstream information ("BSI") segment of a frame of the bitstream or an auxdata field (e.g., an AUX segment shown in fig. 4) at the end of a frame of the bitstream.
A frame of the bitstream may include one or two metadata segments, each of which includes LPSM; if the frame includes two such metadata segments, one is present in the addbsi field of the frame and the other in the AUX field of the frame. Each metadata segment including LPSM includes an LPSM payload (or container) segment having the following format: a header (e.g., including a sync word identifying the beginning of the LPSM payload, followed by at least one identification value, such as the LPSM format version, length, period, count, and substream association values indicated in Table 2 below); and, after the header, at least one dialog indication value (e.g., the parameter "dialog channels" of Table 2) indicating whether the corresponding audio data indicates dialog or does not indicate dialog (e.g., which channels of the corresponding audio data indicate dialog); at least one loudness regulation compliance value (e.g., the parameter "loudness regulation type" of Table 2) indicating whether the corresponding audio data complies with an indicated set of loudness regulations; at least one loudness processing value (e.g., one or more of the parameters "dialog-gated loudness correction flag" and "loudness correction type" of Table 2) indicating at least one type of loudness processing performed on the corresponding audio data; and at least one loudness value (e.g., one or more of the parameters "ITU relative gated loudness", "ITU speech gated loudness", "ITU (EBU 3341) short-term 3s loudness", and "true peak" of Table 2) indicating at least one loudness (e.g., peak or average loudness) characteristic of the corresponding audio data.
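For illustration only, the payload fields listed above might be modeled as follows; the field names paraphrase the Table 2 parameters and are not normative identifiers, and bit widths/encodings are not modeled.

```python
from dataclasses import dataclass

@dataclass
class LpsmPayload:
    """Hedged sketch of the LPSM payload fields described above."""
    sync_word: int                 # identifies the start of the LPSM payload
    format_version: int
    length: int
    period: int
    count: int
    substream_association: int
    dialog_channels: int           # which channels (if any) carry dialog
    regulation_type: int           # which loudness regulation set is claimed
    dialog_corrected: bool         # dialog-gated loudness correction flag
    correction_type: int           # type of loudness correction applied
    itu_relative_gated_db: float   # ITU relative gated loudness
    itu_speech_gated_db: float     # ITU speech gated loudness
    short_term_3s_db: float        # ITU (EBU 3341) short-term 3 s loudness
    true_peak_db: float
```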
In some implementations, each of the metadata segments inserted by stage 107 into the "addbsi" field or the auxdata field of a frame of the bitstream has the following format: a core header (e.g., including a sync word indicating the start of the metadata segment, followed by identification values such as the core element version, length, period, extended element count, and substream association values indicated in Table 1 below); and, after the core header, at least one protection value (e.g., the HMAC digest and audio fingerprint values of Table 1) useful for at least one of decryption, authentication, or verification of at least one of the loudness processing state metadata or the corresponding audio data; and, also after the core header, if the metadata segment includes LPSM, an LPSM payload identification ("ID") value and an LPSM payload size value that identify the subsequent metadata as an LPSM payload and indicate the size of the LPSM payload.
The LPSM payload (or container) segment (e.g., having the format specified above) follows the LPSM payload ID and LPSM payload size values.
In some embodiments, each of the metadata segments in the auxdata field (or "addbsi" field) of a frame has a three-level structure: a high-level structure, including a flag indicating whether the auxdata (or addbsi) field includes metadata, at least one ID value indicating what type(s) of metadata are present, and optionally also a value indicating how many bits of metadata (e.g., of each type) are present (if any metadata is present) — one type of metadata that may be present is LPSM, and another is media survey metadata (e.g., Nielsen media survey metadata); an intermediate-level structure, comprising a core element for each identified type of metadata (e.g., the core header, protection values, and LPSM payload ID and LPSM payload size values mentioned above, for each identified type of metadata); and a low-level structure, comprising each payload for a core element (e.g., an LPSM payload, if one is identified as present by the core element, or a metadata payload of another type, if one is identified as present by the core element).
The data values in such a three-level structure may be nested. For example, the protection values of the LPSM payload and/or another metadata payload identified by a core element may be included after each payload identified by the core element (and thus after the core header of the core element). In one example, the core header may identify an LPSM payload and another metadata payload, the payload ID and payload size values of the first payload (e.g., the LPSM payload) may follow the core header, the first payload itself may follow the ID and size values, the payload ID and payload size of the second payload may follow the first payload, the second payload itself may follow those ID and size values, and the protection values for both payloads (or for the core element values and both payloads) may follow the last payload.
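A hedged sketch of walking this three-level, nested structure (with dictionaries standing in for the bit-level syntax) might look like this; all keys are assumptions of the sketch, not normative field names.

```python
def parse_metadata_segment(segment: dict) -> list:
    """Illustrative walk over the three-level structure described above."""
    if not segment["metadata_present"]:                 # high level: presence flag
        return []
    payloads = []
    for core in segment["core_elements"]:               # intermediate level
        for entry in core["payloads"]:                  # ID/size pairs after header
            payloads.append((entry["payload_id"],       # e.g. "LPSM"
                             entry["payload_size"],
                             entry["body"]))            # low level: payload itself
        _protection = core["protection"]                # follows the last payload
    return payloads

segment = {"metadata_present": True,
           "core_elements": [{"payloads": [{"payload_id": "LPSM",
                                            "payload_size": 14,
                                            "body": b"\x00" * 14}],
                              "protection": b"hmac-digest"}]}
print(parse_metadata_segment(segment))
```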
In some embodiments, if the decoder 101 receives an audio bitstream generated in accordance with an embodiment of the invention that includes a cryptographic hash, the decoder is configured to parse and retrieve the cryptographic hash from a block of data determined from the bitstream, the block including loudness processing state metadata (LPSM). The verifier 102 may use the cryptographic hash to verify the received bitstream and/or the associated metadata. For example, if the verifier 102 finds the LPSM to be valid based on a match between the reference cryptographic hash and the cryptographic hash retrieved from the data block, it may disable the processor 103 from operating on the corresponding audio data and cause the selection stage 104 to pass through the (unchanged) audio data. Additionally, other types of encryption techniques may be used alternatively or in addition to the cryptographic-hash-based approach.
The encoder 100 of fig. 2 may determine (in response to the LPSM extracted by the decoder 101) that the post-processing/pre-processing unit has performed loudness processing on the audio data to be encoded (in elements 105, 106 and 107), and may therefore create (in the generator 106) loudness processing state metadata that includes particular parameters used in and/or derived from previously performed loudness processing. In some embodiments, the encoder 100 may create (and include in the encoded bitstream that it outputs) processing state metadata that indicates the processing history of the audio content, at least as long as the encoder knows the type of processing that has been performed on the audio content.
Fig. 3 is a block diagram of a decoder that may be used in conjunction with the system 10 of Fig. 1. Any of the components or elements of the decoder 200 and post-processor 300 may be implemented as one or more processes and/or one or more circuits (e.g., ASICs, FPGAs, or other integrated circuits) in hardware, software, or a combination of hardware and software. The decoder 200 includes a frame buffer 201, a parser 205, an audio decoder 202, an audio state verification stage (verifier) 203, and a control bit generation stage 204, connected as shown. Optionally, the decoder 200 includes other processing elements (not shown). The frame buffer 201 (a buffer memory) stores (e.g., in a non-transitory manner) at least one frame of the encoded audio bitstream received by the decoder 200. A sequence of frames of the encoded audio bitstream is asserted from the buffer 201 to the parser 205. The parser 205 is coupled and configured to extract loudness processing state metadata (LPSM) and other metadata from each frame of the encoded input audio, to assert at least the LPSM to the audio state verifier 203 and stage 204, to assert the LPSM as output (e.g., to the post-processor 300), to extract audio data from the encoded input audio, and to assert the extracted audio data to the decoder 202. The encoded audio bitstream input to the decoder 200 may be one of an AC-3 bitstream, an E-AC-3 bitstream, or a Dolby E bitstream.
The system of Fig. 3 also includes a post-processor 300. The post-processor 300 includes a frame buffer 301 and other processing elements (not shown), including at least one processing element coupled to the buffer 301. The frame buffer 301 stores (e.g., in a non-transitory manner) at least one frame of the decoded audio bitstream received by the post-processor 300 from the decoder 200. The processing elements of the post-processor 300 are coupled and configured to receive and adaptively process a sequence of frames of the decoded audio bitstream output from the buffer 301, using the metadata (including LPSM values) output from the decoder 202 and/or the control bits output from stage 204 of the decoder 200. In one embodiment, the post-processor 300 is configured to perform adaptive loudness processing on the decoded audio data using the LPSM values (e.g., based on the loudness processing state and/or one or more audio data characteristics indicated by the LPSM). Various implementations of the decoder 200 and post-processor 300 are configured to perform different embodiments of the methods described herein.
Audio decoder 202 of decoder 200 is configured to decode the audio data extracted by parser 205 to generate decoded audio data, and to assert the decoded audio data as output (e.g., to post-processor 300). State verifier 203 is configured to authenticate and verify LPSM (and optionally other metadata) that is asserted to state verifier 203. In some embodiments, the LPSM is (or is included in) a data block already included in the input bitstream (e.g., in accordance with an embodiment of the present invention). This block may include a cryptographic hash (hashed message authentication code or "HMAC") used to process the LPSM (and optionally other metadata) and/or base layer audio data (provided to the verifier 203 from the parser 205 and/or decoder 202). In these embodiments, the data block may be digitally signed so that the downstream audio processing unit can authenticate and verify the processing state metadata with relative ease.
Other encryption methods, including but not limited to one or more non-HMAC encryption methods, may be used for validation of the LPSM (e.g., in the verifier 203) to ensure secure transmission and reception of the LPSM and/or the base layer audio data. For example, validation (using such an encryption method) may be performed in each audio processing unit that receives an embodiment of the inventive audio bitstream, to determine whether the loudness processing state metadata and corresponding audio data contained in the bitstream have undergone (and/or resulted from) the specific loudness processing indicated by the metadata, and have not been modified after such specific loudness processing was performed.
The state verifier 203 asserts control data to the control bit generator 204, and asserts the control data as output (e.g., to the post-processor 300), to indicate the result of the verification operation. In response to the control data (and optionally other metadata extracted from the input bitstream), stage 204 may generate (and assert to the post-processor 300) either: (1) a control bit indicating that the decoded audio data output from the decoder 202 has already undergone a particular type of loudness processing (e.g., when the LPSM indicates that the audio data output from the decoder 202 has undergone that type of loudness processing and the control bits from the verifier 203 indicate that the LPSM is valid); or (2) a control bit indicating that the decoded audio data output from the decoder 202 should undergo a particular type of loudness processing (e.g., when the LPSM indicates that the audio data output from the decoder 202 has not undergone that type of loudness processing, or when the LPSM indicates that it has but the control bits from the verifier 203 indicate that the LPSM is invalid).
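A minimal sketch of this two-case control-bit logic follows, with illustrative names standing in for the actual bit values:

```python
def generate_control_bit(lpsm: dict, lpsm_valid: bool) -> str:
    """Hedged sketch of stage 204's two cases: report that loudness
    processing was already done only when the LPSM both says so and has
    been verified; otherwise request processing downstream. The string
    values and dict key are illustrative stand-ins."""
    if lpsm.get("loudness_processed") and lpsm_valid:
        return "ALREADY_PROCESSED"   # post-processor may skip loudness processing
    return "APPLY_PROCESSING"        # not processed, or metadata not trustworthy

print(generate_control_bit({"loudness_processed": True}, lpsm_valid=False))
# -> APPLY_PROCESSING: a claimed processing state is ignored when the LPSM
#    fails verification.
```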
Alternatively, the decoder 200 asserts LPSM (and any other metadata) extracted from the input bitstream by the decoder 200 to the post-processor 300, and the post-processor 300 performs loudness processing on the decoded audio data using the LPSM, or performs verification of the LPSM, and then performs loudness processing on the decoded audio data using the LPSM if the verification indicates that the LPSM is valid.
In some embodiments, if decoder 200 receives an audio bitstream generated according to an embodiment of the present invention with a cryptographic hash, the decoder is configured to parse and retrieve the cryptographic hash from a data block determined from the bitstream, the block including Loudness Processing State Metadata (LPSM). The verifier 203 may use the cryptographic hash to verify the received bitstream and/or associated metadata. For example, if the verifier 203 finds the LPSM valid based on a match between a reference cryptographic hash and the cryptographic hash retrieved from the data block, it may signal a downstream audio processing unit (e.g., the post-processor 300, which may be or include a volume adjustment unit) to pass through the (unaltered) audio data of the bitstream. Additionally or alternatively, other types of encryption techniques may be used in place of the cryptographic-hash-based approach.
In some implementations of the decoder 200, the received encoded bitstream (buffered in the memory 201) is an AC-3 bitstream or an E-AC-3 bitstream and includes audio data segments (e.g., the AB0-AB5 segments of the frame shown in fig. 4) and metadata segments, where the audio data segments are indicative of audio data and each of at least some of the metadata segments includes Loudness Processing State Metadata (LPSM). The decoder stage 202 is configured to extract from the bitstream LPSM having the following format. Each of the metadata segments including LPSM is included in an "addbsi" field of a bitstream information ("BSI") segment of a frame of the bitstream, or in an auxdata field (e.g., the AUX segment shown in fig. 4) at the end of a frame of the bitstream. A frame of the bitstream may include one or two metadata segments, each of which includes LPSM; if the frame includes two metadata segments, one is present in the addbsi field of the frame and the other in the AUX field of the frame. Each metadata segment including LPSM includes an LPSM payload (or container) segment having the following format: a header (e.g., including a sync word identifying the start of the LPSM payload, followed by at least one identification value, such as the LPSM format version, length, period, count, and substream association values indicated in table 2 below); and, after the header, at least one dialog indication value (e.g., the parameter "dialog channel" of table 2) indicating whether the corresponding audio data indicates dialog or does not indicate dialog (e.g., which channels of the corresponding audio data indicate dialog); at least one loudness regulation compliance value (e.g., the parameter "loudness regulation type" of table 2) indicating whether the corresponding audio data complies with an indicated set of loudness regulations; at least one loudness processing value (e.g., one or more of the parameters "dialog-gated loudness correction flag" and "loudness correction type" of table 2) indicating at least one type of loudness processing that has been performed on the corresponding audio data; and at least one loudness value (e.g., one or more of the parameters "ITU relative gated loudness", "ITU speech gated loudness", "ITU (EBU 3341) short-term 3s loudness", and "true peak" of table 2) indicating at least one loudness (e.g., peak loudness or average loudness) characteristic of the corresponding audio data.
In some implementations, the decoder stage 202 is configured to extract, from the "addbsi" field or the auxdata field of a frame of the bitstream, metadata segments having the following format: a core header (e.g., including a sync word identifying the beginning of the metadata segment, followed by at least one identification value, such as the core element version, length, period, extended element count, and substream association values indicated in table 1 below); after the core header, at least one protection value (e.g., the HMAC digest and audio fingerprint values of table 1) useful for at least one of decryption, authentication, or verification of at least one of the loudness processing state metadata or the corresponding audio data; and, where the metadata segment includes LPSM, an LPSM payload identification ("ID") and an LPSM payload size value, also after the core header, identifying the subsequent metadata as an LPSM payload and indicating the size of the LPSM payload. The LPSM payload (or container) segment (e.g., having the format specified above) follows the LPSM payload ID and LPSM payload size values.
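As a rough illustration of this layout, the following Python sketch walks a metadata segment as a core header followed by (payload ID, payload size, payload) triples; the field widths and the LPSM payload ID value are placeholders, the authoritative widths being those given in tables 1 and 2:

```python
# Hedged sketch of parsing a metadata segment: core header, then
# (ID, size, payload) triples. Field widths are illustrative only.
import struct

LPSM_PAYLOAD_ID = 0x01  # hypothetical ID value, for illustration

def read_metadata_segment(buf: bytes) -> dict:
    """Collect payloads keyed by payload ID from one metadata segment."""
    _sync, _vers, seg_len = struct.unpack_from(">HBH", buf, 0)
    offset, end = 5, min(seg_len, len(buf))
    payloads = {}
    while offset + 3 <= end:
        payload_id, payload_size = struct.unpack_from(">BH", buf, offset)
        offset += 3
        payloads[payload_id] = buf[offset:offset + payload_size]
        offset += payload_size
    return payloads  # payloads.get(LPSM_PAYLOAD_ID) holds the LPSM bytes
```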
More generally, the encoded audio bitstream generated by the embodiments has a structure that provides a mechanism to mark metadata elements and sub-elements as core (mandatory) elements or extension (optional elements). This allows the data rate of the bitstream (including its metadata) to be scaled between various applications. The core (mandatory) elements in the bitstream syntax should also be able to signal the presence (in-band) and/or the remote location (out-of-band) of extension (optional) elements associated with the audio content.
In some embodiments, the core element needs to be present in each frame of the bitstream. Some of the sub-elements of the core element are optional and may be present in any combination. Extension elements need not be present in every frame of the bitstream (to limit bitrate overhead). Thus, the extension elements may be present in some frames and not in other frames. Some sub-elements of the extension element are optional and may be present in any combination, while some sub-elements of the extension element may be mandatory (i.e., if the extension element is present in a frame of the bitstream).
In some embodiments, an encoded audio bitstream comprising a sequence of audio data segments and metadata segments is generated (e.g., by an audio processing unit embodying the present invention). The audio data segments are indicative of audio data, each of at least some of the metadata segments includes Loudness Processing State Metadata (LPSM), and the audio data segments are time-division multiplexed with the metadata segments. In some embodiments of this type, each metadata segment has one of the formats described herein. In one format, the encoded bitstream is an AC-3 bitstream or an E-AC-3 bitstream, and each metadata segment including LPSM is included as additional bitstream information (e.g., by stage 107 of encoder 100) in an "addbsi" field (shown in fig. 6) of a bitstream information ("BSI") segment of a frame of the bitstream, or in an auxdata field of a frame of the bitstream. Each frame includes, in the addbsi field of the frame, a core element having the format shown in table 1 of fig. 8.
In one format, each of the addbsi (or auxdata) fields that contains LPSM contains a core header (and optionally additional core elements), followed by the following LPSM values (parameters): a payload ID (identifying the metadata as LPSM) following the core element values (e.g., as indicated in table 1); a payload size (indicating the size of the LPSM payload) following the payload ID; and LPSM data (following the payload ID and payload size value) having the format indicated in table 2 of fig. 9.
In a second format of the encoded bitstream, the bitstream is an AC-3 bitstream or an E-AC-3 bitstream, and the metadata segments that contain the LPSM (e.g., by stage 107 of encoder 100) are contained in the following fields: an "addbsi" field of a bitstream information ("BSI") segment of a frame of the bitstream, or an auxdata field (e.g., the AUX segment shown in fig. 4) at the end of a frame of the bitstream. The frame may include one or two metadata segments, each of which includes LPSM, and if the frame includes two metadata segments, one exists in the addbsi field of the frame and the other exists in the AUX field of the frame. Each metadata segment comprising LPSM has the format specified above with reference to table 1 or table 2 (i.e., it comprises the core elements specified in table 1, followed by the payload ID specified above (identifying the metadata as LPSM) and the payload size value, followed by the payload (LPSM data having the format indicated in table 2)).
In another implementation, the encoded bitstream is a Dolby E bitstream, and each of the metadata segments including LPSM is included in the first N sample positions of the Dolby E guard band interval. A Dolby E bitstream including such metadata segments includes, for example, a value indicating the LPSM payload length signaled in the Pd word of the SMPTE 337M preamble (the SMPTE 337M Pa word repetition rate may remain identical to the associated video frame rate).
In formats in which the encoded bitstream is an E-AC-3 bitstream, each metadata segment that includes the LPSM is included (e.g., by stage 107 of encoder 100) in an "addbsi" field of a bitstream information ("BSI") segment of a frame of the bitstream. Additional aspects of encoding an E-AC-3 bitstream with LPSM in this format are described as follows: (1) during E-AC-3 bitstream generation, when the E-AC-3 encoder (which inserts LPSM values into the bitstream) is "on", for each frame (sync frame) generated, the bitstream should contain the metadata blocks (including LPSM) carried in the addbsi field of the frame. The bits carrying the metadata block should not increase the encoder bit rate (frame length); (2) each metadata block (including LPSM) should contain the following information:
loudness_correction_type_flag: a "1" indicates that the loudness of the corresponding audio data was corrected upstream of the encoder, and a "0" indicates that the loudness was corrected by a loudness corrector embedded in the encoder (e.g., loudness processor 103 of encoder 100 of fig. 2); speech_channel: indicates which source channel(s) contained speech (over the previous 0.5 seconds); if no speech was detected, this is so indicated; speech_loudness: indicates the integrated speech loudness (over the previous 0.5 seconds) of each corresponding audio channel containing speech; ITU_loudness: indicates the integrated ITU BS.1770-2 loudness of each corresponding audio channel; gain: the composite loudness gain, for reversal in a decoder (to demonstrate reversibility).
When the E-AC-3 encoder (which inserts LPSM values into the bitstream) is "on" and is receiving an AC-3 frame with a "trusted" flag, the loudness controller in the encoder (e.g., loudness processor 103 of encoder 100 of fig. 2) is bypassed. The "trusted" source dialog normalization and DRC values are passed through (e.g., by the generator 106 of the encoder 100) to the E-AC-3 encoder component (e.g., the stage 107 of the encoder 100). LPSM block generation continues, and the loudness_correction_type_flag is set to "1". The loudness controller bypass sequence is synchronized to the start of the decoded AC-3 frame in which the "trusted" flag appears, and is implemented as follows: the leveler_amount control is ramped down from a value of 9 to a value of 0 over 10 audio block periods (i.e., 53.3 milliseconds), and the leveler_back_end_meter control is placed in bypass mode (this operation results in a seamless transition). The "trusted" bypass of the leveler implies that the dialog normalization value of the source bitstream is also reused at the output of the encoder (e.g., if the "trusted" source bitstream has a dialog normalization value of -30, then the output of the encoder should use -30 for the outbound dialog normalization value).
A loudness controller embedded in the encoder (e.g., loudness processor 103 of encoder 100 of fig. 2) is activated when the E-AC-3 encoder (which inserts LPSM values into the bitstream) is "on" and is receiving an AC-3 frame without a "trusted" flag. LPSM block generation continues, and the loudness_correction_type_flag is set to "0". The loudness controller activation sequence is synchronized to the start of the decoded AC-3 frame in which the "trusted" flag disappears, and is implemented as follows: the leveler_amount control is ramped up from a value of 0 to a value of 9 over 1 audio block period (i.e., 5.3 milliseconds), and the leveler_back_end_meter control is placed in active mode (this operation results in a seamless transition and includes a back_end_meter integration reset). During encoding, a graphical user interface (GUI) indicates the following parameters to the user: "Input Audio Program: [Trusted/Untrusted]", whose state is based on the presence of a "trusted" flag in the input signal; and "Real-time Loudness Correction: [Enabled/Disabled]", whose state is based on whether the loudness controller embedded in the encoder is active.
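The two ramps above amount to a linear sweep of the leveler_amount control at the audio block rate. A minimal sketch follows, assuming one audio block is 256 samples at 48 kHz (about 5.33 ms); the endpoint values and block counts come from the text, while the generator API is hypothetical:

```python
# Sketch of the leveler_amount ramps described above.
def leveler_ramp(start: float, end: float, blocks: int):
    """Yield one leveler_amount value per audio block."""
    step = (end - start) / blocks
    for i in range(1, blocks + 1):
        yield round(start + step * i, 2)

print(list(leveler_ramp(9, 0, 10)))  # bypass: 9 -> 0 over ~53.3 ms
print(list(leveler_ramp(0, 9, 1)))   # activation: 0 -> 9 in one block
```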
When decoding an AC-3 or E-AC-3 bitstream containing LPSMs (in the described format) in the addbsi field of a bitstream information ("BSI") segment of each frame of the bitstream, the decoder parses LPSM block data (in the addbsi field) and passes the extracted LPSM values to a Graphical User Interface (GUI). The extracted set of LPSM values is refreshed every frame.
In yet another format, the encoded bitstream is an AC-3 bitstream or an E-AC-3 bitstream, and each of the metadata segments including LPSM is included (e.g., by stage 107 of encoder 100) as additional bitstream information in an "addbsi" field (shown in fig. 6) of a bitstream information ("BSI") segment of a frame of the bitstream (or in an Aux segment). In this format (which is a variation of the format described above with reference to tables 1 and 2), each of the addbsi (or Aux) fields that contains LPSM contains the following LPSM values: the core elements specified in table 1, followed by a payload ID (identifying the metadata as LPSM) and a payload size value, followed by a payload (LPSM data) having the following format (similar to the elements indicated above in table 2): LPSM payload version: a 2-bit field indicating the version of the LPSM payload; dialchan: a 3-bit field indicating whether the left, right, and/or center channel of the corresponding audio data contains spoken dialog. The bit allocation of the dialchan field may be as follows: bit 0, indicating the presence of dialog in the left channel, is stored in the most significant bit of the dialchan field; and bit 2, indicating the presence of dialog in the center channel, is stored in the least significant bit of the dialchan field. Each bit of the dialchan field is set to "1" if the corresponding channel contained spoken dialog during the preceding 0.5 seconds of the program; loudregtyp: a 3-bit field indicating the loudness regulation standard with which the program loudness complies. Setting "loudregtyp" to "000" indicates that the LPSM does not indicate loudness regulation compliance. For example, one value of this field (e.g., 000) may indicate that compliance with a loudness regulation standard is not indicated, another value (e.g., 001) may indicate that the audio data of the program complies with the ATSC A/85 standard, and another value (e.g., 010) may indicate that the audio data of the program complies with the EBU R128 standard. In this example, if the field is set to any value other than "000", the loudcorrdialgat and loudcorrtyp fields should follow in the payload; loudcorrdialgat: a 1-bit field indicating whether dialog-gated loudness correction has been applied. The value of the loudcorrdialgat field is set to "1" if the loudness of the program has been corrected using dialog gating; otherwise it is set to "0"; loudcorrtyp: a 1-bit field indicating the type of loudness correction applied to the program. The value of the loudcorrtyp field is set to "0" if the loudness of the program has been corrected with an infinite look-ahead (file-based) loudness correction process, and to "1" if the loudness of the program has been corrected using a combination of real-time loudness measurement and dynamic range control; loudrelgate: a 1-bit field indicating whether relative gated loudness data (ITU) is present. If the loudrelgate field is set to "1", a 7-bit loudrelgat field should follow in the payload; loudrelgat: a 7-bit field indicating relative gated program loudness (ITU). This field indicates the integrated loudness of the audio program, measured according to ITU-R BS.1770-2, without any gain adjustment due to applied dialog normalization and dynamic range compression. Values 0 to 127 are interpreted as -58 LKFS to +5.5 LKFS, in 0.5 LKFS steps; loudspchgate: a 1-bit field indicating whether speech-gated loudness data (ITU) is present.
If the loudspchgate field is set to "1", a 7-bit loudspchgat field should follow in the payload; loudspchgat: a 7-bit field indicating speech-gated program loudness. This field indicates the integrated loudness of the entire corresponding audio program, measured according to formula (2) of ITU-R BS.1770-3, without any gain adjustment due to applied dialog normalization and dynamic range compression. Values 0 to 127 are interpreted as -58 LKFS to +5.5 LKFS, in 0.5 LKFS steps; loudstrm3se: a 1-bit field indicating whether short-term (3-second) loudness data is present. If this field is set to "1", a 7-bit loudstrm3s field should follow in the payload; loudstrm3s: a 7-bit field indicating the ungated loudness of the preceding 3 seconds of the corresponding audio program, measured according to ITU-R BS.1771-1, without any gain adjustment due to applied dialog normalization and dynamic range compression. Values 0 to 256 are interpreted as -116 LKFS to +11.5 LKFS, in 0.5 LKFS steps; truepke: a 1-bit field indicating whether true peak loudness data is present. If the truepke field is set to "1", an 8-bit truepk field should follow in the payload; and truepk: an 8-bit field indicating the true peak sample value of the program, measured according to annex 2 of ITU-R BS.1770-3, without any gain adjustment due to applied dialog normalization and dynamic range compression. Values 0 to 256 are interpreted as -116 LKFS to +11.5 LKFS, in 0.5 LKFS steps.
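The quantized loudness fields above share a simple linear mapping; for the 7-bit fields, the text gives a range of -58 LKFS to +5.5 LKFS in 0.5 LKFS steps, which a decoder could apply as in this sketch:

```python
# Decode a 7-bit quantized loudness field (e.g., loudrelgat, loudspchgat)
# using the linear mapping stated in the text: code 0..127 -> -58..+5.5 LKFS.
def decode_7bit_loudness(code: int) -> float:
    if not 0 <= code <= 127:
        raise ValueError("7-bit loudness code out of range")
    return -58.0 + 0.5 * code

assert decode_7bit_loudness(0) == -58.0
assert decode_7bit_loudness(127) == 5.5
```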
The core element of a metadata segment in the auxdata field (or "addbsi" field) of a frame of an AC-3 bitstream or E-AC-3 bitstream includes a core header (optionally including identification values, such as a core element version), and, after the core header: a value indicating whether fingerprint data (or other protection values) is included for the metadata of the metadata segment, a value indicating whether external data (related to the audio data corresponding to the metadata of the metadata segment) exists, a payload ID and a payload size value for each type of metadata identified by the core element (e.g., LPSM and/or metadata of a type other than LPSM), and protection values for at least one type of metadata identified by the core element. The metadata payload(s) of the metadata segment follow the core header and are (in some cases) nested within values of the core element.
Optimized loudness and dynamic range system
The secure metadata encoding and delivery scheme described above is used in conjunction with a scalable and extensible system for optimizing loudness and dynamic range across different playback devices, applications, and listening environments, as shown in fig. 1. In one embodiment, the system 10 is configured to normalize the loudness level and dynamic range of the input audio 11 across various devices that require different loudness values and have different dynamic range capabilities. To do so, the system 10 includes different device profiles for the audio content, and the normalization is performed based on these profiles. A profile may be inserted by one of the audio processing units in the audio processing chain, and the inserted profile may be used by downstream processing units in the chain to determine the desired target loudness and dynamic range of the target device. Additional processing components may provide or process information for device profile management, including but not limited to parameters such as the null band range, true peak threshold, loudness range, fast/slow time constants (coefficients), maximum boost amount, gain control, and wideband and/or multi-band gain generation functions.
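One way to picture such a device profile is as a record of the parameters just listed; the following dataclass is an illustrative container only, with assumed field names and units and no normative values:

```python
# Illustrative container for the per-device profile parameters named above.
from dataclasses import dataclass

@dataclass
class DeviceProfile:
    name: str                       # e.g., "portable", "home theater"
    target_loudness_lkfs: float     # desired long-term loudness target
    null_band_db: float             # range around target with no gain change
    true_peak_threshold_dbtp: float
    loudness_range_lu: float
    fast_time_constant_ms: float    # fast/slow smoothing coefficients
    slow_time_constant_ms: float
    max_boost_db: float             # maximum boost amount
```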
Fig. 10 illustrates a more detailed diagram of the system of fig. 1 for a system that provides optimized loudness and dynamic range control, in accordance with some embodiments. For the system 321 of fig. 10, the encoder stage includes a core encoder component 304 that encodes the audio input 303 into a suitable digital format for transmission to the decoder 312. The audio is processed such that it can be played back in a variety of different listening environments, each possibly requiring a different loudness and/or dynamic range target setting. Thus, as shown in fig. 10, the decoder outputs a digital signal that is converted to analog format by digital-to-analog converter 316 for playback through a variety of different driver types, including full-range speaker 320, mini-speaker 322, and headphones 324. These drivers illustrate only some examples of possible playback drivers, and any transducer or driver of any suitable size and type may be used. Additionally, the drivers/transducers 320-324 of FIG. 10 may represent any suitable playback device for use in any corresponding listening environment. Device types may include, for example, AVR, television, stereo, computer, mobile phone, tablet, MP3 player, and the like; and listening environments may include, for example, auditoriums, homes, in-cars, listening rooms, and so forth.
Since playback environments and driver types can range from very small private spaces to very large public venues, the span of possible and optimal playback loudness and dynamic range configurations can vary significantly depending on content type, background noise level, and so on. For example, in a home theater environment, wide dynamic range content may be played through a surround-sound system, while narrower dynamic range content may be played through a conventional television system (such as a flat-panel LED/LCD type), and a very narrow dynamic range mode may be used in certain listening situations where large level variations are not desired (e.g., at night, or on devices with severe acoustic output power limitations, such as the internal speakers or headphone output of a mobile phone or tablet). In portable or mobile listening scenarios, such as listening through a small computer, docking speakers, or headphones/earbuds, the optimal playback dynamic range may vary with the environment; for example, the optimal dynamic range may be greater in a quiet environment than in a noisy one. The embodiment of the adaptive audio processing system of fig. 10 varies the dynamic range based on parameters such as the listening environment and playback device type to present the audio content more clearly.
Fig. 11 is a table showing different dynamic range requirements for various playback devices and background listening environments in an exemplary use case; similar requirements apply to loudness. The different dynamic range and loudness requirements generate the different profiles used by the optimization system 321. The system 321 includes a loudness and dynamic range measurement component 302 that analyzes and measures the loudness and dynamic range of the input audio. In one embodiment, the system analyzes the overall program content to determine an overall loudness parameter. In this context, loudness refers to the long-term program loudness, or average loudness, of a program, where a program is a single unit of audio content, such as a movie, television show, commercial, or similar program content. Loudness provides an indication of the artistic dynamic range profile used by the content creator to control how the audio will be played back. Loudness is related to the dialog normalization metadata value, which represents the average dialog loudness of a single program (e.g., movie, television show, commercial, and so on). Short-term dynamic range quantifies signal variations over a much shorter period than the program loudness; for example, short-term dynamic range may be measured on the order of seconds, while program loudness may be measured over a span of minutes or even hours. Short-term dynamic range provides a protection mechanism, independent of program loudness, that ensures overload does not occur for the various playback profiles and device types. In one embodiment, the loudness (long-term program loudness) target is based on dialog loudness, while the short-term dynamic range is based on relative-gated and/or ungated loudness. In this case, certain DRC and loudness components in the system are context-aware with respect to the content type and/or the target device type and characteristics. As part of this context-awareness capability, the system is configured to analyze one or more characteristics of the output device to determine whether the device is a member of a particular class of devices (such as AVR-type devices, televisions, computers, portable devices, and so on) optimized for certain DRC and loudness playback conditions.
The pre-processing component analyzes the program content to determine loudness, peaks, true peaks, and quiet periods, and creates unique metadata for each of a plurality of different profiles. In one embodiment, the loudness may be dialog-gated loudness and/or relative-gated loudness. Different profiles define various DRC (dynamic range control) and target loudness modes, in which different gain values are generated in the encoder depending on the characteristics of the source audio content, the desired target loudness, and the playback device type and/or environment. The decoder may provide different DRC and target loudness modes (enabled by the above-mentioned profiles), which may include: DRC and target loudness off/disabled, allowing full dynamic range presentation with no compression of the audio signal and no loudness normalization; a line mode for playback on a home theater system, providing medium dynamic range compression through gain values generated in the encoder (specifically for this playback mode and/or device profile) and performing loudness normalization targeted at -31 LKFS; an RF mode for playback through TV speakers, providing heavy dynamic range compression and performing loudness normalization targeted at -24, -23, or -20 LKFS; an intermediate mode for playback through a computer or similar device, providing compression and performing loudness normalization targeted at -14 LKFS; and a portable mode, providing very heavy dynamic range compression and performing loudness normalization targeted at -11 LKFS. The target loudness values of -31, -23/-20, -14, and -11 LKFS are examples of different playback/device profiles that may be defined for a system according to some embodiments; any other suitable target loudness values may be employed, and the system may generate suitable gain values for these and other playback modes and/or device profiles. Furthermore, the system may be extended and modified so that different playback devices and listening environments can be accommodated by defining new profiles at the encoder or elsewhere and loading them into the encoder. In this way, new and unique playback/device profiles may be generated to support improved or different playback devices for future applications.
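Collected into one table, the modes above might be represented as in this sketch; the target values are those given in the text, while the dictionary layout and qualitative compression labels are illustrative:

```python
# The five playback/device modes described above, as a lookup table.
PLAYBACK_MODES = {
    "full":         {"target_lkfs": None, "drc": "off (full dynamic range)"},
    "line":         {"target_lkfs": -31,  "drc": "medium"},
    "rf":           {"target_lkfs": -20,  "drc": "heavy"},  # or -24 / -23
    "intermediate": {"target_lkfs": -14,  "drc": "moderate"},
    "portable":     {"target_lkfs": -11,  "drc": "very heavy"},
}
```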
In one embodiment, the gain value may be calculated at any suitable processing component of the system 321, such as at the encoder 304, the decoder 312, or the transcoder 308, or any associated pre-processing component associated with the encoder or any post-processing component associated with the decoder.
FIG. 13 is a block diagram illustrating an interface between different profiles for various playback device categories, according to one embodiment. As shown in fig. 13, an encoder 502 receives audio input 501 and one of several different possible profiles 506. The encoder combines the audio data with the selected profile to generate an output bitstream file 504 that is processed in a decoder component in, or associated with, the target playback device. For the example of fig. 13, the different playback devices may be a computer 510, a mobile phone 512, an AVR 514, and a television 516, though many other output devices are possible. Each of the devices 510 through 516 includes or is coupled to speakers (including drivers and/or transducers), such as drivers 320 through 324. The combination of the playback device's size, power rating, and processing, together with the associated speakers, generally dictates which profile is optimal for a particular target. Thus, profiles 506 may be specifically defined for playback over an AVR, a TV, mobile speakers, mobile headphones, and the like. They may also be defined for particular operating modes or conditions (such as quiet mode, night mode, outdoor, indoor, and so on). The profiles shown in FIG. 13 are merely exemplary, and any suitable profile may be defined, including custom profiles for particular targets and environments.
Although fig. 13 illustrates an embodiment in which the encoder 502 receives the profile 506 and generates suitable parameters for loudness and DRC processing, it should be noted that the parameters generated based on the profile and audio content may be performed on any suitable audio processing unit, such as an encoder, a decoder, a transcoder, a pre-processor, a post-processor, etc. For example, each output device 510 through 516 of fig. 13 has or is coupled to a decoder component that processes metadata in the bitstream of the file 504 sent from the encoder 502 to enable loudness and dynamic range to be adapted to match the device or device type of the target output device.
In one embodiment, the dynamic range and loudness of the audio content is optimized for each possible playback device. This is achieved by maintaining long-term loudness as a target for each target playback mode and controlling the short-term dynamic range to optimize the audio experience (by controlling signal dynamics, sample peaks, and/or true peaks). Different metadata elements are defined for long-term loudness and short-term dynamic range. As shown in FIG. 10, component 302 analyzes the entire input audio signal (or a portion thereof, such as the speech component, if applicable) for the relevant characteristics of these individual DR components. This allows different gain values to be defined for the artistic gain versus clipping (overload protection) gain values.
These gain values for long-term loudness and short-term dynamic range are then mapped to a profile 305 to generate parameters describing the loudness and dynamic range control gain values. These parameters are combined with the encoded audio signal from the encoder 304 in a multiplexer 306 or similar component to create a bitstream that is transmitted to the decoder stage via a transcoder 308. The bitstream input to the decoder stage is demultiplexed in the demultiplexer 310 and then decoded in the decoder 312. The gain component 314 applies the gains corresponding to the appropriate profile to generate digital audio data, which is then processed by the DAC unit 316 for playback through the appropriate playback device and drivers or transducers 320-324.
Fig. 14 is a table showing the correlation between long term loudness and short term dynamic range for a plurality of defined profiles, according to one embodiment. As shown in table 4 of fig. 14, each profile includes a set of gain values that indicate the amount of Dynamic Range Compression (DRC) applied in the decoder or each target device of the system. Each of the N profiles, indicated as profiles 1-N, sets a particular long-term loudness parameter (e.g., dialog normalization) and overload compression parameter by indicating a corresponding gain value applied in the decoder stage. The DRC gain values for the profile may be defined by an external source accepted by the encoder, or they may be generated internally in the encoder as default gain values if no external values are provided.
In one embodiment, the gain value for each profile is embodied in a DRC gain word that is calculated based on an analysis of certain characteristics of the audio signal, such as peak, true peak, short-term loudness of dialog or of the overall signal, or a combination (mix) thereof. The static gain is calculated from the selected profile (e.g., its transfer characteristic or curve) together with the fast/slow attack and fast/slow release time constants required to achieve the final DRC gain for each possible device profile and/or target loudness. As described above, these profiles may be preset in the encoder or decoder, or generated externally and conveyed to the encoder via external metadata from the content creator.
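A minimal sketch of such time-constant-driven gain smoothing follows; the one-pole update, the coefficient values, and the attack/release selection rule are illustrative assumptions rather than the encoder's actual DRC gain-word computation:

```python
# One block-rate smoothing step toward the static (curve) gain.
import math

def smooth_gain_db(target_db: float, state_db: float, fs_hz: int = 48000,
                   block: int = 256, attack_ms: float = 10.0,
                   release_ms: float = 500.0) -> float:
    # Falling gain (the signal got louder) uses the attack constant;
    # rising gain uses release. A fast/slow variant would switch between
    # two constants per direction based on the size of the gain error.
    tau_s = (attack_ms if target_db < state_db else release_ms) / 1000.0
    alpha = math.exp(-(block / fs_hz) / tau_s)
    return alpha * state_db + (1.0 - alpha) * target_db
```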
In one embodiment, the gain value may be a wideband gain that applies the same gain across all frequencies of the audio content. Alternatively, the gain may comprise a multi-band gain value, such that different gain values are applied for different frequencies or frequency bands of the audio content. In the multi-channel case, each profile may constitute a matrix of gain values indicating the gains for different frequency bands, rather than a single gain value.
Referring to fig. 10, in one embodiment, information regarding the nature or characteristics of the listening environment and/or the capabilities and configuration of the playback device is provided from the decoder stage to the encoder stage over feedback link 330. Profile information 332 is also input to the encoder 304. In one embodiment, the decoder analyzes metadata in the bitstream to determine whether loudness parameters for a first set of playback devices are available in the bitstream. If so, it sends the parameters downstream for rendering the audio; otherwise, it analyzes certain characteristics of the device and derives the parameters, then sends them to the downstream rendering component for playback. The decoder also determines the output device (or a set of output devices that includes the output device) that will render the received audio stream. For example, the output device may be determined to be a cellular phone, or to belong to a group of similar portable devices. In one embodiment, the decoder uses the feedback link 330 to indicate the determined output device or set of output devices to the encoder. For this feedback, a module connected to the output device (e.g., a component in a sound card connected to headphones or speakers in a laptop) may indicate to the decoder the identity of the output device, or of a group of devices that includes the output device, and the decoder transmits this information to the encoder via feedback link 330. In one embodiment, the encoder determines the loudness and DRC parameters. In another embodiment, the decoder determines the loudness and DRC parameters; in this embodiment, rather than transmitting the device information over feedback link 330, the decoder itself uses the information about the determined device or set of output devices to determine the loudness and DRC parameters. In yet another embodiment, another audio processing unit determines the loudness and DRC parameters, and the decoder transmits the device information to that audio processing unit instead of to the encoder.
FIG. 12 is a block diagram of a dynamic range optimization system, according to one embodiment. As shown in fig. 12, an encoder 402 receives input audio 401. The encoded audio is combined in multiplexer 409 with parameters 404 generated from a selected compression curve 422 and a dialog normalization value 424. The resulting bitstream is transmitted to demultiplexer 411, which recovers the audio signal that is then decoded by decoder 406. The parameters and dialog normalization value are used by gain calculation unit 408 to generate gain levels that drive amplifier 410, which amplifies the decoder output. Fig. 12 thus shows how dynamic range control is parameterized and inserted into the bitstream. Loudness may also be parameterized and inserted into the bitstream using similar components. In one embodiment, an output reference level control (not shown) may also be provided to the decoder. Although the figure shows the loudness and dynamic range parameters being determined and inserted at the encoder, similar determinations may be performed at other audio processing units (e.g., a pre-processor, decoder, or post-processor).
Fig. 15 illustrates examples of loudness profiles for different types of audio content, according to one embodiment. As shown in fig. 15, exemplary curves 600 and 602 plot input loudness (in LKFS) against gain, centered at 0 LKFS. Different types of content exhibit different curves; for example, curve 600 may represent speech and curve 602 may represent standard film content. As shown in fig. 15, the speech content receives a greater amount of gain than the film content. FIG. 15 is an example of representative profile curves for certain types of audio content, and other profiles may be used. Certain aspects of the profile characteristics shown in fig. 15 are used to derive the relevant parameters of the optimization system. In one embodiment, these parameters include: null band width, cut ratio, boost ratio, maximum boost amount, fast/slow attack, fast/slow decay, holdoff, peak limit, and target level loudness. Other parameters may be used in addition to or in place of at least some of these parameters, depending on application requirements and system constraints.
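Taken together, these parameters describe a static transfer characteristic; the following sketch shows one plausible shape, with every numeric default invented for illustration:

```python
# Illustrative static compression curve built from the parameters above:
# no gain inside the null band, cutting above it, bounded boosting below it.
def static_gain_db(input_lkfs: float, target_lkfs: float = -24.0,
                   null_band_db: float = 5.0, cut_ratio: float = 2.0,
                   boost_ratio: float = 2.0, max_boost_db: float = 12.0) -> float:
    error = input_lkfs - target_lkfs
    if abs(error) <= null_band_db / 2:
        return 0.0                                   # inside the null band
    if error > 0:                                    # too loud: cut
        return -(error - null_band_db / 2) * (1 - 1 / cut_ratio)
    boost = (-error - null_band_db / 2) * (1 - 1 / boost_ratio)
    return min(boost, max_boost_db)                  # too quiet: bounded boost
```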
Fig. 16 is a flow diagram illustrating a method for optimizing loudness and dynamic range across playback devices and applications, according to one embodiment. Although the figure shows the optimization performed at the encoder, similar optimization may be performed at other audio processing units (e.g., a pre-processor, decoder, or post-processor). As shown in process 620, the method begins with the encoder stage receiving an input signal from a source (603). The encoder or a pre-processing component then determines whether the source signal has already undergone processing to achieve the target loudness and/or dynamic range (604). The target loudness corresponds to the long-term loudness and may be defined externally or internally. If the source signal has not undergone such processing, the system performs the appropriate loudness and/or dynamic range control operations (608); otherwise, the system enters a bypass mode that skips these control operations and lets the original processing dictate the appropriate long-term loudness and/or dynamic range (606). The appropriate gain values from either the bypass mode 606 or the performed mode 608 (which may be a single wideband gain value or frequency-dependent multi-band gain values) are then applied in the decoder (612).
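Process 620 can be restated compactly as follows; the function is a sketch whose names and return structure are hypothetical:

```python
# Sketch of process 620: bypass (606) versus perform control (608).
def optimize_loudness(signal, profile: dict, already_processed: bool) -> dict:
    if already_processed:
        # Step 606: trust the original processing; apply no further control
        gains = {"mode": "bypass", "gain_db": 0.0}
    else:
        # Step 608: derive loudness/DRC gains for the target profile
        gains = {"mode": "process", "gain_db": profile.get("gain_db", 0.0)}
    return {"signal": signal, "gains": gains}  # gains applied in decoder (612)
```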
Bit stream format
As mentioned previously, the system for optimizing loudness and dynamic range employs a secure, extensible metadata format to ensure that the metadata and audio content transmitted in the bitstream between an encoder and a decoder, or between a source and a rendering/playback device, are not separated from each other or corrupted during transmission over a network or through other equipment, such as a service provider interface. The bitstream provides a mechanism for signaling the encoder and/or decoder components, through appropriate profile information, to change the loudness and dynamic range of the audio signal to suit the audio content and output device characteristics. In one embodiment, the system is configured such that a low-bitrate coded bitstream is transmitted between the encoder and the decoder, and the loudness information encoded in the metadata includes characteristics of one or more output profiles. A description of a bitstream format for a loudness and dynamic range optimization system according to one embodiment follows.
An AC-3 encoded bitstream includes metadata and one to six channels of audio content. The audio content is audio data that has been compressed using perceptual audio coding. The metadata includes several audio metadata parameters that are intended for use in changing the sound of a program delivered to a listening environment. Each frame of an AC-3 encoded audio bitstream contains audio content and metadata for 1536 samples of digital audio. For a sampling rate of 48 kHz, this represents 32 milliseconds of digital audio, or a rate of 31.25 frames of audio per second.
Each frame of an E-AC-3 encoded audio bitstream contains audio content and metadata for 256, 512, 768, or 1536 samples of digital audio, depending on whether the frame contains one, two, three, or six blocks of audio data. For a sampling rate of 48 kHz, this represents 5.333, 10.667, 16, or 32 milliseconds of digital audio, respectively, or a rate of 187.5, 93.75, 62.5, or 31.25 frames of audio per second, respectively.
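These timings follow directly from the block sizes and the 48 kHz sampling rate, as this quick check shows:

```python
# Frame duration and frame rate for each E-AC-3 block count at 48 kHz.
for samples in (256, 512, 768, 1536):
    ms = samples / 48000 * 1000
    fps = 48000 / samples
    print(f"{samples} samples -> {ms:.3f} ms, {fps} frames/s")
# 256 -> 5.333 ms (187.5/s); 512 -> 10.667 ms (93.75/s);
# 768 -> 16.000 ms (62.5/s); 1536 -> 32.000 ms (31.25/s)
```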
As indicated in fig. 4, each AC-3 frame is divided into sections (segments), including: a synchronization information (SI) section containing (as shown in fig. 5) a synchronization word (SW) and the first of two error correction words (CRC1); a bitstream information (BSI) section containing most of the metadata; six audio blocks (AB0 through AB5) containing data-compressed audio content (and possibly metadata as well); a waste bits segment (W) containing any unused bits left over after the audio content is compressed; an auxiliary (AUX) information section containing more metadata; and the second of the two error correction words (CRC2).
As shown in fig. 7, each E-AC-3 frame is divided into sections (segments), including: a synchronization information (SI) section containing (as shown in fig. 5) a synchronization word (SW); a bitstream information (BSI) section containing most of the metadata; between one and six audio blocks (AB0 through AB5) containing data-compressed audio content (and possibly metadata as well); a waste bits segment (W) containing any unused bits left over after the audio content is compressed; an auxiliary (AUX) information section containing more metadata; and an error correction word (CRC).
In an AC-3 (or E-AC-3) bitstream, there are several audio metadata parameters that are specifically intended to change the sound of a program delivered to a listening environment. One of the metadata parameters is a dialog normalization parameter, which is included in the BSI segment.
As shown in fig. 6, the BSI segment of an AC-3 frame includes a 5-bit parameter ("dialnorm") indicating the dialog normalization value of the program. If the audio coding mode ("acmod") of the AC-3 frame is "0", indicating a dual-mono or "1+1" channel configuration, a 5-bit parameter ("dialnorm2") indicating the dialog normalization value of the second audio program carried in the same AC-3 frame is also included.
The BSI segment also includes a flag ("addbsie") indicating the presence (or absence) of additional bitstream information after the "addbsie" bit, a parameter ("addbsil") indicating the length of any additional bitstream information after the "addbsil" value, and up to 64 bits of additional bitstream information ("addbsi") after the "addbsil" value. The BSI segment may include other metadata values not specifically shown in fig. 6.
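For orientation, the dialnorm field is commonly interpreted as follows; the treatment of code 0 here is an assumption of the sketch, not something this document specifies:

```python
# Common interpretation of the 5-bit dialnorm code: 1..31 -> -1..-31 dBFS,
# with the reserved code 0 treated as -31 (an assumption in this sketch).
def dialnorm_to_db(code: int) -> int:
    if not 0 <= code <= 31:
        raise ValueError("dialnorm is a 5-bit field")
    return -31 if code == 0 else -code

assert dialnorm_to_db(31) == -31
assert dialnorm_to_db(1) == -1
```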
Aspects of one or more embodiments described herein may be implemented in an audio system that processes audio signals for transmission over a network including one or more computers or processing devices executing software instructions. Any of the embodiments described may be used alone or in any combination in combination with each other. While various embodiments have been conceived in view of the various deficiencies of the prior art (which have been discussed or referred to in one or more places of the specification), embodiments do not necessarily address any of these deficiencies. In other words, different embodiments may address different deficiencies that may be discussed in the specification. Some embodiments may only partially address some or only one of the deficiencies that may be discussed in the specification, while some embodiments may not address any of these deficiencies.
Various aspects of the systems described herein may be implemented in a suitable computer-based sound processing network environment for processing digital or digitized audio files. Portions of the adaptive audio system may include one or more networks containing any desired number of individual machines, including one or more routers (not shown) for buffering and routing data transmitted between the computers. Such networks may be constructed based on a variety of different computer protocols, and may be the internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.
One or more of the components, blocks, processes, or other functional components may be implemented by a computer program that controls the execution of a processor-based computing device of the system. It should also be noted that the various functions disclosed herein, in terms of their behavioral, register transfer, logic composition, and/or other characteristics, may be implemented using any number of combinations of hardware, firmware, and/or data and/or instructions embodied in various machine-readable or computer-readable media. Computer-readable media in which such formatted data and/or instructions are embodied include, but are not limited to, physical (non-transitory), non-volatile storage media in various forms, such as optical, magnetic or semiconductor storage media.
Unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise," "comprising," and the like are to be construed in an inclusive sense, as opposed to an exclusive or exhaustive sense; that is, in the sense of "including, but not limited to." Words using the singular or plural number also include the plural or singular number, respectively. Additionally, the words "herein," "hereunder," "above," "below," and words of similar import refer to this application as a whole and not to any particular portions of this application. When the word "or" is used in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list, all of the items in the list, and any combination of the items in the list.
While one or more implementations have been illustrated and described in accordance with specific embodiments, it is to be understood that the one or more implementations are not limited to the disclosed embodiments. On the contrary, it is intended to cover various modifications and similar arrangements as would be apparent to those skilled in the art. Accordingly, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.
Claims (15)
1. An audio processing unit for decoding one or more frames of an encoded audio bitstream, the encoded audio bitstream comprising audio data and a metadata container, and the metadata container comprising one or more metadata payloads, the audio processing unit comprising:
an input buffer for storing the one or more frames of the encoded audio bitstream;
an audio decoder coupled to the input buffer for decoding the audio data;
a parser coupled to or integrated with the audio decoder, the parser configured to parse the encoded audio bitstream,
wherein the metadata container begins with a sync word identifying a beginning of the metadata container, the one or more metadata payloads include a parameter specifying a Dynamic Range Compression (DRC) profile selected from a plurality of DRC profiles, each of the plurality of DRC profiles corresponds to a unique compression curve having associated time constants, and the one or more metadata payloads are followed by protection data that can be used to decrypt, authenticate, or verify the one or more metadata payloads.
2. The audio processing unit of claim 1, wherein the time constants comprise both slow and fast rise time constants and both slow and fast release time constants.
3. The audio processing unit of claim 2, wherein the unique compression curve is further defined by a null band range and a maximum boost amount.
4. The audio processing unit of claim 1, wherein the parameter specifies a DRC profile for the portable device indicating that relatively heavy compression should be applied to the audio data.
5. The audio processing unit of claim 1, wherein the metadata container is stored in an AC-3 or E-AC-3 reserved data space selected from the group consisting of an auxdata field, an addbsi field, and combinations thereof.
6. The audio processing unit of claim 1, wherein the one or more metadata payloads include a program loudness payload containing data indicative of a measured loudness of an audio program.
7. The audio processing unit of claim 6, wherein the program loudness payload includes a field that indicates whether an audio channel contains spoken dialog.
8. The audio processing unit of claim 6, wherein the program loudness payload includes a field indicating a loudness measurement method that was used to generate the loudness data contained in the program loudness payload.
9. The audio processing unit of claim 6, wherein the audio processing unit is configured to perform adaptive loudness processing using the program loudness payload.
10. The audio processing unit of claim 1, wherein the encoded audio bitstream is an AC-3 bitstream or an E-AC-3 bitstream.
11. The audio processing unit of claim 1, wherein each metadata payload includes a payload identifier that is unique and located at a beginning of each metadata payload.
12. A non-transitory computer readable medium comprising at least one frame of an encoded audio bitstream, the at least one frame comprising:
audio data; and
a metadata container comprising one or more metadata payloads, and protection data that can be used to decrypt, authenticate, or verify the one or more metadata payloads;
wherein the metadata container begins with a sync word identifying a beginning of the metadata container, the one or more metadata payloads comprising parameters specifying a Dynamic Range Compression (DRC) profile selected from a plurality of DRC profiles, wherein each of the plurality of DRC profiles corresponds to a unique compression curve having a time constant and is followed by the protection data.
13. The non-transitory computer-readable medium of claim 12, wherein the time constants comprise both slow and fast rise time constants and both slow and fast release time constants.
14. The non-transitory computer readable medium of claim 12, wherein the unique compression curve is further defined by a null band range and a maximum boost amount.
15. The non-transitory computer-readable medium of claim 12, wherein the parameter specifies a DRC profile for the portable device indicating that relatively heavy compression should be applied to the audio data.
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US61/754,882 | 2013-01-21 | ||
| US61/809,250 | 2013-04-05 | ||
| US61/824,010 | 2013-05-16 |
Related Parent Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| HK16101081.0A Addition HK1213374B (en) | 2013-01-21 | 2014-01-15 | Optimizing loudness and dynamic range across different playback devices |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| HK16101081.0A Division HK1213374B (en) | 2013-01-21 | 2014-01-15 | Optimizing loudness and dynamic range across different playback devices |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1261365A1 true HK1261365A1 (en) | 2019-12-27 |
| HK1261365B HK1261365B (en) | 2023-12-22 |