WO2025230732A1 - Objective audio content quality assessment - Google Patents
Info
- Publication number
- WO2025230732A1 (PCT/US2025/025016)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- context
- objective
- content
- audio content
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Definitions
- the present disclosure is directed to the general area of audio processing, and more particularly, to methods, apparatus, and systems for performing objective audio content quality assessment.
- Such limitation is typically caused by pre-set attributes and corresponding objective metrics.
- objective evaluation methods provide an assessment based on constant rules of attributes and metrics, but the trade-offs between the attributes may vary with content type, and attributes that are crucial for certain content may not be important, or may even be meaningless, for other content.
- the objective metrics can only predict the subjective perception in a certain range, and thus should not be utilized if the use case does not match.
- the objective evaluation fails to work for diverse audio content, especially content produced by non-professional users (e.g., content of vlogs, etc.).
- the present disclosure generally provides a method of performing objective audio quality assessment of audio content, a corresponding system, an apparatus, a program, as well as a computer-readable storage medium, having the features of the respective independent claims.
- a method of performing objective audio quality assessment of audio content is provided.
- the method may comprise determining a plurality of audio context types (e.g., music, speech, etc.) related to the audio content.
- the audio context types may be pre-defined or pre-determined in any suitable manner, for example depending on respective use cases or the like.
- the method may further comprise classifying and segmenting the audio content into a plurality of segments based on the determined audio context types, such that each audio segment is dominated by one audio context.
- the method may also comprise performing a respective objective audio quality evaluation on each audio segment based on the associated audio context.
- any suitable objective audio quality evaluation may be performed depending on various implementations and/or requirements, which may include (but is not limited to) evaluation of any suitable metric/attribute such as signal-to-noise ratio, total harmonic distortion, or the like. This is not to be limited in the present disclosure.
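As a purely illustrative, hedged sketch (not part of the claimed method), a basic signal-to-noise ratio metric of the kind mentioned above could be computed as follows; the function name and the toy test signal are assumptions for demonstration only:

```python
import numpy as np

def snr_db(clean: np.ndarray, degraded: np.ndarray) -> float:
    """Signal-to-noise ratio in dB between a reference signal and a
    degraded version of the same length (a common objective metric)."""
    noise = degraded - clean
    signal_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    return float(10.0 * np.log10(signal_power / noise_power))

# Toy check: a 440 Hz tone with weak additive white noise.
t = np.linspace(0, 1, 8000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
degraded = clean + 0.01 * np.random.default_rng(0).standard_normal(t.size)
print(round(snr_db(clean, degraded), 1))  # roughly 37 dB for this noise level
```

A reference-free metric (or a perceptual model) could be substituted here; the reference-based form is used only because it is the simplest to state.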
- the audio content may be of any suitable type/format (e.g., a real-time (online) audio stream, pre-recorded audio, etc.), originating from any suitable source (e.g., produced by a professional or non-professional user) and sunk to any suitable target (e.g., an end-user).
- such objective audio quality assessment techniques may be used by a (professional or non-professional) content creator, a consumer, a content distributor (e.g., a broadcast service provider), or even a (manual or automated) audio processor who may, based on the quality evaluation, perform further (audio) processing. This is not to be limited in the present disclosure.
- the proposed method may generally provide an efficient yet flexible manner for performing objective audio quality assessment of audio content, which may be considered context-relevant or context-aware.
- the objective evaluation can subsequently be conducted on segments of mono context, thereby greatly improving the evaluation reliability.
- key attributes that may influence the subjective evaluation can be clarified, such that objective metrics can be used to predict the attributes and the trade-offs of the subjective assessment, in order to evaluate or predict the audio quality for the content.
- the plurality of audio context types may be determined based on a use case relating to the audio content.
- the use case(s) may be pre-determined or pre-defined depending on various implementations and/or requirements.
- the use case (scenario) may comprise an indoor use case, an outdoor use case, a vlog, a broadcast service (e.g., podcast), or the like.
- the use case may also be defined in a coarser or more detailed manner, or in a hierarchical, parallel, or tree arrangement, if deemed necessary.
- the outdoor use case may be further classified/categorized into an outdoor sports event use case, an outdoor music event use case, or the like.
- the plurality of audio context types may also be determined in an analogous manner. This is not to be limited in the present disclosure.
- the plurality of audio context types may comprise at least one of: speech, music, or ambient sound.
- any other suitable context type may be defined or determined depending on various implementations and/or requirements, as can also be understood and appreciated by the skilled person. This is not to be limited in the present disclosure.
- the plurality of audio context types may be determined based on at least one of: an ambience or background of the audio content, a channel layout of the audio content (e.g., stereo or 5.1), a recording format of the audio content, or a speaker gender (e.g., male or female).
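As one hedged, rule-based illustration of how such content properties might feed into the determination of candidate context types (every rule and name below is an assumption for illustration, not taken from this disclosure):

```python
# Hypothetical heuristic: derive candidate context types from simple
# content properties such as channel layout and presence of background.
def candidate_context_types(channel_layout: str, has_background: bool) -> list:
    types = ["speech", "music"]
    if has_background:
        types.append("ambient")
    if channel_layout != "mono":
        # Spatial information of multi-channel content may matter in
        # subjective evaluation, so track it as its own context.
        types.append("spatial_" + channel_layout)
    return types

print(candidate_context_types("stereo", True))
# ['speech', 'music', 'ambient', 'spatial_stereo']
```

A machine-learning classifier could replace these rules, per the machine-learning alternative mentioned below.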
- the plurality of audio context types may be in a hierarchical arrangement, a parallel arrangement, a tree arrangement, or the like. Similar to the above-mentioned definition/arrangement for the use case, the plurality of audio context types may also be defined/arranged in any suitable form. As can be understood and appreciated by the skilled person, the above arrangements are merely some possible illustrative examples and thus should not be understood to constitute a limitation of any kind.
- the plurality of audio context types may be determined based on machine learning (e.g., AI-based model training), heuristic rules, or a combination thereof.
- the classification and segmentation of the audio content may be performed automatically, manually, or as an iterative process.
- manual and automatic classification and segmentation of the audio content may be performed jointly (i.e., in combination) in any suitable manner.
- the classification and segmentation of the audio content may be performed as an iterative process (between the classification and segmentation).
- the iterative process may comprise: performing a classification of the audio content; segmenting the audio content based on the classification; and refining the classification based on the segmented audio content for finer segmentation. For instance, in some example cases, it may be possible to start from a (relatively) raw classification for short clips and segment the audio content based on it. Then, based on the (raw) segments, the classification result may be further refined to provide a more accurate result for further finer segmentation.
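The iterative process described above might be sketched as follows; the run-merging rule and the toy refinement callback are illustrative assumptions, not methods defined in this disclosure:

```python
from typing import Callable, List, Tuple

Segment = Tuple[int, int, str]  # (start_frame, end_frame, context_label)

def segment_from_labels(labels: List[str]) -> List[Segment]:
    """Merge runs of identical frame labels into mono-context segments."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i, labels[start]))
            start = i
    return segments

def iterative_classify_segment(labels: List[str],
                               refine: Callable,
                               rounds: int = 2) -> List[Segment]:
    """Alternate segmentation and label refinement, as in the text above."""
    for _ in range(rounds):
        segments = segment_from_labels(labels)
        labels = refine(labels, segments)
    return segment_from_labels(labels)

def absorb_short(labels, segments):
    """Toy refinement: absorb one-frame segments into the previous context."""
    out = list(labels)
    for start, end, _ in segments:
        if end - start == 1 and start > 0:
            out[start] = out[start - 1]
    return out

raw = ["speech", "speech", "music", "speech", "speech", "music", "music"]
print(iterative_classify_segment(raw, absorb_short))
# [(0, 5, 'speech'), (5, 7, 'music')]
```

In practice the refinement step would re-run a classifier on the (longer, more reliable) segments rather than apply a fixed rule.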
- the classification of the audio content may be based on one or more of model-based training, feature analysis, rule-based classification, or manual annotation.
- any other suitable mechanism may be used for the classification of the audio content, as can be understood and appreciated by the skilled person.
- the method may further comprise determining (e.g., assigning) a respective objective audio quality evaluation algorithm associated with each audio context type for performing the respective objective audio quality evaluation.
- the objective audio quality evaluation algorithm may be (pre-)defined or (pre-)configured in any suitable manner, for example based on machine-learning-based or rule-based techniques.
- each objective audio quality evaluation may be configured for obtaining a subjective evaluation associated with the respective audio context through one or more objective metrics and related evaluation rules.
- the objective audio quality evaluation may be rule-based, model-based (e.g., AI training model-based), or a combination thereof.
- the method may further comprise determining an overall audio quality assessment for the audio content based on one or more results of the objective audio quality evaluation.
- the overall audio quality assessment may comprise at least one of: a summary or report of one or more objective audio quality evaluation results, a normalized or weighted score determined based on one or more objective audio quality evaluation results, or statistics thereof.
- any other suitable form of the overall audio quality assessment may be implemented, as can be understood and appreciated by the skilled person.
- the audio content may comprise at least one of: an audio file, an audio frame, an audio clip, or an audio stream.
- any other suitable form or format of the audio content may be used, depending on various implementations and/or use cases.
- a system is provided that is configured for performing objective audio quality assessment of audio content.
- the system may comprise respective suitable means (e.g., units or entities) that are configured to perform the objective audio quality assessment of the audio content.
- the system may comprise a context type definition module configured to determine a plurality of audio context types related to the audio content.
- the system may further comprise a context classification and segmentation module configured to classify and segment the audio content into a plurality of audio segments based on the determined audio context types, such that each audio segment is dominated by one audio context.
- the system may yet further comprise one or more objective audio quality evaluation modules configured to perform a respective objective audio quality evaluation on each respective audio segment based on the associated audio context.
- a computer program may include instructions that, when executed by a processor, cause the processor to carry out all steps of the example methods described throughout the present disclosure.
- a computer-readable storage medium may store the aforementioned computer program.
- Fig. 1 is a schematic illustration showing an example system architecture for performing objective audio content quality assessment according to embodiments of the present disclosure;
- Fig. 2 is a schematic illustration showing an example diagram of an objective audio content quality assessment flow according to embodiments of the present disclosure;
- Fig. 3 is a schematic illustration showing an example diagram of an objective audio content quality assessment system according to embodiments of the present disclosure;
- Fig. 4 is a schematic flowchart illustrating an example of a method of performing objective audio quality assessment of audio content according to embodiments of the present disclosure; and
- Fig. 5 is a schematic block diagram of an example apparatus for performing methods according to embodiments of the present disclosure.
- connecting elements such as solid or dashed lines or arrows
- the absence of any such connecting elements is not meant to imply that no connection, relationship, or association can exist.
- some connections, relationships, or associations between elements are not shown in the drawings so as not to obscure the present invention.
- a single connecting element is used to represent multiple connections, relationships or associations between elements.
- where a connecting element represents a communication of signals, data, or instructions, it should be understood by those skilled in the art that such element represents one or multiple signal paths, as may be needed, to effect the communication.
- objective audio quality assessment has become more and more important because of the explosive growth of content creation and sharing applications.
- the available objective evaluation techniques provide the assessment based on constant rules of attributes and metrics.
- the trade-offs between those attributes may vary with content type, and crucial attributes for certain content may not be important, or may even be meaningless, for some other content.
- objective metrics typically can only predict the subjective perception in a certain range, and should not be utilized if the underlying use case does not match.
- existing objective evaluation techniques cannot work for diverse audio content, especially those produced by non-professional users.
- people may evaluate audio quality based on flexible rules, which may include the attributes and also the preferred trade-off between them.
- the factors that influence the evaluation are related to the content and may also be referred to as ‘context’ throughout the present disclosure. For example, if the content is music, then reverberation may be considered welcome; but if it is speech, then people would like it to be dry and clear. Therefore, if the objective evaluation were to be conducted within a certain context, in which the attributes and balance points would be known and constant, the evaluation could be reliable and applicable to diverse user-generated content.
- the present disclosure generally aims at providing a (context-relevant) objective quality assessment mechanism for diverse audio content, such that it could help (e.g., content creators, professionals, and amateurs alike) to objectively measure the quality of the created content that will be perceived subjectively. It is generally based on the fact that the attributes that influence the subjective audio quality evaluation vary with some factors, which may be designated as context types throughout the present disclosure. For instance, in a music type context, an attribute such as reverb is a sign of good quality, whereas, in a speech type context, the same reverb attribute is a sign of poor quality.
- existing objective evaluation techniques may usually be based on constant attributes and thresholds.
- the present disclosure generally proposes to classify the audio content into contexts before objective evaluation, and in each context, a unique set of key attributes and thresholds which is close to the subjective evaluation may be used, so that the subjective evaluation can be predicted more reliably by objective metrics.
- Fig. 1 is a schematic illustration showing an example system architecture 100 for performing objective audio content quality assessment according to embodiments of the present disclosure. It may be worthwhile to note that, as will become apparent in the description below, this system architecture is merely one possible example for illustrative purposes only and should not be understood to constitute a limitation of any kind, as can also be understood and appreciated by the skilled person.
- audio content 110 (e.g., in the form of an audio file, an audio clip, an (online) audio stream, or the like) is provided as input.
- the audio content 110 is applied to a (context-relevant) audio content quality assessment 120, which will be discussed in more detail below.
- an objective quality evaluation 130 is performed.
- a respective evaluation/assessment result (e.g., in the form of an overall score, scores per segment, a summary/report, statistics, figures, or the like) may be obtained, such that, depending on various implementations and/or circumstances, further suitable processing (e.g., to further improve audio quality) may be performed based on such evaluation/assessment result.
- FIG. 2 schematically illustrates a (more detailed) flow diagram 200 of an example objective audio content quality assessment according to embodiments of the present disclosure. Similar to Fig. 1, it is to be noted that diagram 200 as shown in Fig. 2 merely represents one possible implementation and any other suitable implementation may of course be feasible as well. Moreover, identical or like reference numbers (or blocks) in Fig. 2 may, unless indicated otherwise, indicate identical or like elements in Fig. 1, such that repeated description thereof may be omitted for reasons of conciseness.
- FIG. 2 schematically illustrates an example basic flowchart 200 of the proposed context-relevant objective audio content quality assessment, where the content type is defined based on the use case and available training or sample dataset (block 220); then the input audio content (block 210) is classified and segmented into mono-context segments (block 230) and evaluated by corresponding objective assessment methods or algorithms respectively (block 240), in order to provide the final overall objective assessment (block 250).
- the context type definition block 220 may be configured to define suitable context types for subsequent audio content classification.
- This context type definition may be based on any suitable aspect of the content, including (but not limited to) the target of the audio content, the ambience, background, surrounding or environment of the audio content, the recording format of the audio content, the channel layout (e.g., mono, stereo, 5.1, 7.1, etc.) of the content, the speaker genders (e.g., male, female, non-binary, etc.), or a combination thereof.
- any other suitable context type may be defined or determined depending on various implementations and/or requirements, as can also be understood and appreciated by the skilled person.
- the basic principle here may be considered to classify the audio based on the importance and balance of attributes that may be considered to affect the subjective evaluation, so that the objective evaluation methods/algorithms can be performed to predict the subjective assessment reliably.
- the context type definition can be conducted manually, automatically, or as a combination thereof.
- for example, even if the testing contents are mono speech, the channel number may or should be considered as well, since the spatial information of stereo data may generally be considered to be of importance in subjective evaluation.
- the context definition may be a hierarchical process.
- the base classes (of the context types) may be defined to correspond to the key attributes, and more detailed attributes may then be considered for classification in the next level(s).
- any other suitable arrangement such as parallel, tree-like, or even star-like arrangement, of the context type definition may be feasible, depending on various implementations and/or circumstances.
- each context type may be related to a corresponding objective evaluation method/algorithm with certain corresponding (e.g., pre-determined or preconfigured) parameters or metric thresholds.
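Such a mapping from context types to evaluation methods with their own pre-configured parameters could be sketched as a simple dispatch table; the evaluators, thresholds, and toy metric below are illustrative assumptions, not methods defined in this disclosure:

```python
import numpy as np

def rms_level(x: np.ndarray) -> float:
    """Root-mean-square level, used here as a toy stand-in metric."""
    return float(np.sqrt(np.mean(x ** 2)))

# Same metric family, but context-specific thresholds, as suggested above.
def eval_speech(seg: np.ndarray) -> float:
    return min(1.0, rms_level(seg) / 0.1)   # stricter level target for speech

def eval_music(seg: np.ndarray) -> float:
    return min(1.0, rms_level(seg) / 0.05)  # gentler target for music

EVALUATORS = {"speech": eval_speech, "music": eval_music}

def evaluate_segment(context: str, seg: np.ndarray) -> float:
    return EVALUATORS[context](seg)

seg = 0.08 * np.ones(100)
print(round(evaluate_segment("speech", seg), 2))  # 0.8
print(round(evaluate_segment("music", seg), 2))   # 1.0
```

The same segment thus receives different scores depending on its context, which is the core idea of the context-relevant evaluation described here.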
- the testing audio content may switch between several contexts, especially for streaming media. Therefore, it cannot or should not be easily evaluated by only one objective evaluation system.
- the context classification and segmentation block 230 may be configured to cut the audio content into segments each of which may only contain one (dominant) context, so that the evaluation can be subsequently conducted for each audio segment.
- the segmentation itself may be based on the corresponding context classification.
- the classification may be considered to be reliable only if the input audio signal is long enough. Therefore, in some possible cases, there may be some iteration between respective segmentation and classification. For instance, it may be possible to start from a (relatively) raw classification for example for short audio clips, and segment the audio content based on such raw classification. Then, based on the obtained (raw) segments, the classification result may be further refined to provide more accurate results for subsequent finer segmentation, in an iterative manner. Since the context classification and segmentation processes could be performed iteratively, the sequence (as to whether the context classification or the segmentation starts first) generally does not really matter. Thus, in the present disclosure, the term context classification and segmentation block (or the like) may also be referred to as segmentation and context classification, such that an analogous or similar description illustrated above likewise applies.
- this (possibly iterative optimization) process may be model-based, rule-based, or implemented in any other suitable manner.
- the context classification process itself may also be realized as rule-based or model-based, or even conducted manually.
- the actual implementation of the classification process may be considered to be related to the definition of context type illustrated earlier. For instance, if the context is defined based on the target signal, such as speech, music, or ambient sound, then the context classification may be implemented as simply as a speech/music/ambience classifier (or in any other suitable manner).
- the context classification may be implemented based on AI-based model training, feature analysis, rule-based classification, manual annotation, or the like, depending on various use cases and/or circumstances.
- the objective quality evaluation block 240 may be configured to contain one or more objective evaluation (sub-)systems of certain method(s), algorithm(s), objective metric(s), threshold(s), weight(s), and evaluation rule(s).
- Each (sub-)system may be configured to be mapped with a respective context, and more particularly, should be sensitive to the respective key attributes in order to effectively reflect the subjective preference for such content/context.
- the respective attributes and corresponding metrics would be different, so each necessary/associated key attribute should be included in the respective objective method/algorithm.
- if the contexts are different types of speech, such as spontaneous conversation and lecture, then they may use the same set of metrics, but the parameters should be adjusted to match the respective subjective assessments.
- in addition to the quality of the speech sound (e.g., loudness, etc.), intelligibility or detectability of the speech may need to be considered (e.g., prioritized).
- the (overall) quality assessment block 250 may be configured to summarize the objective results of all (or a part of) the segments, thereby providing the final assessment for the whole audio content.
- such (overall) quality assessment may be implemented in any suitable form.
- a possible assessment may be as simple as giving a respective score for each segment and obtaining or calculating the weighted means thereof as the final result. But in some other possible cases, it may be considered more practical to show more detailed information, such as the length of each segment, the performance of each attribute, or the like.
- the evaluation may also be used for example as a diagnostic suggestion or cost function for further analysis or to guide the next iteration in some possible implementations.
- any other suitable implementation for the (overall) quality assessment may be feasible as well. This is not to be limited in the present disclosure.
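As one hedged sketch of the "weighted means" form of overall assessment mentioned above (weighting by segment length is an assumption for illustration; any other weighting would fit the same shape):

```python
def overall_score(segments):
    """Length-weighted mean of per-segment objective scores, one simple
    form of the final assessment for the whole audio content."""
    total = sum(length for length, _ in segments)
    return sum(length * score for length, score in segments) / total

# (segment_length_seconds, objective_score) per mono-context segment
segments = [(10.0, 0.9), (5.0, 0.6), (5.0, 0.8)]
print(overall_score(segments))  # (9.0 + 3.0 + 4.0) / 20.0 = 0.8
```

A richer report would keep the per-segment scores and lengths alongside this single number, as the text suggests.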
- Fig. 3 schematically illustrates an example system diagram 300 of an objective audio content quality assessment according to embodiments of the present disclosure.
- identical or like reference numbers (or blocks) in Fig. 3 may, unless indicated otherwise, indicate identical or like elements in Fig. 1 or 2, such that repeated description thereof may be omitted for reasons of conciseness.
- while Figs. 2 and 3 may appear similar, it may be worthwhile to note that Fig. 2 may be seen to depict the present disclosure more from a functional flow perspective, whereas Fig. 3 may be seen to depict the present disclosure more from a system architecture perspective.
- a context type definition module 320 configured to define or determine the context types and optionally to assign each context to a matched objective assessment method/ algorithm
- a context classification and segmentation module 330 configured to cut the input audio content 310 into segments in which the content (of each segment) is dominated by one context
- an objective quality evaluation module 340 (which itself may comprise or be implemented as one or more sub-modules 340-1, 340-2, ..., 340-N) configured to assess the quality of each segment according to its respective context; and
- an overall assessment module 350 configured to provide the final assessment 360 based on the objective evaluation of the segments (or a part thereof).
- the context type definition module 320 may be configured to define the set of context types, for example in accordance with the use case(s), such as broadcast services, TV shows, vlogs, etc. For instance, if the underlying use case is a vlog (which is generally considered to represent (non-professional) user-generated content), then the possible context types may be defined to comprise at least speech, music, and ambient sound. In some possible examples, the context type definition module 320 may be further configured to assign one (or more) respective matched objective assessment method(s)/algorithm(s) to each type, for example based on suitable machine learning or (e.g., pre-configured heuristic) rules, or a combination thereof.
- the context type(s) may be defined according to subjective experiments on possible training content set(s), so that the attributes that may generally be considered to influence the subjective evaluation could be clarified, and a corresponding objective evaluation module could be designed in order to predict the attributes and to provide the evaluation based on subjective trade-offs between those attributes.
- the context type may also be defined according to the use case, so that the content could be discriminated effectively. For example, if the user cares more about speech, music, or animal sound quality, then the context types may at least contain speech, music, and animal sound.
- the context types may be in a hierarchical arrangement, a parallel arrangement, a tree arrangement, or in any other suitable arrangement. For instance, following the above example, for the music contents, it may be possible to further classify them based on genres or instruments, etc.
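Such a hierarchical (tree-like) arrangement of context types might be sketched as a simple nested structure; every type and sub-type name below is an illustrative assumption, not a definition from this disclosure:

```python
# Base context classes at the top level, finer sub-contexts beneath,
# e.g. music further split by genre as suggested above.
CONTEXT_TYPES = {
    "speech": ["spontaneous_conversation", "lecture"],
    "music": ["pop", "classical"],
    "ambient": ["indoor", "outdoor"],
}

def leaf_contexts(tree: dict) -> list:
    """Enumerate every leaf context, prefixed by its base class."""
    return [f"{base}/{sub}" for base, subs in tree.items() for sub in subs]

print(leaf_contexts(CONTEXT_TYPES))
```

A first-level classifier could then pick the base class, and a per-class classifier the sub-context, matching the level-by-level classification described earlier.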
- the context classification and segmentation module 330 may be configured to cut the content into segments, and each segment may be dominated by only one context, so that the objective evaluation can be conducted on segments of mono context to improve the overall reliability.
- each audio content may contain one or more contexts.
- audio content may be the concatenation of dialogs and music segments, so it is generally not feasible to classify the whole content appropriately if the audio content is not segmented.
- the segmentation and classification may be conducted separately, iteratively or in other orders through algorithms of heuristic rules or machine learning models, or even be conducted manually.
- the context classification and segmentation process may be configured to operate jointly to fulfill the overall task. In such cases, there may be some iteration between the two parts, since they may depend on each other’s result.
- the classification process itself may be realized as a model-based or rule-based classification (or in any other suitable manner), depending on for example the context definition and/or use case.
- the context classification may include both frame-based and file-based methods, for example targeting different accuracy levels.
- the segmentation may also involve several methods/algorithms, including for example initial raw segmentation and one or more rounds of (gradually) refined segmentations based on the classification results.
- the segmentation may be seen to work as a switch between the contexts based on the classification results of each frame or short clip.
- the objective quality evaluation process may be achieved by one or more objective quality evaluation (sub-)modules 340-1, 340-2, ..., 340-N (where the number N may be determined in accordance with the preceding segmentation process), such that objective quality evaluation may be conducted in each segment to predict the subjective evaluation through objective metrics and related rules.
- Any one or more of the objective quality evaluations may be based on heuristic rules and/or machine learning models, and can work with or without a reference audio signal.
- any one or more of the objective quality evaluations may be configured to provide a quality score, statistics of objective metrics, or output in any other suitable form.
- the objective metrics can predict the attributes and their trade-offs of the subjective assessment, in order to evaluate or predict the audio quality for the content.
- any one or more of the objective quality evaluation metrics may be based on auditory models, signal time-frequency features or statistical models of the audio signal, depending on various use cases and/or requirements.
- the overall quality assessment module 350 may be configured to summarize the objective results of all (or a part of) the segments to provide a final assessment 360 for the whole content. Depending on various implementations and/or use cases, it may be the context evaluation results for one or more segments, the statistics of the metrics, a normalized overall score, or in any other suitable form.
- system architecture 300 as shown in Fig. 3 merely represents one possible implementation thereof, and any other suitable implementation may of course be feasible as well, as can also be understood and appreciated by the skilled person. This is not to be limited in the present disclosure.
- the present disclosure generally seeks to propose a system (and a corresponding method) that may be configured to define various context types and assign a respective objective assessment method to each context type.
- the system may be further configured to receive input content and cut the content into segments.
- the system may be configured to identify a dominant context type for each segment.
- the system may then be configured to evaluate each segment using the objective assessment method corresponding to the dominant context type of that segment.
- the system may be configured to provide an overall assessment of said input content.
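The define → segment → classify → evaluate → summarize flow above can be sketched as follows; the segmenter, classifier, and per-context evaluators here are hypothetical stand-ins for whatever algorithms an actual implementation would use:

```python
def assess(content, context_methods, classify, segment):
    """Evaluate each segment with the objective method assigned to its
    dominant context, then average the scores into an overall assessment."""
    per_segment = []
    for seg in segment(content):
        context = classify(seg)               # dominant context of this segment
        evaluate = context_methods[context]   # method assigned to that context
        per_segment.append((context, evaluate(seg)))
    scores = [s for _, s in per_segment]
    return per_segment, sum(scores) / len(scores)

# Toy stand-ins: two context types, each with a fixed-score "evaluator".
methods = {"speech": lambda seg: 0.9, "music": lambda seg: 0.7}
per_seg, overall = assess(
    ["hello world", "la la la"],
    methods,
    classify=lambda seg: "music" if "la" in seg else "speech",
    segment=lambda c: c,  # content is assumed to be pre-cut into segments here
)
```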
- Fig. 4 is a schematic flowchart illustrating an example of a method 400 of performing objective audio quality assessment of audio content according to embodiments of the present disclosure.
- the method 400 as shown in Fig. 4 may start at step S410 by determining a plurality of audio context types related to the audio content. Subsequently, in step S420 the method 400 may comprise classifying and segmenting the audio content into a plurality of segments based on the determined audio context types, such that each audio segment is dominated by one audio context. The method 400 may yet further comprise at step S430 performing a respective objective audio quality evaluation on each audio segment based on the associated audio context. As will be understood and appreciated by the skilled person, the objective audio quality evaluation may be achieved by adopting any suitable (existing or new) objective audio quality evaluation method/algorithm, depending on various implementations and/or requirements.
- the proposed method may generally provide an efficient yet flexible manner for performing objective audio quality assessment for audio content, which may be considered context-relevant or context-aware.
- since the audio content is cut into segments where each segment is dominated by only one context, the objective evaluation can subsequently be conducted on segments of mono context, thereby greatly improving the evaluation reliability.
- key attributes that may influence the subjective evaluation can be clarified, such that objective metrics can be used to predict the attributes and the trade-offs of the subjective assessment, in order to evaluate or predict the audio quality for the content.
- Fig. 5 generally shows an example of such apparatus 500.
- the apparatus 500 comprises a processor 510 and a memory 520 coupled to the processor 510.
- the memory 520 may store instructions for the processor 510.
- the processor 510 may also receive, among others, suitable input data 530 (e.g., audio signal, audio stream, audio clip, audio file, etc.), depending on various use cases and/or implementations.
- the processor 510 may be adapted to carry out the methods/techniques (e.g., the method 400 as illustrated above with reference to Fig. 4) described throughout the present disclosure and to generate correspondingly output data 540 (e.g., an overall quality assessment), depending on various use cases and/or implementations.
- a computing device implementing the techniques described above can have the following example architecture.
- Other architectures are possible, including architectures with more or fewer components.
- the example architecture includes one or more processors (e.g., dual-core Intel® Xeon® Processors), one or more output devices (e.g., LCD), one or more network interfaces, one or more input devices (e.g., mouse, keyboard, touch-sensitive display) and one or more computer-readable mediums (e.g., RAM, ROM, SDRAM, hard disk, optical disk, flash memory, etc.).
- These components can exchange communications and data over one or more communication channels (e.g., buses), which can utilize various hardware and software for facilitating the transfer of data and control signals between components.
- computer-readable medium refers to a medium that participates in providing instructions to processor for execution, including without limitation, non-volatile media (e.g., optical or magnetic disks), volatile media (e.g., memory) and transmission media.
- Transmission media includes, without limitation, coaxial cables, copper wire and fiber optics.
- Computer-readable medium can further include operating system (e.g., a Linux® operating system), network communication module, audio interface manager, audio processing manager and live content distributor.
- Operating system can be multi-user, multiprocessing, multitasking, multithreading, real time, etc.
- Operating system performs basic tasks, including but not limited to: recognizing input from and providing output to network interfaces and/or devices; keeping track of and managing files and directories on computer-readable mediums (e.g., memory or a storage device); controlling peripheral devices; and managing traffic on the one or more communication channels.
- Network communications module includes various components for establishing and maintaining network connections (e.g., software for implementing communication protocols, such as TCP/IP, HTTP, etc.).
- Architecture can be implemented in a parallel processing or peer-to-peer infrastructure or on a single device with one or more processors.
- Software can include multiple software components or can be a single body of code.
- the described features can be implemented advantageously in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device.
- a computer program is a set of instructions that can be used, directly or indirectly, in a computer to perform a certain activity or bring about a certain result.
- a computer program can be written in any form of programming language (e.g., Objective-C, Java), including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, a browser-based web application, or other unit suitable for use in a computing environment.
- Suitable processors for the execution of a program of instructions include, by way of example, both general and special purpose microprocessors, and the sole processor or one of multiple processors or cores, of any kind of computer.
- a processor will receive instructions and data from a read-only memory or a random access memory or both.
- the essential elements of a computer are a processor for executing instructions and one or more memories for storing instructions and data.
- a computer will also include, or be operatively coupled to communicate with, one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks.
- Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
- the processor and the memory can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits).
- the features can be implemented on a computer having a display device such as a CRT (cathode ray tube) or LCD (liquid crystal display) monitor or a retina display device for displaying information to the user.
- the computer can have a touch surface input device (e.g., a touch screen) or a keyboard and a pointing device such as a mouse or a trackball by which the user can provide input to the computer.
- the computer can have a voice input device for receiving voice commands from the user.
- the features can be implemented in a computer system that includes a back-end component, such as a data server, or that includes a middleware component, such as an application server or an Internet server, or that includes a front-end component, such as a client computer having a graphical user interface or an Internet browser, or any combination of them.
- the components of the system can be connected by any form or medium of digital data communication such as a communication network. Examples of communication networks include, e.g., a LAN, a WAN, and the computers and networks forming the Internet.
- the computing system can include clients and servers.
- a client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
- a server transmits data (e.g., an HTML page) to a client device (e.g., for purposes of displaying data to and receiving user input from a user interacting with the client device). Data generated at the client device (e.g., a result of the user interaction) can be received from the client device at the server.
- a system of one or more computers can be configured to perform particular actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions.
- One or more computer programs can be configured to perform particular actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
- any one of the terms 'comprising', 'comprised of' or 'which comprises' is an open term that means including at least the elements/features that follow, but not excluding others.
- the term 'comprising', when used in the claims, should not be interpreted as being limitative to the means or elements or steps listed thereafter.
- the scope of the expression 'a device comprising A and B' should not be limited to devices consisting only of elements A and B.
- Any one of the terms 'including', 'which includes' or 'that includes' as used herein is also an open term that also means including at least the elements/features that follow the term, but not excluding others. Thus, 'including' is synonymous with and means 'comprising'.
- EEE1 A method and system of objective audio quality assessment for diverse audio content that can evaluate the quality of a sound file or stream for content creation, analysis, transformation and playback systems, both in off-line and real-time usage, and that includes the modules of: a. A context type definition module to define the context types and assign each context to a matched objective assessment method; b. A segmentation and classification module to cut the content into segments in which the content is dominated by one context; c. An objective quality evaluation module to assess the quality of each segment according to its context; and d. An overall assessment module to provide the final assessment based on the objective evaluation of each segment.
- EEE2 As in EEE1, where the objective audio quality assessment can work with the following processing options: audio content is input as whole file, clips or online stream.
- EEE3 As in EEE1, where the audio quality assessment can be based on heuristic rules, machine learning based methods like decision tree, Adaboost, GMM, SVM, HMM, DNN, CNN or RNN, or the hybrid of them.
- EEE4 As in EEE1, where the audio quality assessment includes a method to define the type of contexts for context classification.
- EEE5 As in EEE1, where the context type definition can be based on target content, background sound, recording format and other factors, and can be arranged in hierarchical, parallel or tree arrangements.
- EEE6 As in EEE1 or EEE5, where the context type definition can be conducted manually or automatically based on heuristic rules or machine learning models.
- EEE7 As in EEE1, where the segmentation and classification can be conducted separately, iteratively or in other orders through algorithms of heuristic rules or machine learning models, or be conducted manually.
- EEE8 As in EEE1, where each of the objective quality evaluation metrics can be based on auditory models, signal time-frequency features or statistical models of the audio signal.
- EEE9 As in EEE1, where each of the objective quality evaluations can be based on heuristic rules or machine learning models, and can work with or without a reference audio signal.
- EEE10 As in EEE1, where each of the objective quality evaluations can provide a quality score or statistics of objective metrics.
- EEE11 As in EEE1, where the overall assessment can be the statistics of the objective evaluation results, or part of the evaluation results from some or all of the segments.
Abstract
Described herein is a method of performing objective audio quality assessment of audio content. In particular, the method may comprise determining a plurality of audio context types related to the audio content. The method may further comprise classifying and segmenting the audio content into a plurality of segments based on the determined audio context types, such that each audio segment is dominated by one audio context. The method may yet further comprise performing a respective objective audio quality evaluation on each audio segment based on the associated audio context.
Description
OBJECTIVE AUDIO CONTENT QUALITY ASSESSMENT
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of priority from PCT Application No. PCT/CN2024/091074 filed on May 2, 2024, and U.S. Provisional Application No. 63/655,839 filed on June 4, 2024, each of which is incorporated by reference herein in its entirety.
TECHNICAL FIELD
The present disclosure is directed to the general area of audio processing, and more particularly, to methods, apparatus, and systems for performing objective audio content quality assessment.
BACKGROUND
Recently, objective audio quality assessment has become increasingly important because of the explosive growth of content creation and sharing applications. To date, many objective audio quality assessment methods and systems exist that are applicable to assigned audio content. In general, these techniques can predict the subjective assessment reliably and efficiently if the content falls within the pre-defined range. However, the reliability may decrease significantly if the content type does not match.
Such limitation is typically caused by pre-set attributes and corresponding objective metrics. In general, objective evaluation methods provide an assessment based on constant rules of attributes and metrics, but the trade-offs between the attributes may vary with content type, and attributes that are crucial for certain content may not be important, or even meaningless, for other content. Moreover, the objective metrics can usually only predict the subjective perception in a certain range, and thus should not be utilized if the use case does not match. As a result, the objective evaluation fails to work for diverse audio content, especially content produced by non-professional users (e.g., content of vlogs, etc.).
On the contrary, people evaluate audio quality based on flexible rules, which may include the attributes and also the preferred trade-offs between them. The factors that influence the evaluation are related to the content and may also be referred to as 'context' throughout the present disclosure. For example, if the content is music, then reverberation may be considered welcome; but if it is speech, then people would like it to be dry and clear. Therefore, if the objective evaluation were to be conducted within a certain context, in which the attributes and balance points would be known and constant, the evaluation could be reliable and applicable to diverse user-generated content.
In view thereof, there is a need for improved techniques or mechanisms of performing objective audio quality assessment for audio content.
SUMMARY
In view of the above, the present disclosure generally provides a method of performing objective audio quality assessment of audio content, a corresponding system, an apparatus, a program, as well as a computer-readable storage media, having the features of the respective independent claims.
According to a first aspect of the present disclosure, a method of performing objective audio quality assessment of audio content is provided.
In particular, the method may comprise determining a plurality of audio context types (e.g., music, speech, etc.) related to the audio content. The audio context types may be pre-defined or pre-determined in any suitable manner, for example depending on respective use cases or the like. The method may further comprise classifying and segmenting the audio content into a plurality of segments based on the determined audio context types, such that each audio segment is dominated by one audio context. Finally, the method may also comprise performing a respective objective audio quality evaluation on each audio segment based on the associated audio context. As will be understood and appreciated by the skilled person, any suitable objective audio quality evaluation may be performed depending on various implementations and/or requirements, which may include (but is certainly not limited thereto) evaluation of any suitable metric/attribute such as signal to noise ratio, total harmonic distortion, or the like. This is not to be limited in the present disclosure.
As can be understood and appreciated by the skilled person, the audio content may be of any suitable type/format (e.g., a real-time (online) audio stream, pre-recorded audio, etc.), originate from any suitable source (e.g., produced by a professional or non-professional user, etc.) and be sunk to any suitable target (e.g., an end-user). Accordingly, generally speaking, the techniques proposed throughout the present disclosure may be performed or applied by any suitable person and in any suitable scenario/environment where audio quality evaluation/assessment may be needed. For instance, such objective audio quality assessment techniques may be used by a (professional or non-professional) content creator, a consumer, a content distributor (e.g., a broadcast service provider), or even a (manual or automated) audio processor who may, based on the quality evaluation, perform further (audio) processing. This is not to be limited in the present disclosure.
Configured as described above, the proposed method may generally provide an efficient yet flexible manner for performing objective audio quality assessment of audio content, which may be considered context-relevant or context-aware. Particularly, since the audio content is cut into segments where each segment is only dominated by one context, the objective evaluation can subsequently be conducted on segments of mono context, thereby greatly improving the evaluation reliability. Further, also thanks to the consistency of the context in each segment, key attributes that may influence the subjective evaluation can be clarified, such that objective metrics can be used to predict the attributes and the trade-offs of the subjective assessment, in order to evaluate or predict the audio quality for the content.
In some example implementations, the plurality of audio context types may be determined based on a use case relating to the audio content. As can be understood and appreciated by the skilled person, the use case(s) may be pre-determined or pre-defined depending on various implementations and/or requirements. For instance, the use case (scenario) may comprise an indoor use case, an outdoor use case, a vlog, a broadcast service (e.g., podcast), or the like. The use case may also be defined in a more coarse or detailed manner, or in a hierarchical, parallel, or tree arrangement, if deemed necessary. For instance, the outdoor use case may be further classified/categorized into an outdoor sports event use case, an outdoor music event use case, or the like. Accordingly, the plurality of audio context types may also be determined in an analogous manner. This is not to be limited in the present disclosure.
In some example implementations, the plurality of audio context types may comprise at least one of: speech, music, or ambient sound. Certainly, any other suitable context type may be defined or determined depending on various implementations and/or requirements, as can also be understood and appreciated by the skilled person. This is not to be limited in the present disclosure.
In some example implementations, the plurality of audio context types may be determined
based on at least one of: an ambience or background of the audio content, a channel layout of the audio content (e.g., stereo or 5.1), a recording format of the audio content, or a speaker gender (e.g., male or female). Similarly, also here the determination of the plurality of audio context types should not be understood to constitute a limitation of any kind, and any other criterion may be used to determine or define the suitable audio context types, as illustrated above.
In some example implementations, the plurality of audio context types may be in a hierarchical arrangement, a parallel arrangement, a tree arrangement, or the like. Similar to the above-mentioned definition/arrangement for the use case, also the plurality of audio context types may be defined/arranged in any suitable form. As can be understood and appreciated by the skilled person, the above arrangements are merely some possible illustrative examples and thus should not be understood to constitute a limitation of any kind.
In some example implementations, the plurality of audio context types may be determined based on machine learning (e.g., AI-based model training), heuristic rules, or a combination thereof.
In some example implementations, the classification and segmentation of the audio content may be performed automatically, manually, or as an iterative process. As can also be understood and appreciated by the skilled person, in some possible cases, manual and automatic classification and segmentation of the audio content may be performed jointly (i.e., in combination) in any suitable manner.
In some example implementations, the classification and segmentation of the audio content may be performed as an iterative process (between the classification and segmentation). Particularly, the iterative process may comprise: performing a classification of the audio content; segmenting the audio content based on the classification; and refining the classification based on the segmented audio content for finer segmentation. For instance, in some example cases, it may be possible to start from a (relatively) raw classification for short clips and segment the audio content based on it. Then, based on the (raw) segments, the classification result may be further refined to provide a more accurate result for further finer segmentation. As such, the result of the classification and segmentation of the audio content can be improved, which may in turn further improve the final result of the objective audio quality evaluation of the audio content.
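One hypothetical way to realize such a refinement round (the run-merging scheme and the `min_len` threshold below are illustrative assumptions, not a method prescribed by the disclosure) is to merge consecutive identical frame labels into runs and then absorb very short runs into their predecessor, removing spurious context switches; repeated rounds with stricter settings would then yield progressively finer segmentation:

```python
def merge_runs(labels):
    """Group consecutive identical frame labels into [label, length] runs."""
    runs = []
    for lab in labels:
        if runs and runs[-1][0] == lab:
            runs[-1][1] += 1
        else:
            runs.append([lab, 1])
    return runs

def refine(runs, min_len=3):
    """Absorb runs shorter than min_len into the preceding run, so that
    each surviving segment is dominated by a single context."""
    out = []
    for lab, n in runs:
        if out and (n < min_len or out[-1][0] == lab):
            out[-1][1] += n  # absorb into (or extend) the previous segment
        else:
            out.append([lab, n])
    return out

frames = list("SSSSMSSMMMMMM")  # per-frame labels: S = speech, M = music
runs = merge_runs(frames)       # [['S', 4], ['M', 1], ['S', 2], ['M', 6]]
segments = refine(runs)         # [['S', 7], ['M', 6]]
```

Here the single spurious music frame is absorbed into the surrounding speech run, leaving two clean mono-context segments.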
In some example implementations, the classification of the audio content may be based on one or more of model-based training, feature analysis, rule-based classification, or manual annotation. Of course, any other suitable mechanism may be used for the classification of the audio content, as can be understood and appreciated by the skilled person.
In some example implementations, the method may further comprise determining (e.g., assigning) a respective objective audio quality evaluation algorithm associated with each audio context type for performing the respective objective audio quality evaluation. The objective audio quality evaluation algorithm may for example be (pre-)defined or (pre-)configured in any suitable manner, for example based on machine learning-based or rule-based techniques.
In some example implementations, each objective audio quality evaluation may be configured for obtaining a subjective evaluation associated with the respective audio context through one or more objective metrics and related evaluation rules.
In some example implementations, the objective audio quality evaluation may be rule-based, model-based (e.g., AI training model-based), or a combination thereof.
In some example implementations, the method may further comprise determining an overall audio quality assessment for the audio content based on one or more results of the objective audio quality evaluation.
In some example implementations, the overall audio quality assessment may comprise at least one of: a summary or report of one or more objective audio quality evaluation results, a normalized or weighted score determined based on one or more objective audio quality evaluation results, or statistics thereof. Of course, any other suitable form of the overall audio quality assessment may be implemented, as can be understood and appreciated by the skilled person.
In some example implementations, the audio content may comprise at least one of: an audio file, an audio frame, an audio clip, or an audio stream. Of course, as indicated earlier, any other suitable form or format of the audio content may be used, depending on various implementations and/or use cases.
According to a second aspect of the present disclosure, a system configured for performing objective audio quality assessment of audio content is provided. The system may comprise respective suitable means (e.g., units or entities) that are configured to perform the objective
audio quality assessment of the audio content. In some possible (non-limiting) examples, the system may comprise a context type definition module configured to determine a plurality of audio context types related to the audio content. The system may further comprise a context classification and segmentation module configured to classify and segment the audio content into a plurality of audio segments based on the determined audio context types, such that each audio segment is dominated by one audio context. The system may yet further comprise one or more objective audio quality evaluation modules configured to perform a respective objective audio quality evaluation on each respective audio segment based on the associated audio context.
According to a third aspect of the present disclosure, an apparatus including a processor and a memory coupled to the processor is provided. The processor may be adapted to cause the apparatus to carry out all steps according to any of the example methods described in the foregoing aspect.
According to a fourth aspect of the present disclosure, a computer program is provided. The computer program may include instructions that, when executed by a processor, cause the processor to carry out all steps of the example methods described throughout the present disclosure.
According to a fifth aspect of the present disclosure, a computer-readable storage medium is provided. The computer-readable storage medium may store the aforementioned computer program.
It will be appreciated that apparatus features and method steps may be interchanged in many ways. In particular, the details of the disclosed method(s) can be realized by the corresponding apparatus (or system), and vice versa, as the skilled person will appreciate. Moreover, any of the above statements made with respect to the method(s) are understood to likewise apply to the corresponding apparatus (or system), and vice versa.
BRIEF DESCRIPTION OF DRAWINGS
Example embodiments of the present disclosure are explained below with reference to the accompanying drawings, wherein like reference numbers indicate like or similar elements, and wherein
Fig. 1 is a schematic illustration showing an example system architecture for performing objective audio content quality assessment according to embodiments of the present disclosure,
Fig. 2 is a schematic illustration showing an example diagram of an objective audio content quality assessment flow according to embodiments of the present disclosure,
Fig. 3 is a schematic illustration showing an example diagram of an objective audio content quality assessment system according to embodiments of the present disclosure,
Fig. 4 is a schematic flowchart illustrating an example of a method of performing objective audio quality assessment of audio content according to embodiments of the present disclosure, and
Fig. 5 is a schematic block diagram of an example apparatus for performing methods according to embodiments of the present disclosure.
DETAILED DESCRIPTION
As indicated above, identical or like reference numbers in the present disclosure may, unless indicated otherwise, indicate identical or like elements, such that repeated description thereof may be omitted for reasons of conciseness.
Particularly, the Figures (Figs.) and the following description relate to preferred embodiments by way of illustration only. It should be noted that from the following discussion, alternative embodiments of the structures and methods disclosed herein will be readily recognized as viable alternatives that may be employed without departing from the principles of what is claimed.
Reference will now be made in detail to several embodiments, examples of which are illustrated in the accompanying figures. It is noted that wherever practicable similar or like reference numbers may be used in the figures and may indicate similar or like functionality. The figures depict embodiments of the disclosed system (or method) for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described herein.
Furthermore, in the figures, where connecting elements, such as solid or dashed lines or
arrows, are used to illustrate a connection, relationship, or association between or among two or more other schematic elements, the absence of any such connecting elements is not meant to imply that no connection, relationship, or association can exist. In other words, some connections, relationships, or associations between elements are not shown in the drawings so as not to obscure the present invention. In addition, for ease of illustration, a single connecting element is used to represent multiple connections, relationships or associations between elements. For example, where a connecting element represents a communication of signals, data, or instructions, it should be understood by those skilled in the art that such element represents one or multiple signal paths, as may be needed, to effect the communication.
As mentioned above, objective audio quality assessment has become increasingly important because of the explosive growth of content creation and sharing applications. To date, many objective audio quality assessment methods and systems exist that are applicable to assigned audio content. In general, the available objective evaluation techniques provide the assessment based on constant rules of attributes and metrics. However, the trade-offs between those attributes may vary with content type, and crucial attributes for certain content may not be important, or even meaningless, for other possible content. Moreover, objective metrics typically can only predict the subjective perception in a certain range, and should not be utilized if the underlying use case does not match. Thus, existing objective evaluation techniques cannot work for diverse audio content, especially content produced by non-professional users.
On the contrary, people may evaluate audio quality based on flexible rules, which may include the attributes and also the preferred trade-off between them. The factors that influence the evaluation are related to the content and may also be referred to as ‘context’ throughout the present disclosure. For example, if the content is music, then reverberation may be considered welcome; but if it is speech, then people would like it to be dry and clear. Therefore, if the objective evaluation were to be conducted within a certain context, in which the attributes and balance points would be known and constant, the evaluation could be reliable and applicable to diverse user-generated content.
Therefore, in a broad sense, the present disclosure generally aims at providing a (context-relevant) objective quality assessment mechanism for diverse audio content, such that it could help (e.g., content creators, professionals, and amateurs alike) to objectively measure the
quality of the created content that will be perceived subjectively. It is generally based on the fact that the attributes that influence the subjective audio quality evaluation vary with some factors, which may be designated as context types throughout the present disclosure. For instance, in a music type context, an attribute such as reverb is a sign of good quality, whereas, in a speech type context, the same reverb attribute is a sign of poor quality. However, as noted above, existing objective evaluation techniques may usually be based on constant attributes and thresholds. Therefore, in simple terms, the present disclosure generally proposes to classify the audio content into contexts before objective evaluation, and in each context, a unique set of key attributes and thresholds which is close to the subjective evaluation may be used, so that the subjective evaluation can be predicted more reliably by objective metrics.
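As a minimal illustrative sketch of this context-dependent weighting of attributes, the contrasting role of reverberation in music versus speech could be captured by per-context weight tables. All attribute names and weight values below are invented for illustration and are not part of the disclosure:

```python
# Hypothetical sketch: the same measured attribute (e.g., reverberation)
# counts toward quality in a music context but against it in a speech
# context. All names and weight values here are invented for illustration.
CONTEXT_ATTRIBUTE_WEIGHTS = {
    "music":  {"reverb": 0.4,  "loudness": 0.3, "noise": -0.3},
    "speech": {"reverb": -0.5, "loudness": 0.2, "noise": -0.3},
}

def context_score(context: str, attributes: dict) -> float:
    """Weighted sum of measured attribute values under a context profile."""
    weights = CONTEXT_ATTRIBUTE_WEIGHTS[context]
    return sum(weights.get(name, 0.0) * value
               for name, value in attributes.items())
```

With such a table, an identical reverberation measurement contributes positively under the "music" profile and negatively under the "speech" profile, mirroring the example above.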
Referring now to the drawings, Fig. 1 is a schematic illustration showing an example system architecture 100 for performing objective audio content quality assessment according to embodiments of the present disclosure. It may be worthwhile to note that, as will become apparent in the description below, this system architecture is merely one possible example for illustrative purposes only and should not be understood to constitute a limitation of any kind, as can also be understood and appreciated by the skilled person.
Particularly, as illustrated in the system architecture 100 of Fig. 1, audio content 110 (e.g., in the form of an audio file, an audio clip, an (online) audio stream, or the like) is obtained. The audio content 110 is applied to a (context-relevant) audio content quality assessment 120 which will be discussed in more detail below. Subsequently, an objective quality evaluation 130 is performed. Optionally, though not explicitly shown in Fig. 1, a respective evaluation/assessment result, e.g., in the form of an overall score, scores per segment, a summary/report, statistics, figures, or the like, may be obtained, such that, depending on various implementations and/or circumstances, further suitable processing (e.g., to further improve audio quality) may be performed based on such evaluation/assessment result.
Now reference is made to Fig. 2, which schematically illustrates a (more detailed) flow diagram 200 of an example objective audio content quality assessment according to embodiments of the present disclosure. Similar to Fig. 1, it is to be noted that diagram 200 as shown in Fig. 2 merely represents one possible implementation and any other suitable implementation may of course be feasible as well. Moreover, identical or like reference numbers (or blocks) in Fig. 2 may, unless indicated otherwise, indicate identical or like
elements in Fig. 1, such that repeated description thereof may be omitted for reasons of conciseness.
In simple words, Fig. 2 schematically illustrates an example basic flowchart 200 of the proposed context-relevant objective audio content quality assessment, where the content type is defined based on the use case and available training or sample dataset (block 220), then the input audio content (block 210) is classified and segmented into mono-context segments (block 230) and evaluated by corresponding objective assessment methods or algorithms respectively (block 240) in order to provide the final overall objective assessment (block 250).
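The flow through blocks 230 to 250 could be sketched, under invented names and interfaces, roughly as follows (this is one possible illustration, not a definitive implementation of the disclosure):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Segment:
    start: float   # seconds
    end: float
    context: str   # dominant context type, e.g. "speech" or "music"

def assess(audio,
           segmenter: Callable[[object], List[Segment]],
           evaluators: Dict[str, Callable[[object, "Segment"], float]]) -> float:
    """Sketch of blocks 230-250: cut the content into mono-context
    segments, evaluate each with the method matched to its context,
    then aggregate with a duration-weighted mean."""
    segments = segmenter(audio)
    total = sum(s.end - s.start for s in segments)
    return sum((s.end - s.start) / total * evaluators[s.context](audio, s)
               for s in segments)
```

Here the segmenter and the per-context evaluators are assumed to be supplied externally; the disclosure leaves their concrete realization (rule-based, model-based, or manual) open.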
To be more specific, the context type definition block 220 may be configured to define suitable context types for subsequent audio content classification. This context type definition may be based on any suitable aspect of the content, including (but not limited to) the target of the audio content, the ambience, background, surrounding or environment of the audio content, the recording format of the audio content, the channel layout (e.g., mono, stereo, 5.1, 7.1, etc.) of the content, the speaker genders (e.g., male, female, non-binary, etc.), or a combination thereof. Of course, any other suitable context type may be defined or determined depending on various implementations and/or requirements, as can also be understood and appreciated by the skilled person. One of the main reasons to classify the audio context is to help improve the objective assessment reliability by mapping the objective evaluations to the matched and explicit context. Therefore, generally speaking, the basic principle here may be to classify the audio based on the importance and balance of attributes that may be considered to affect the subjective evaluation, so that the objective evaluation methods/algorithms can be performed to predict the subjective assessment reliably. Notably, the context type definition can be conducted manually, automatically, or in a combination thereof.
For instance, in some possible examples, if all the testing contents are mono speech, it may be possible to define the context type based on the gender of the speaker. But in some other possible examples, if there are some stereo contents included (in addition to the mono speeches), the channel number may or should be considered as well, since the spatial information of stereo data may generally be considered to be of importance in subjective evaluation.
Notably, in practice, there may be a very small or large number of context types, depending
on various requirements, such as the usage or use case, the diversity of audio contents, the size of the testing audio contents, or the like. Accordingly, in some possible implementations, the context definition may be a hierarchical process. For instance, the base classes (of the context types) may be defined to correspond to the key attributes, and more detailed attributes may then be considered for classification in the next level(s). Of course, as has been illustrated earlier, any other suitable arrangement, such as a parallel, tree-like, or even star-like arrangement, of the context type definition may be feasible, depending on various implementations and/or circumstances.
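A hierarchical context definition of the kind just described could be represented, for illustration only, as a nested mapping of base classes to finer sub-classes (all context names below are invented examples):

```python
# Hypothetical hierarchical context definition: base classes correspond to
# key attributes, and finer distinctions appear at the next level.
# All context names are illustrative only.
CONTEXT_HIERARCHY = {
    "speech": {"conversation": {}, "lecture": {}},
    "music": {"classical": {}, "pop": {}},
    "ambience": {},
}

def leaf_contexts(tree: dict, prefix: str = "") -> list:
    """Flatten the hierarchy into fully qualified leaf context names."""
    leaves = []
    for name, children in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        leaves.extend(leaf_contexts(children, path) if children else [path])
    return leaves
```

The leaves of such a tree would then be the contexts actually assigned to segments, while the upper levels capture the coarse-to-fine classification order.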
In some possible examples, each context type may be related to a corresponding objective evaluation method/algorithm with certain corresponding (e.g., pre-determined or pre-configured) parameters or metric thresholds. Thus, in some possible cases, it may be considered useful or sometimes even necessary to have enough subjective evaluation cases for each context, since the objective method/algorithm and the related parameters should be determined based on reliable subjective experience; otherwise, the objective evaluation cannot be considered reliable in that context.
Particularly, due to the diversity and/or complexity of the audio content in real life, the testing audio content may switch between several contexts, especially for streaming media. Therefore, they cannot or should not be easily evaluated by only one objective evaluation system. Accordingly, the context classification and segmentation block 230 may be configured to cut the audio content into segments each of which may only contain one (dominant) context, so that the evaluation can be subsequently conducted for each audio segment.
To be more specific, in some possible example implementations, the segmentation itself may be based on the corresponding context classification. Notably, in general, the classification may be considered to be reliable only if the input audio signal is long enough. Therefore, in some possible cases, there may be some iteration between respective segmentation and classification. For instance, it may be possible to start from a (relatively) raw classification, for example for short audio clips, and segment the audio content based on such raw classification. Then, based on the obtained (raw) segments, the classification result may be further refined to provide more accurate results for subsequent finer segmentation, in an iterative manner. Since the context classification and segmentation processes could be performed iteratively, the sequence (as to whether the context classification or the
segmentation starts first) generally does not really matter. Thus, in the present disclosure, the term context classification and segmentation block (or the like) may also be referred to as segmentation and context classification, such that an analogous or similar description illustrated above likewise applies.
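The iterative loop described above could be sketched as follows, where a 3-frame majority vote stands in for the refinement step (a deliberate simplification; a real system might instead re-run a classifier on the raw segments). All function names here are illustrative assumptions:

```python
def iterative_segment(frames, classify, rounds: int = 2):
    """Sketch of the iterative loop: a raw per-frame classification is
    smoothed over a few rounds (here by a 3-frame majority vote, standing
    in for a refined re-classification), and cuts are then placed wherever
    the refined label changes."""
    labels = [classify(f) for f in frames]  # raw classification
    for _ in range(rounds):
        smoothed = labels[:]
        for i in range(1, len(labels) - 1):
            trio = labels[i - 1:i + 2]
            smoothed[i] = max(set(trio), key=trio.count)  # majority of 3
        labels = smoothed
    # segmentation: cut wherever the (refined) label changes
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i, labels[start]))
            start = i
    return segments
```

Spurious single-frame label flips are removed by the smoothing rounds, so the final cuts approximate mono-context segments.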
Notably, this (possibly iterative optimization) process may be model-based, rule-based, or implemented in any other suitable manner. Furthermore, it is also to be noted that the context classification process itself may also be realized as rule-based or model-based, or even conducted manually. In some possible cases, the actual implementation of the classification process may be considered to be related to the definition of context type illustrated earlier. For instance, if the context is defined based on the target signal, such as speech, music, or ambient sound, then the context classification may be implemented as simply as a speech/music/ambience classifier (or in any other suitable manner). Of course, as can also be understood and appreciated by the skilled person, such context classification may be implemented based on AI-based model training, feature analysis, rule-based classification, manual annotation, or the like, depending on various use cases and/or circumstances.
Furthermore, the objective quality evaluation block 240 may be configured to contain one or more objective evaluation (sub-)systems of certain method(s), algorithm(s), objective metric(s), threshold(s), weight(s) and evaluation rule(s). Each (sub-)system may be configured to be mapped with a respective context, and more particularly, should be sensitive to the respective key attributes in order to effectively reflect the subjective preference of such content/context.
For instance, in some possible cases, for possible contexts of mono speech, stereo music, and binaural ambient sound, the respective attributes and corresponding metrics would be different, so each necessary/associated key attribute should be included in the respective objective method/algorithm. In comparison, in some other possible cases, if the contexts are different types of speech, such as spontaneous conversation and lecture, then they may use the same set of metrics but the parameters should be adjusted to match the respective subjective assessments. As some illustrative examples, in the case of speech content, when the use case is indoor, the quality of the speech sound (e.g., loudness, etc.) may need to be considered (e.g., prioritized), whereas for the outdoor use case, the intelligibility or detectability of the speech may need to be considered (e.g., prioritized).
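The "same metrics, different parameters" case for two speech sub-contexts could be illustrated with a shared metric gated by context-dependent thresholds. The profile names and all numeric values below are invented placeholders; in practice they would be tuned against subjective evaluation data:

```python
# Hypothetical example: two speech sub-contexts reuse the same metric set
# but with different thresholds/weights. All values below are invented,
# not derived from any subjective experiment.
SPEECH_PROFILES = {
    "conversation": {"min_snr_db": 15.0, "intelligibility_weight": 0.5},
    "lecture":      {"min_snr_db": 25.0, "intelligibility_weight": 0.8},
}

def passes_snr_gate(context: str, snr_db: float) -> bool:
    """Same SNR metric, context-dependent threshold."""
    return snr_db >= SPEECH_PROFILES[context]["min_snr_db"]
```

An identical measured SNR could thus be acceptable for spontaneous conversation yet insufficient for a lecture recording, which is the kind of per-context adjustment the passage describes.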
Finally, the (overall) quality assessment block 250 may be configured to summarize the
objective results of all (or a part of) the segments, thereby providing the final assessment for the whole audio content.
As can be understood and appreciated by the skilled person, such (overall) quality assessment may be implemented in any suitable form. For instance, in some possible cases, a possible assessment may be as simple as giving a respective score for each segment and obtaining or calculating the weighted means thereof as the final result. But in some other possible cases, it may be considered more practical to show more detailed information, such as the length of each segment, the performance of each attribute, or the like. Notably, for the use cases that need to further improve the audio content later based on this quality assessment, the evaluation may also be used for example as a diagnostic suggestion or cost function for further analysis or to guide the next iteration in some possible implementations. Of course, any other suitable implementation for the (overall) quality assessment may be feasible as well. This is not to be limited in the present disclosure.
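The simple aggregation just mentioned (per-segment scores combined into a weighted mean, plus optional per-segment detail) could look roughly as follows; the report structure is an invented illustration of one possible output form:

```python
def overall_report(segment_results):
    """One possible form of the final assessment: a duration-weighted
    overall score plus per-segment detail. The dictionary layout is
    illustrative only."""
    total = sum(r["duration"] for r in segment_results)
    overall = sum(r["duration"] * r["score"] for r in segment_results) / total
    return {
        "overall_score": round(overall, 3),
        "segments": [
            {"context": r["context"],
             "duration": r["duration"],
             "score": r["score"]}
            for r in segment_results
        ],
    }
```

The per-segment entries retained in the report are what would serve as diagnostic suggestions for a subsequent improvement iteration.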
Next, reference is made to Fig. 3, which schematically illustrates an example system diagram 300 of an objective audio content quality assessment according to embodiments of the present disclosure. Notably, identical or like reference numbers (or blocks) in Fig. 3 may, unless indicated otherwise, indicate identical or like elements in Fig. 1 or 2, such that repeated description thereof may be omitted for reasons of conciseness. Although Figs. 2 and 3 may appear to look similar, it may be worthwhile to note that Fig. 2 may be seen to depict the present disclosure more from a functional flow perspective whereas Fig. 3 may be seen to depict the present disclosure more from a system architecture perspective.
As can be seen from the example system diagram 300 of Fig. 3, broadly speaking, in order to (objectively) predict the subjective assessment of the audio content, a (context-relevant) objective audio quality assessment system is generally proposed that comprises the following:
• a context type definition module 320 configured to define or determine the context types and optionally to assign each context to a matched objective assessment method/algorithm;
• a context classification and segmentation module 330 configured to cut the input audio content 310 into segments in which the content (of each segment) is dominated by one context;
• an objective quality evaluation module 340 (which itself may comprise or be
implemented as one or more sub-modules 340-1, 340-2, ..., 340-N) configured to assess the quality of each segment according to its respective context; and
• optionally, an overall assessment module 350 configured to provide the final assessment 360 based on the objective evaluation of the segments (or a part thereof).
These modules will now be described in more detail as follows.
To begin with, as illustrated earlier with respect to Fig. 2, the context type definition module 320 may be configured to define the set of context types, for example in accordance with the use case(s), such as broadcast services, TV shows, vlogs, etc. For instance, if the underlying use case is a vlog (which is generally considered to represent (non-professional) user-generated content), then the possible context types may be defined to comprise at least speech, music and ambient sound. In some possible examples, the context type definition module 320 may be further configured to assign one (or more) respective matched objective assessment method(s)/algorithm(s) to each type, for example based on suitable machine learning or (e.g., pre-configured heuristic) rules, or even a combination thereof.
In some possible implementations, the context type(s) may be defined according to subjective experiments on possible training content set(s), so that the attributes that may generally be considered to influence the subjective evaluation could be clarified, and a corresponding objective evaluation module could be designed in order to predict the attributes and to provide the evaluation based on subjective trade-offs between those attributes.
As noted above, in some possible implementations, the context type may also be defined according to the use case, so that the content could be discriminated effectively. For example, if the user cares more about speech, music, or animal sound quality, then the context types may at least contain speech, music, and animal sound. In some possible cases, the context types may be in a hierarchical arrangement, a parallel arrangement, a tree arrangement, or in any other suitable arrangement. For instance, following the above example, for the music contents, it may be possible to further classify them based on genres or instruments, etc.
Next, the context classification and segmentation module 330 may be configured to cut the content into segments, and each segment may be dominated by only one context, so that the objective evaluation can be conducted on segments of mono context to improve the overall reliability.
However, due to the complexity of real content and use cases in practice, each audio content
may contain one or more contexts. For example, in some possible cases, audio content may be the concatenation of dialogs and music segments, so it is generally not feasible to classify the whole content appropriately if the audio content is not segmented.
The segmentation and classification may be conducted separately, iteratively or in other orders through algorithms of heuristic rules or machine learning models, or even be conducted manually.
In some possible implementations, the context classification and segmentation process may be configured to operate jointly to fulfill the overall task. In such cases, there may be some iteration between the two parts, since they may depend on each other’s result.
As can be understood and appreciated by the skilled person, the classification process itself may be realized as a model-based or rule-based classification (or in any other suitable manner), depending on for example the context definition and/or use case. In some possible implementations, the context classification may include both frame-based and file-based methods, for example targeting different accuracy levels.
Similarly, the segmentation may also involve several methods/algorithms, including for example initial raw segmentation and one or more rounds of (gradually) refined segmentations based on the classification results.
In some possible cases, if the input content is an audio stream, the segmentation may be seen to work as a switch between the contexts based on the classification results of each frame or short clip.
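Such a streaming switch could be sketched as a function that emits a segment boundary whenever the per-frame (or per-clip) classification result changes; the function name is an illustrative assumption:

```python
def context_boundaries(frame_labels):
    """For streaming input, segmentation acts as a context switch: a new
    segment starts whenever the per-frame classification changes (sketch)."""
    if not frame_labels:
        return []
    boundaries = [0]
    for i in range(1, len(frame_labels)):
        if frame_labels[i] != frame_labels[i - 1]:
            boundaries.append(i)
    return boundaries
```

In a real stream, the same comparison would run online on each arriving frame, routing frames to the evaluator matched to the current context until the next boundary.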
Regarding the objective quality evaluation process, as noted above, this may be achieved by one or more objective quality evaluation (sub-)modules 340-1, 340-2, ..., 340-N (where the number N may be determined in accordance with the preceding segmentation process), such that objective quality evaluation may be conducted in each segment to predict the subjective evaluation through objective metrics and related rules. Any one or more of the objective quality evaluations may be based on heuristic rules and/or machine learning models, and can work with or without a reference audio signal. Also, any one or more of the objective quality evaluations may be configured to provide a quality score or statistics of objective metrics, or output in any other suitable form. Because of the consistency of the context in each segment, the key attributes that may be considered influential to the subjective evaluation can be clarified, and accordingly, the objective metrics can predict the attributes and their trade-offs of the
subjective assessment to predict the audio quality for the content. In some possible implementations, any one or more of the objective quality evaluation metrics may be based on auditory models, signal time-frequency features or statistical models of the audio signal, depending on various use cases and/or requirements.
Finally, the overall quality assessment module 350 may be configured to summarize the objective results of all (or a part of) the segments to provide a final assessment 360 for the whole content. Depending on various implementations and/or use cases, it may be the context evaluation results for one or more segments, the statistics of the metrics, a normalized overall score, or in any other suitable form.
It may be worthwhile to note that the system architecture 300 as shown in Fig. 3 merely represents one possible implementation thereof, and any other suitable implementation may of course be feasible as well, as can also be understood and appreciated by the skilled person. This is not to be limited in the present disclosure.
To summarize the above, the present disclosure generally seeks to propose a system (and a corresponding method) that may be configured to define various context types and assign a respective objective assessment method to each context type. The system may be further configured to receive input content and cut the content into segments. In each segment, the system may be configured to identify a dominant context type. The system may then be configured to evaluate each segment using the objective assessment method corresponding to the dominant context type of that segment. Finally, the system may be configured to provide an overall assessment of said input content.
Fig. 4 is a schematic flowchart illustrating an example of a method 400 of performing objective audio quality assessment of audio content according to embodiments of the present disclosure.
In particular, the method 400 as shown in Fig. 4 may start at step S410 by determining a plurality of audio context types related to the audio content. Subsequently, in step S420 the method 400 may comprise classifying and segmenting the audio content into a plurality of segments based on the determined audio context types, such that each audio segment is dominated by one audio context. The method 400 may yet further comprise at step S430 performing a respective objective audio quality evaluation on each audio segment based on the associated audio context. As will be understood and appreciated by the skilled person, the
objective audio quality evaluation may be achieved by adopting any suitable (existing or new) objective audio quality evaluation method/algorithm, depending on various implementations and/or requirements.
Configured as described above, the proposed method may generally provide an efficient yet flexible manner for performing objective audio quality assessment for audio content, which may be considered context-relevant or context-aware. Particularly, since the audio content is cut into segments where each segment is only dominated by one context, the objective evaluation can subsequently be conducted on segments of mono context thereby greatly improving the evaluation reliability. Further, also thanks to the consistency of the context in each segment, key attributes that may influence the subjective evaluation can be clarified, such that objective metrics can be used to predict the attributes and the trade-offs of the subjective assessment, in order to evaluate or predict the audio quality for the content.
Finally, the present disclosure likewise relates to apparatus for performing methods and techniques described throughout the present disclosure. Fig. 5 generally shows an example of such apparatus 500. In particular, the apparatus 500 comprises a processor 510 and a memory 520 coupled to the processor 510. The memory 520 may store instructions for the processor 510. The processor 510 may also receive, among others, suitable input data 530 (e.g., audio signal, audio stream, audio clip, audio file, etc.), depending on various use cases and/or implementations. The processor 510 may be adapted to carry out the methods/techniques (e.g., the method 400 as illustrated above with reference to Fig. 4) described throughout the present disclosure and to generate correspondingly output data 540 (e.g., an overall quality assessment), depending on various use cases and/or implementations.
Interpretation
A computing device implementing the techniques described above can have the following example architecture. Other architectures are possible, including architectures with more or fewer components. In some implementations, the example architecture includes one or more processors (e.g., dual-core Intel® Xeon® Processors), one or more output devices (e.g., LCD), one or more network interfaces, one or more input devices (e.g., mouse, keyboard, touch-sensitive display) and one or more computer-readable mediums (e.g., RAM, ROM, SDRAM, hard disk, optical disk, flash memory, etc.). These components can exchange communications and data over one or more communication channels (e.g., buses), which can utilize various hardware and software for facilitating the transfer of data and control signals
between components.
The term “computer-readable medium” refers to a medium that participates in providing instructions to processor for execution, including without limitation, non-volatile media (e.g., optical or magnetic disks), volatile media (e.g., memory) and transmission media. Transmission media includes, without limitation, coaxial cables, copper wire and fiber optics.
Computer-readable medium can further include operating system (e.g., a Linux® operating system), network communication module, audio interface manager, audio processing manager and live content distributor. Operating system can be multi-user, multiprocessing, multitasking, multithreading, real time, etc. Operating system performs basic tasks, including but not limited to: recognizing input from and providing output to network interfaces and/or devices; keeping track of and managing files and directories on computer-readable mediums (e.g., memory or a storage device); controlling peripheral devices; and managing traffic on the one or more communication channels. Network communications module includes various components for establishing and maintaining network connections (e.g., software for implementing communication protocols, such as TCP/IP, HTTP, etc.).
Architecture can be implemented in a parallel processing or peer-to-peer infrastructure or on a single device with one or more processors. Software can include multiple software components or can be a single body of code.
The described features can be implemented advantageously in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. A computer program is a set of instructions that can be used, directly or indirectly, in a computer to perform a certain activity or bring about a certain result. A computer program can be written in any form of programming language (e.g., Objective-C, Java), including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, a browser-based web application, or other unit suitable for use in a computing environment.
Suitable processors for the execution of a program of instructions include, by way of example, both general and special purpose microprocessors, and the sole processor or one of multiple processors or cores, of any kind of computer. Generally, a processor will receive
instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memories for storing instructions and data. Generally, a computer will also include, or be operatively coupled to communicate with, one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks. Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits).
To provide for interaction with a user, the features can be implemented on a computer having a display device such as a CRT (cathode ray tube) or LCD (liquid crystal display) monitor or a retina display device for displaying information to the user. The computer can have a touch surface input device (e.g., a touch screen) or a keyboard and a pointing device such as a mouse or a trackball by which the user can provide input to the computer. The computer can have a voice input device for receiving voice commands from the user.
The features can be implemented in a computer system that includes a back-end component, such as a data server, or that includes a middleware component, such as an application server or an Internet server, or that includes a front-end component, such as a client computer having a graphical user interface or an Internet browser, or any combination of them. The components of the system can be connected by any form or medium of digital data communication such as a communication network. Examples of communication networks include, e.g., a LAN, a WAN, and the computers and networks forming the Internet.
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data (e.g., an HTML page) to a client device (e.g., for purposes of displaying data to and receiving user input from a user interacting with the client device). Data generated at the client device (e.g., a result of the user interaction) can be received from the client device at the server.
A system of one or more computers can be configured to perform particular actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform particular actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any inventions or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or variation of a sub-combination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous.
Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the present invention discussions utilizing terms such as “processing”, “computing”, “calculating”, “determining”, “analyzing” or the like, refer to the action and/or processes of a computer or computing system, or similar electronic computing devices, that manipulate and/or transform data represented as physical, such as electronic, quantities into other data similarly represented as physical quantities.
Reference throughout this invention to “one example embodiment”, “some example embodiments” or “an example embodiment” means that a particular feature, structure or
characteristic described in connection with the example embodiment is included in at least one example embodiment of the present invention. Thus, appearances of the phrases “in one example embodiment”, “in some example embodiments” or “in an example embodiment” in various places throughout this invention are not necessarily all referring to the same example embodiment. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner, as would be apparent to one of ordinary skill in the art from this invention, in one or more example embodiments.
As used herein, unless otherwise specified the use of the ordinal adjectives “first”, “second”, “third”, etc., to describe a common object, merely indicate that different instances of like objects are being referred to and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
Also, it is to be understood that the phraseology and terminology used herein are for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” or “having” and variations thereof are meant to encompass the items listed thereafter and equivalents thereof as well as additional items. Unless specified or limited otherwise, the terms “mounted”, “connected”, “supported”, and “coupled” and variations thereof are used broadly and encompass both direct and indirect mountings, connections, supports, and couplings.
In the claims below and the description herein, any one of the terms comprising, comprised of or which comprises is an open term that means including at least the elements/features that follow, but not excluding others. Thus, the term comprising, when used in the claims, should not be interpreted as being limitative to the means or elements or steps listed thereafter. For example, the scope of the expression a device comprising A and B should not be limited to devices consisting only of elements A and B. Any one of the terms including or which includes or that includes as used herein is also an open term that also means including at least the elements/features that follow the term, but not excluding others. Thus, including is synonymous with and means comprising.
It should be appreciated that in the above description of example embodiments of the present invention, various features of the present invention are sometimes grouped together in a single example embodiment, figure, or description thereof for the purpose of streamlining the present disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention
that the claims require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed example embodiment. Thus, the claims following the Description are hereby expressly incorporated into this Description, with each claim standing on its own as a separate example embodiment of this invention.
Furthermore, while some example embodiments described herein include some but not other features included in other example embodiments, combinations of features of different example embodiments are meant to be within the scope of the present invention, and form different example embodiments, as would be understood by those skilled in the art. For example, in the following claims, any of the claimed example embodiments can be used in any combination.
In the description provided herein, numerous specific details are set forth. However, it is understood that example embodiments of the present invention may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Thus, while there has been described what are believed to be the best modes of the present invention, those skilled in the art will recognize that other and further modifications may be made thereto without departing from the spirit of the present invention, and it is intended to claim all such changes and modifications as fall within the scope of the present invention. For example, any formulas given above are merely representative of procedures that may be used. Functionality may be added or deleted from the block diagrams and operations may be interchanged among functional blocks. Steps may be added or deleted to methods described within the scope of the present disclosure.
Various aspects of the present invention may be appreciated from the following Enumerated Example Embodiments (EEEs):
EEE1. A method and system of objective audio quality assessment for diverse audio content that can evaluate the quality of a sound file or stream for content creation, analysis, transformation and playback systems, in both off-line and real-time usage, including the modules of:
a. A context type definition module to define the context types and assign each context to a matched objective assessment method;
b. A segmentation and classification module to cut the content into segments in which the content is dominated by one context;
c. An objective quality evaluation module to assess the quality of each segment according to its context; and
d. An overall assessment module to provide the final assessment based on the objective evaluation of each segment.
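As a rough illustration only, the four modules of EEE1 might be wired together as sketched below. The context labels, the constant evaluator scores, and the duration-weighted aggregation rule are all hypothetical placeholders chosen for the sketch, not the claimed implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Segment:
    start: float   # seconds
    end: float
    context: str   # dominant context type for this segment

# Segmentation/classification module (toy stand-in): assumes the input is
# already a list of (duration_seconds, context_label) pairs, not raw audio.
def classify_and_segment(audio: List[Tuple[float, str]]) -> List[Segment]:
    segments, t = [], 0.0
    for duration, context in audio:
        segments.append(Segment(t, t + duration, context))
        t += duration
    return segments

# Context type definition: each context type is assigned a matched objective
# assessment method. The constant scores are placeholders for real metrics.
EVALUATORS: Dict[str, Callable[[Segment], float]] = {
    "speech": lambda seg: 4.2,   # e.g. a speech-quality metric, 1-5 scale
    "music":  lambda seg: 3.8,   # e.g. a music-fidelity metric, 1-5 scale
}

# Overall assessment module: here, a duration-weighted mean of the
# per-segment objective scores (one of several possible statistics).
def assess(audio: List[Tuple[float, str]]) -> float:
    segments = classify_and_segment(audio)
    total = sum(s.end - s.start for s in segments)
    return sum((s.end - s.start) * EVALUATORS[s.context](s)
               for s in segments) / total
```

In this sketch, a 10 s speech segment followed by a 5 s music segment yields an overall score that weights the speech metric twice as heavily as the music metric.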
EEE2. As in EEE1, where the objective audio quality assessment can work with the following processing options: audio content is input as a whole file, clips or an online stream.
EEE3. As in EEE1, where the audio quality assessment can be based on heuristic rules; machine-learning-based methods such as decision trees, AdaBoost, GMM, SVM, HMM, DNN, CNN or RNN; or a hybrid thereof.
EEE4. As in EEE1, where the audio quality assessment includes a method to define the type of contexts for context classification.
EEE5. As in EEE1 or EEE4, where the context type definition can be based on target content, background sound, recording format and other factors, and can be arranged in a hierarchical, parallel or tree structure.
EEE6. As in EEE1 or EEE5, where the context type definition can be conducted manually or automatically based on heuristic rules or machine learning models.
EEE7. As in EEE1, where the segmentation and classification can be conducted separately, iteratively or in other orders through algorithms of heuristic rules or machine learning models, or be conducted manually.
EEE8. As in EEE1, where each of the objective quality evaluation metrics can be based on auditory models, signal time-frequency features or statistical models of the audio signal.
EEE9. As in EEE1, where each objective quality evaluation can be based on heuristic rules or machine learning models, and can work with or without a reference audio signal.
EEE10. As in EEE1, where each objective quality evaluation can provide a quality score or statistics of objective metrics.
EEE11. As in EEE1, where the overall assessment can be statistics of the objective evaluation results, or part of the evaluation results, from some or all of the segments.
Claims
1. A method of performing objective audio quality assessment of audio content, comprising: determining a plurality of audio context types related to the audio content; classifying and segmenting the audio content into a plurality of segments based on the determined audio context types, such that each audio segment is dominated by one audio context; and performing a respective objective audio quality evaluation on each audio segment based on the associated audio context.
2. The method according to claim 1, wherein the plurality of audio context types is determined based on a use case relating to the audio content.
3. The method according to claim 1 or 2, wherein the plurality of audio context types comprise at least one of: speech, music, or ambient sound.
4. The method according to any one of the preceding claims, wherein the plurality of audio context types is determined based on at least one of: an ambience or background of the audio content, a channel layout of the audio content, a recording format of the audio content, or a speaker gender.
5. The method according to any one of the preceding claims, wherein the plurality of audio context types is in a hierarchical arrangement, a parallel arrangement or a tree arrangement.
6. The method according to any one of the preceding claims, wherein the plurality of audio context types is determined based on machine learning, heuristic rules, or a combination thereof.
7. The method according to any one of the preceding claims, wherein the classification and segmentation of the audio content is performed automatically, manually, or as an iterative process.
8. The method according to claim 7, wherein the classification and segmentation of the audio content is performed as an iterative process and the iterative process comprises: performing a classification of the audio content; segmenting the audio content based on the classification; and refining the classification based on the segmented audio content for finer segmentation.
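The classify-segment-refine loop of claim 8 could, purely as a sketch, be realized with per-frame labels, a run-merging segmenter, and a finer classifier applied within each coarse segment. The frame representation and the `coarse`/`fine` classifier functions are illustrative assumptions, not part of the claimed method.

```python
from typing import Callable, List, Tuple

def segment_runs(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Merge consecutive frames sharing a label into (start, end, label) runs."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i, labels[start]))
            start = i
    return segments

def iterative_classify_segment(
    frames: List[float],
    coarse: Callable[[float], str],
    fine: Callable[[float], str],
) -> List[Tuple[int, int, str]]:
    labels = [coarse(f) for f in frames]      # 1) initial classification
    segments = segment_runs(labels)           # 2) initial segmentation
    refined: List[str] = []
    for start, end, _ in segments:            # 3) refine labels per segment
        refined.extend(fine(f) for f in frames[start:end])
    return segment_runs(refined)              # 4) finer segmentation
```

For example, a coarse speech/other classifier followed by a finer music/ambient split turns one "other" segment into two finer segments on the refinement pass.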
9. The method according to any one of the preceding claims, wherein the classification of the audio content is based on one or more of model-based training, feature analysis, rule-based classification, or manual annotation.
10. The method according to any one of the preceding claims, further comprising: determining a respective objective audio quality evaluation algorithm associated with each audio context type for performing the respective objective audio quality evaluation.
11. The method according to any one of the preceding claims, wherein each objective audio quality evaluation is configured for obtaining a subjective evaluation associated with the respective audio context through one or more objective metrics and related evaluation rules.
12. The method according to any one of the preceding claims, wherein the objective audio quality evaluation is rule-based, model-based, or a combination thereof.
13. The method according to any one of the preceding claims, further comprising: determining an overall audio quality assessment for the audio content based on one or more results of the objective audio quality evaluation.
14. The method according to claim 13, wherein the overall audio quality assessment comprises at least one of: a summary or report of one or more objective audio quality evaluation results, a normalized or weighted score determined based on one or more objective audio quality evaluation results, or statistics thereof.
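One way to read the "normalized or weighted score" option in claim 14 is sketched below: heterogeneous per-evaluation scores are first mapped onto a common range and then combined by weights. The [0, 1] target range, the metric ranges, and the weights are illustrative assumptions, not part of the claim.

```python
from typing import List, Tuple

def normalize(score: float, lo: float, hi: float) -> float:
    """Map a metric score from its native range [lo, hi] onto [0, 1]."""
    return (score - lo) / (hi - lo)

def overall_assessment(
    results: List[Tuple[float, Tuple[float, float], float]]
) -> float:
    """results: (raw_score, (lo, hi), weight) triples, one per evaluation."""
    weighted = [(normalize(s, lo, hi), w) for s, (lo, hi), w in results]
    total_w = sum(w for _, w in weighted)
    # Weighted mean of the normalized scores.
    return sum(n * w for n, w in weighted) / total_w
```

Normalizing first lets, say, a 1-5 speech score and a 0-1 music score contribute to one overall figure without either scale dominating.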
15. The method according to any one of the preceding claims, wherein the audio
content comprises at least one of: an audio file, an audio frame, an audio clip, or an audio stream.
16. A system configured for performing objective audio quality assessment of audio content, comprising: a context type definition module configured to determine a plurality of audio context types related to the audio content; a context classification and segmentation module configured to classify and segment the audio content into a plurality of audio segments based on the determined audio context types, such that each audio segment is dominated by one audio context; and one or more objective audio quality evaluation modules configured to perform a respective objective audio quality evaluation on each respective audio segment based on the associated audio context.
17. An apparatus, comprising a processor and a memory coupled to the processor, wherein the processor is adapted to cause the apparatus to carry out the method according to any one of the preceding claims.
18. A program comprising instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1 to 16.
19. A computer-readable storage medium storing the program according to claim 18.
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CNPCT/CN2024/091074 | 2024-05-02 | ||
| CN2024091074 | 2024-05-02 | ||
| US202463655839P | 2024-06-04 | 2024-06-04 | |
| US63/655,839 | 2024-06-04 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025230732A1 (en) | 2025-11-06 |
Family
ID=95783823
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2025/025016 (pending; WO2025230732A1 (en)) | Objective audio content quality assessment | 2024-05-02 | 2025-04-16 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025230732A1 (en) |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2022112594A2 (en) * | 2020-11-30 | 2022-06-02 | Dolby International Ab | Robust intrusive perceptual audio quality assessment based on convolutional neural networks |
| US20230215460A1 (en) * | 2022-01-06 | 2023-07-06 | Microsoft Technology Licensing, Llc | Audio event detection with window-based prediction |
Non-Patent Citations (1)
| Title |
|---|
| Alexander Lerch, "Audio Content Analysis", arXiv.org, Cornell University Library, 1 January 2021 (2021-01-01), XP081851046 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 25727011; Country of ref document: EP; Kind code of ref document: A1 |