WO2007034375A2 - Determination of a distortion measure for audio encoding
- Publication number
- WO2007034375A2 (PCT/IB2006/053261)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- encoding
- audio signal
- frequency domain
- response
- signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/0204—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders using subband decomposition
Definitions
- the invention relates to determination of one or more distortion measures for audio encoding and in particular, but not exclusively, to determination of distortion measures associated with different quantization options.
- Digital encoding of various source signals has become increasingly important over the last decades as digital signal representation and communication increasingly has replaced analogue representation and communication.
- mobile telephone systems, such as the Global System for Mobile communication, are based on digital speech encoding.
- distribution of media content is increasingly based on digital content encoding.
- the merit of these models is that they can predict the effects of the temporal properties of the original signal on the masking strength that is created by the signal.
- the models may allow an accurate determination of a distortion measure associated with a given quantization or encoding option which is used by the audio encoder.
- the models are not well suited for practical applications as they are very complex to evaluate and require exceedingly high computational resource.
- the structure of the models is such that in order to determine how the quantization noise floor needs to be shaped, the model calculates a so called internal representation of the original signal and of the quantized signal. These internal representations are then compared and the difference can be interpreted as an indication of the audibility of the quantization noise.
- this process needs to be repeated many times for each encoding segment before a suitable noise floor shape is found that does not lead to audible artefacts but results in an efficient encoding. For this reason, these advanced models are currently not applied in audio coding thereby resulting in suboptimal encoding and a degraded signal quality to data rate ratio.
- an improved system for determining distortion measures and audio encoding would be advantageous and in particular a system allowing increased flexibility, reduced complexity, reduced computational demand, improved distortion measurement determination and/or improved audio encoding would be advantageous.
- an apparatus for determining a distortion measure for an audio encoding of an audio signal comprising: first means for generating a frequency domain sensitivity characteristic for the audio signal in response to a predetermined error estimate and a time domain analysis of the audio signal using a time domain model having a time scale smaller than a duration of an analysis segment of the audio encoding; second means for determining the distortion measure in response to a frequency domain representation of the audio signal, a frequency domain error signal and the frequency domain sensitivity characteristic.
- the invention may allow separate determination of a frequency domain sensitivity characteristic taking temporal characteristics into account followed by a frequency domain determination of a distortion measure.
- the invention may provide reduced computational resource demand.
- the invention may enable or facilitate the use of advanced temporal psycho-acoustical models in practical encoders.
- the invention may allow an improved audio encoding.
- the invention may improve performance and/or reduce resource requirements in audio encoders that evaluate a plurality of encoding options before selecting the encoding option providing the lowest perceivable distortion.
- the distortion measure may be a perceptual distortion measure.
- the audio encoding may be segmented and the time scale of the time domain model may be smaller than the segmentation interval.
- the frequency domain representations may be obtained by block based time domain to frequency domain transforms corresponding to the segments.
- the time domain model may specifically be a psycho-acoustical perceptual model taking into account temporal masking variations within the individual audio encoding segments of the audio encoding.
- the first means comprises means for dividing the audio signal into psycho-acoustic bands and is arranged to perform the time domain analysis on the individual psycho-acoustic bands.
- the time domain model may be applied to each individual psycho-acoustic band and may in particular be evaluated individually for each psycho-acoustic band without consideration of any other band.
- the feature may allow efficient and/or accurate distortion measurement determination and may result in improved audio encoding.
- the first means comprises means for determining an error sensitivity weighting for each psycho-acoustic band and means for determining an error sensitivity value for each psycho-acoustic band in response to a predetermined subband error estimate and the error sensitivity weighting for each psycho-acoustic band.
- the sensitivity value for a psycho-acoustic band may be determined by an individual weighting of a predetermined band error estimate without consideration of other bands.
- the feature may facilitate implementation, reduce complexity, reduce computational resource requirements and/or improve performance.
- the apparatus further comprises conversion means for generating an error sensitivity value for each frequency subband of the audio encoding in response to error sensitivity values of each of the psycho-acoustic bands.
- the conversion means is arranged to determine the error sensitivity value for each frequency subband of the audio encoding in response to a characteristic of an audio encoding filter of the audio encoding.
- the time domain model comprises first weighting means for determining a first weight value for the predetermined error estimate in response to a forward masking model.
- the first weight value may be determined individually for each psycho-acoustic band and may be applied individually to each psycho-acoustic band. Thus, psycho-acoustic bands may be processed individually and separately.
- the first weighting means is arranged to reduce the first weight value for increasing signal levels of the audio signal.
- the weight value may be reduced for increasing signal levels to indicate a reduced sensitivity to errors.
- in embodiments where a larger weight value indicates a lower sensitivity, the weight value may instead be increased for increasing signal levels.
- the signal level of the audio signal may be a signal level in a given time interval and/or following low pass filtering with a given dynamic performance and/or may be in the individual psycho-acoustic bands.
- the feature may allow a practical determination of an accurate sensitivity indication taking into account temporal characteristics.
- the first weighting means is arranged to limit the first weight value.
- the first weighting means is arranged to normalize the first weight value in response to a signal level of the audio signal. This may improve performance and may provide a sensitivity indication more accurately reflecting the human audio perceptual characteristics.
- the time domain model comprises second weighting means for determining a second weight value for the predetermined error estimate in response to a temporal modulation model.
- the second weight value may be determined individually for each psycho-acoustic band and may be applied individually to each psycho-acoustic band. Thus, psycho-acoustic bands may be processed individually and separately.
- the second weighting means is arranged to determine the second weight value in response to a temporal signal variance of the audio signal.
- the second weight value may for example be modified to indicate reducing sensitivity for an increasing variance.
- the temporal signal variance may for example be determined as a standard deviation relative to an average value.
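- as a minimal sketch of the variance measure just described, the standard deviation of a band envelope can be divided by its average value. The function name `relative_std` and the test signals are illustrative assumptions, not part of the described apparatus.

```python
import math
from statistics import fmean, pstdev


def relative_std(envelope):
    """Standard deviation of the envelope relative to its average value."""
    mean = fmean(envelope)
    if mean == 0.0:
        return 0.0  # silent band: no meaningful modulation
    return pstdev(envelope) / mean


# Hypothetical envelopes of one psycho-acoustic band within a segment.
steady = [1.0] * 480
modulated = [1.0 + 0.8 * math.sin(2.0 * math.pi * 8 * n / 480) for n in range(480)]

print(relative_std(steady), relative_std(modulated))
```

- a strongly modulated envelope yields a larger value than a steady one, which a second weight value could then map to a reduced sensitivity.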
- the temporal signal variance may be a temporal variance within a time interval and/or of a low pass filtered version of the audio signal and/or may be individually determined for each psycho-acoustic band.
- the second means is arranged to adjust the distortion measure in response to a correlation characteristic between the frequency domain error signal and the audio signal.
- the first means is arranged to modify the frequency domain sensitivity characteristic in response to a temporal onset delay of a transient of the audio signal relative to the onset of an encoding segment of the audio encoding.
- the first means is arranged to modify the frequency domain sensitivity characteristic in response to a time envelope of the audio signal within a first frequency interval.
- an encoder comprising an apparatus as outlined above.
- the encoder comprises means for generating frequency domain representations of the audio signal and frequency domain error signals for a plurality of encoding options and means for selecting an encoding option from the plurality of encoding options in response to distortion measures determined for the plurality of encoding options by the second means.
- a method of determining a distortion measure for an audio encoding of an audio signal comprising: generating a frequency domain sensitivity characteristic for the audio signal in response to a predetermined error estimate and a time domain analysis of the audio signal using a time domain model having a time scale smaller than a duration of an analysis segment of the audio encoding; and determining the distortion measure in response to a frequency domain representation of the audio signal, a frequency domain error signal and the frequency domain sensitivity characteristic.
- a method of encoding an audio signal comprising: generating a frequency domain sensitivity characteristic for the audio signal in response to a predetermined error estimate and a time domain analysis of the audio signal using a time domain model having a time scale smaller than a duration of an analysis segment of the audio encoding; for a plurality of encoding options determining a distortion measure in response to a frequency domain representation of the audio signal, a frequency domain error signal and the frequency domain sensitivity characteristic; selecting an encoding option from the plurality of encoding options in response to the distortion measures; and encoding the audio signal using the selected encoding option.
- a method of transmitting an audio signal comprising: generating a frequency domain sensitivity characteristic for the audio signal in response to a predetermined error estimate and a time domain analysis of the audio signal using a time domain model having a time scale smaller than a duration of an analysis segment of the audio encoding; for a plurality of encoding options determining a distortion measure in response to a frequency domain representation of the audio signal, a frequency domain error signal and the frequency domain sensitivity characteristic; selecting an encoding option from the plurality of encoding options in response to the distortion measures; encoding the audio signal using the selected encoding option; and transmitting the encoded audio signal.
- a transmitter for transmitting an audio signal comprising: means for generating a frequency domain sensitivity characteristic for the audio signal in response to a predetermined error estimate and a time domain analysis of the audio signal using a time domain model having a time scale smaller than a duration of an analysis segment of the audio encoding; means for, for a plurality of encoding options, determining a distortion measure in response to a frequency domain representation of the audio signal, a frequency domain error signal and the frequency domain sensitivity characteristic; means for selecting an encoding option from the plurality of encoding options in response to the distortion measures; means for encoding the audio signal using the selected encoding option; and means for transmitting the encoded audio signal.
- Fig. 1 illustrates a transmission system 100 for communication of an audio signal in accordance with some embodiments of the invention
- Fig. 2 illustrates an example of an audio encoder in accordance with some embodiments of the invention
- Fig. 3 illustrates an example of an apparatus for determining a distortion measure in accordance with some embodiments of the invention.
- Fig. 1 illustrates a transmission system 100 for communication of an audio signal in accordance with some embodiments of the invention.
- the transmission system 100 comprises a transmitter 101 which is coupled to a receiver 103 through a network 105 which specifically may be the Internet.
- the transmitter 101 is a signal recording device and the receiver 103 is a signal player device, but it will be appreciated that in other embodiments a transmitter and receiver may be used in other applications and for other purposes.
- the transmitter 101 and/or the receiver 103 may be part of a transcoding functionality and may e.g. provide interfacing to other signal sources or destinations.
- the transmitter 101 comprises a digitizer 107 which receives an analog signal that is converted to a digital PCM signal by sampling and analog-to-digital conversion.
- the digitizer 107 is coupled to the encoder 109 of Fig. 1 which encodes the PCM signal in accordance with an encoding algorithm.
- the encoder 109 is coupled to a network transmitter 111 which receives the encoded signal and interfaces to the Internet 105.
- the network transmitter may transmit the encoded signal to the receiver 103 through the Internet 105.
- the receiver 103 comprises a network receiver 113 which interfaces to the Internet 105 and which is arranged to receive the encoded signal from the transmitter 101.
- the network receiver 113 is coupled to a decoder 115.
- the decoder 115 receives the encoded signal and decodes it in accordance with a decoding algorithm.
- the receiver 103 further comprises a signal player 117 which receives the decoded audio signal from the decoder 115 and presents this to the user.
- the signal player 117 may comprise a digital-to-analog converter, amplifiers and speakers as required for outputting the decoded audio signal.
- Fig. 2 illustrates the encoder 109 in more detail.
- the encoder 109 generates an AAC (Advanced Audio Coder) compatible encoded signal.
- the encoder 109 comprises a receiver 201 which receives the PCM time domain audio signal that is to be encoded.
- the receiver 201 is coupled to a transform processor 203 which transforms the time domain audio signal to a frequency domain representation.
- the transform processor 203 can implement a Modified Discrete Cosine Transform (MDCT) thereby generating the encoding subbands as is well known for AAC encoders.
- the transform processor 203 is coupled to an encoding unit 205 which performs the encoding of the frequency domain representation of the audio signal in accordance with the AAC standard.
- the encoding unit 205 performs a quantization of the data values of the individual subbands. The quantization applied to the individual subbands depends on the perceptual significance of the introduced quantization noise to the subbands.
- the encoding unit 205 evaluates a plurality of different possible quantization options (or settings).
- the encoding unit 205 is coupled to a distortion processor 207 which is operable to determine a distortion measure for a given quantization option. For each possible quantization option, a distortion measure is determined by the distortion processor 207 and the encoding unit 205 selects the quantization option which results in the lowest distortion.
- the encoding unit 205 then proceeds to encode the audio signal using the selected encoding option.
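- the selection step described above can be sketched as a loop over candidate quantization options that keeps the one with the lowest distortion. The uniform quantizer, the per-subband step sizes, the names `select_option` and `squared_error`, and the plain squared-error stand-in for the distortion processor 207 are all illustrative assumptions.

```python
def select_option(spectrum, step_options, distortion_fn):
    """Return (index, distortion) of the option with the lowest distortion."""
    best_index, best_distortion = None, float("inf")
    for index, steps in enumerate(step_options):
        # Uniform quantization of each subband value with its own step size.
        quantized = [s * round(v / s) for v, s in zip(spectrum, steps)]
        error = [q - v for q, v in zip(quantized, spectrum)]
        distortion = distortion_fn(spectrum, error)
        if distortion < best_distortion:
            best_index, best_distortion = index, distortion
    return best_index, best_distortion


def squared_error(spectrum, error):
    """Placeholder distortion measure: unweighted error energy."""
    return sum(e * e for e in error)


spectrum = [0.9, -2.3, 4.1, 0.2]       # hypothetical subband values
step_options = [
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.5, 2.0, 2.0],
    [2.0, 2.0, 0.5, 0.5],
]
best_index, best_distortion = select_option(spectrum, step_options, squared_error)
print(best_index, best_distortion)
```

- in a real encoder the candidate options would typically also be constrained by the available bit budget, which this sketch leaves out.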
- the encoding unit 205 is furthermore coupled to a bit stream processor 209 which generates a bit stream comprising the encoded audio signal. The bit stream is then fed to the network transmitter 111 for transmission to the receiver 103.
- the distortion processor 207 is arranged to determine a distortion measure which is indicative of the perceptual distortion that will be experienced by a human listener. Specifically, the distortion processor 207 includes an accurate perceptual model of human listening which allows an evaluation of the perceived distortion for a given quantization option to be determined.
- the distortion processor 207 comprises a temporal distortion processor 211 which generates a frequency domain sensitivity characteristic for the audio signal.
- the temporal distortion processor 211 receives the time domain audio signal from the receiver 201 and evaluates a perceptual time domain model to generate the frequency domain sensitivity characteristic.
- the temporal distortion processor 211 takes into account the temporal characteristics of the audio signal to generate a frequency domain sensitivity characteristic which is an indication of the masking characteristics of the time domain audio signal and thus indicates a user's sensitivity to quantization errors.
- the frequency domain sensitivity characteristic is determined by a time domain evaluation using a time domain model that has a time scale smaller than the duration of an analysis segment of the encoder 109.
- the transform processor 203 and the encoding unit 205 operate on encoding segments that are individually transformed into the frequency domain and encoded.
- encoding segments are typically of the order of 3 to 20 msec.
- the time scales of the time domain model of the temporal distortion processor 211 are smaller than the segmentation duration and therefore allow temporal variations within the individual encoding segment to be taken into account when determining the frequency domain sensitivity characteristic. This allows for a determination of the frequency domain sensitivity characteristic which more closely reflects a user's perception of the quantization noise.
- the frequency domain sensitivity characteristic is determined on the basis of a predetermined error estimate which does not depend on the specific audio signal or the quantization option, but rather is some estimate of the average quantization noise that can occur in a specific encoding subband.
- the frequency domain sensitivity characteristic is determined as an individual sensitivity or masking value for each encoding subband.
- the frequency domain sensitivity characteristic can indicate the sensitivity of a user to a quantization of a subband value of an encoding segment which reflects not only the frequency characteristics for the encoding segment but also the time domain characteristics within the encoding segment or even in other encoding segments.
- the time domain model is typically very complex and is evaluated only once for each encoding segment. Thus, for each encoding segment, a single frequency domain sensitivity characteristic is determined. This characteristic is then fed to the frequency distortion processor 213 which determines a distortion measure by a frequency domain evaluation.
- the frequency distortion processor 213 receives a frequency domain representation of the audio signal from the encoding unit 205 as well as a frequency domain representation of the error signal resulting from a given quantization option. A total distortion measure is then determined by evaluating the error signal relative to the frequency domain audio signal taking into account the frequency domain sensitivity characteristic. This evaluation is relatively simple and has low computational resource requirements as it is only performed in the frequency domain and using the subbands generated by the transform processor 203. Accordingly, the encoding unit 205 feeds the frequency distortion processor
- Fig. 3 illustrates the distortion processor 207 in more detail.
- the distortion processor 207 and the psycho-acoustical model which is used in the example will be described in more detail in the following.
- the distortion processor 207 comprises a temporal distortion processor 211 which implements time domain processing based on the time domain input signal that is provided to this block.
- a number of outputs are provided as inputs to the frequency distortion processor 213.
- these outputs closely resemble a masking curve presented on a perceptually relevant scale, such as the ERB (Equivalent Rectangular Bandwidth) scale described in B.R. Glasberg and B.C.J. Moore, (1990), 'Derivation of auditory filter shapes from notched-noise data', Hearing Research, Vol. 47, pp. 103-138.
- the frequency distortion processor 213 receives the inputs from the temporal distortion processor 211 containing a representation of the masking curve.
- there are separate inputs consisting of a frequency domain representation of the input audio signal and the quantization error signal determined for the quantization option for which the distortion measure is determined.
- the ability to operate on frequency domain representations of the input signals is a significant advantage of the implementation.
- some kind of transform of the input signal to the frequency domain is used for coding the signal (e.g. the Modified Discrete Cosine Transform (MDCT)). It is on this transform domain representation that quantization is performed and accordingly it is highly efficient to evaluate the perceptual distortion that is introduced by these quantization operations using the same signal representations without the need to first apply a signal transform to another domain.
- although time domain information is taken into account by the temporal distortion processor 211 in the encoder 109 of Fig. 2, this does not require that the evaluation of quantization options is also made in the time domain. Rather, the described approach allows the frequency domain data available in the encoder 109 to be directly evaluated. This provides a significant computational advantage as the determination of the appropriate quantization noise floor generally requires that a large number of quantization options are evaluated for each audio segment that needs to be encoded.
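- the computational split can be sketched as follows: the expensive temporal model runs once per segment, and every quantization option is then scored by a cheap frequency domain sum. The sensitivity-weighted squared error used here, and the names `weighted_distortion`, `evaluate_segment` and `toy_temporal_model`, are assumptions for illustration; the text does not prescribe this exact formula.

```python
def weighted_distortion(error, sensitivity):
    """Cheap per-option measure: sensitivity-weighted error energy per subband."""
    return sum(s * e * e for s, e in zip(sensitivity, error))


def evaluate_segment(time_signal, error_options, temporal_model):
    """Run the temporal model once, then score every option in the frequency domain."""
    sensitivity = temporal_model(time_signal)   # expensive, once per segment
    return [weighted_distortion(err, sensitivity) for err in error_options]


calls = {"temporal_model": 0}


def toy_temporal_model(time_signal):
    """Stand-in for the temporal distortion processor; counts its invocations."""
    calls["temporal_model"] += 1
    return [1.0, 0.5, 0.25]                     # hypothetical subband sensitivities


error_options = [[0.1, 0.1, 0.1], [0.2, 0.0, 0.0], [0.0, 0.0, 0.3]]
distortions = evaluate_segment([0.0] * 1024, error_options, toy_temporal_model)
print(distortions, calls)
```

- however many options are tried, the counter shows a single call to the temporal model per segment.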
- within the temporal distortion processor 211, a number of processing stages are present.
- the audio signal is fed to an auditory filter bank 301 (specifically a Basilar Membrane or auditory filter bank) which models properties of the basilar membrane that can be found in the cochlea of the human auditory system.
- the filter bank comprises a number of gammatone filters which are implemented as complex IIR filters, such as for example described in P.I.M. Johannesma, (1972), 'The pre-response stimulus ensemble of neurons in the cochlear nucleus' in Proceedings of the symposium on hearing theory, IPO, Eindhoven, The Netherlands.
- the filters have a band-pass characteristic which follows the estimated auditory filter bandwidth as a function of the center frequency such as e.g. proposed in B.R. Glasberg and B.C.J. Moore, (1990), 'Derivation of auditory filter shapes from notched-noise data', Hearing Research, Vol. 47, pp. 103-138.
- the filters are separated uniformly on an ERB rate scale.
- the filter bank 301 generates a plurality of psycho-acoustic bands.
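- a uniform spacing on the ERB rate scale can be sketched with the Glasberg and Moore (1990) formulas for ERB bandwidth and ERB rate. The frequency range of 50 Hz to 15 kHz and the count of 40 bands are illustrative assumptions, not values taken from this text.

```python
import math


def hz_to_erb_rate(f_hz):
    """ERB rate (number of ERBs below f_hz), Glasberg and Moore (1990)."""
    return 21.4 * math.log10(4.37e-3 * f_hz + 1.0)


def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) / 4.37e-3


def erb_bandwidth(f_hz):
    """Equivalent rectangular bandwidth in Hz at center frequency f_hz."""
    return 24.7 * (4.37e-3 * f_hz + 1.0)


def center_frequencies(f_low_hz, f_high_hz, n_bands):
    """Center frequencies spaced uniformly on the ERB rate scale."""
    lo, hi = hz_to_erb_rate(f_low_hz), hz_to_erb_rate(f_high_hz)
    step = (hi - lo) / (n_bands - 1)
    return [erb_rate_to_hz(lo + i * step) for i in range(n_bands)]


centers = center_frequencies(50.0, 15000.0, 40)
print([round(f) for f in centers[:5]], round(erb_bandwidth(1000.0), 1))
```

- the resulting filters are densely packed at low frequencies and progressively wider apart at high frequencies.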
- the following processing is performed separately and individually for each psycho-acoustic band.
- an envelope extractor 303 generates absolute signal values for the individual signals in the psycho-acoustic bands. Since the auditory filters of the filter bank 301 in the specific embodiment are implemented as complex filters, an envelope of the filtered outputs can be extracted simply by taking the absolute value of each complex filter output sample. This envelope will be used for determining the relevant temporal properties of the input signal to estimate their effect on the masking capabilities of the input signal.
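- the envelope extraction can be sketched with a complex band-pass filter whose output magnitude is taken sample by sample. The cascade of four complex one-pole sections below is a simplified stand-in for a complex gammatone filter, and the sample rate, center frequency and bandwidth are illustrative assumptions.

```python
import cmath
import math


def complex_bandpass(signal, fc_hz, bw_hz, fs_hz, order=4):
    """Cascade of complex one-pole IIR sections centered on fc_hz (unit gain at fc_hz)."""
    r = math.exp(-2.0 * math.pi * bw_hz / fs_hz)
    pole = r * cmath.exp(2j * math.pi * fc_hz / fs_hz)
    out = [complex(v) for v in signal]
    for _ in range(order):
        state = 0j
        stage = []
        for v in out:
            state = pole * state + (1.0 - r) * v
            stage.append(state)
        out = stage
    return out


def envelope(signal, fc_hz, bw_hz, fs_hz):
    """Envelope as the absolute value of each complex filter output sample."""
    return [abs(v) for v in complex_bandpass(signal, fc_hz, bw_hz, fs_hz)]


fs_hz, fc_hz = 16000.0, 1000.0
bw_hz = 24.7 * (4.37e-3 * fc_hz + 1.0)          # ERB at the center frequency
tone = [math.cos(2.0 * math.pi * fc_hz * n / fs_hz) for n in range(2000)]
env = envelope(tone, fc_hz, bw_hz, fs_hz)
tail = env[-500:]
print(sum(tail) / len(tail), max(tail) - min(tail))
```

- for a real tone of unit amplitude at the center frequency the envelope settles near 0.5, because the complex filter passes only the positive-frequency half of the tone.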
- the envelope values are provided as an input to a forward masking processor
- each adaptation loop consists of a gain control unit that is driven by the output of a low-pass filter (e.g. an RC network) operating on the input signal.
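- a chain of adaptation loops can be sketched as repeated divisions by a slowly tracking low-pass state. The feedback arrangement (each stage divides its input by its low-pass filtered output), the five time constants and the input floor are assumptions in the style of adaptation-loop models from the psycho-acoustic literature; they are not values stated in this text. In this form each stage approaches a square root in the steady state, so five stages give a strongly compressive mapping that only approximates a logarithm.

```python
import math

# Assumed time constants in seconds, one per adaptation loop.
TAUS_S = (0.005, 0.050, 0.129, 0.253, 0.500)


def adaptation_loops(envelope, fs_hz, taus_s=TAUS_S, floor=1e-5):
    """Pass an envelope through cascaded divide-by-low-pass adaptation loops."""
    # Initialise each low-pass state to its steady-state value for the floor input.
    states, level = [], floor
    for _ in taus_s:
        level = math.sqrt(level)
        states.append(level)
    coeffs = [1.0 - math.exp(-1.0 / (tau * fs_hz)) for tau in taus_s]
    output = []
    for sample in envelope:
        value = max(sample, floor)
        for i, coeff in enumerate(coeffs):
            value = value / states[i]                 # gain control unit
            states[i] += coeff * (value - states[i])  # RC-style low-pass
        output.append(value)
    return output


fs_hz = 1000.0
y_low = adaptation_loops([1.0] * 20000, fs_hz)
y_high = adaptation_loops([100.0] * 20000, fs_hz)
print(y_low[-1], y_high[-1], max(y_high))
```

- the large first output samples illustrate the onset overshoot of such loops, while the settled outputs show that a 40 dB level difference is compressed to a small output difference.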
- the transform of the adaptation loops is an approximately logarithmic transform of the input (x) to the output (y) of the adaptation loops: $y \approx 20\log_{10}(x)$
- the extracted envelope (x) is also provided to a first weight processor 307 which generates a weight value for each psycho-acoustic band.
- the weight value is used to modify a predetermined error estimate to generate a sensitivity indication for the psycho-acoustic band.
- the first weight processor 307 implements a number of functions.
- the first function is related to the net effect that is achieved by the five cascaded adaptation loops.
- the gains of all five adaptation loops can be combined into one total gain.
- the combined gain (as a function of time) can be obtained by dividing the sample-by-sample output values of the last adaptation loop, y, by the input signal of the first adaptation loop, x: $g = y / x$
- This gain function is indicative of the momentary sensitivity of the auditory system to an input signal.
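- the combined gain can be sketched as a sample-by-sample division of the loop output by the loop input. The example uses the steady-state dB mapping mentioned in this description as the output; the function name `combined_gain`, the division guard and the chosen levels are illustrative assumptions.

```python
import math


def combined_gain(x, y, floor=1e-12):
    """Sample-by-sample gain of the cascaded loops: output divided by input."""
    return [yi / max(xi, floor) for xi, yi in zip(x, y)]


levels = [10.0, 100.0, 1000.0]                          # hypothetical envelope levels
steady_output = [20.0 * math.log10(v) for v in levels]  # steady-state dB mapping
gains = combined_gain(levels, steady_output)
print(gains)
```

- the gain falls as the level rises, matching the reduced weight value for increasing signal levels.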
- when the gain is large (no adaptation), any change in the input signal will have a large effect on the output of the adaptation loops, i.e. the auditory system is very sensitive to changes in the input signal and a high weight value is determined.
- when the gain is small (much adaptation), a change in the input signal will have only a small effect on the output of the adaptation loops, i.e. the auditory system is rather insensitive to changes in the input signal and a low weight value is determined.
- This insensitivity can occur for example shortly after the offset of a sustained input signal (forward masking).
- the second function performed by the first weight processor 307 is that of limiting or clipping the weight value.
- This function is related to the observation that the adaptation loops need some time to adapt to a sudden increase in level in the input signal. This implies that in response to a large transient, the masking effect thereof will not be instantaneously reflected because of the delay of the adaptation loops. As a consequence, the adaptation loops will show a large overshoot at the onset of a masker, which quickly reduces after the onset. This prediction is not in line with the actually observed masking effects for human listeners. These observations suggest that the masking effect is already present directly after onset of the masker.
- a clipping operation is performed by the first weight processor 307.
- the steady-state effect of the adaptation loops is that they map the input signal to a dB scale.
- in the steady state, the gain function can therefore be expressed by: $g \approx 20\log_{10}(x) / x$
- the third function performed by the first weight processor 307 is related to the use of the gain function as a weight value for a predetermined estimated error signal.
- a predetermined error value is assumed and this error value is weighted by the first weight value generated by the first weight processor 307 (in the specific example, the first weight value is multiplied by the predetermined error estimate value).
- when the gain (and thus in the example also the weight value) is large, the sensitivity to the error signal is large and vice versa. From auditory perception it is known that the detectability of an error signal is roughly constant for a constant error-signal-to-masker ratio.
- the output gain is normalized such that for the steady-state case, a constant error-signal-to-masker ratio leads to constant predicted detectability of the error signal.
- the normalized gain ($g_e$) that is used in this embodiment is:
- the gain function that is derived is used as a weight value and is multiplied by the predetermined error estimate ( ⁇ ) to generate a sensitivity value. Therefore the integrated product (d) of the normalized gain function and the estimated error signal in the steady-state case will be:
- the adaptation loops perform an approximate dB transform ($20\log_{10}(x)$).
- the product which is an estimate of the sensitivity of listeners to the error, is proportional to the error-signal-to-masker ratio, assuming that x serves as the masker.
- the distortion processor 207 comprises an error store 309 which contains the predetermined error estimate.
- the error store 309 comprises a stored predetermined error estimate value for each of the psycho-acoustic bands.
- the predetermined error estimate ( ⁇ ) represents the average temporal envelope of an error signal such as it would look after basilar membrane filtering. It is assumed that the error signal is centered within the pass band of the basilar membrane filter.
- the average temporal envelope of an error signal can be estimated very well in case of a transform coder.
- the quantization noise that is introduced in such a coder is typically smeared out across the whole segment to be encoded following (on average) the envelope of the prototypical filter shape. Only the level (and not the average shape) of the quantization noise depends on the signal that is quantized and the quantization step size used, and it is therefore possible for the temporal distortion processor 211 to use the same fixed error estimate for all the possible quantization options that may be applied.
- the temporal distortion processor 211 comprises a first multiplier 311 which multiplies the predetermined error estimate values and the first weight values from the first weight processor 307 to generate the sensitivity values d (one for each of the psycho-acoustic bands). The multiplied values may then be summed.
- the multiplication of the estimated error values and the weight values enables accurate modeling of the masking effect taking into account the dynamical properties of the input signal. Specifically, the integration of this multiplication over time represents an estimate of the ability of human listeners to hear the introduced error signal.
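- the integration of the weighted error estimate over time can be sketched as a discrete sum scaled by the sample period. The function name `sensitivity_value`, the sample rate and the weight and error values are illustrative assumptions; they only show how a segment with a late transient can yield a larger value than a segment with a steady masker.

```python
def sensitivity_value(weights, error_estimate, fs_hz):
    """Discrete-time integral of weight(t) * error_estimate(t) for one band."""
    if len(weights) != len(error_estimate):
        raise ValueError("weights and error_estimate must have equal length")
    return sum(w * e for w, e in zip(weights, error_estimate)) / fs_hz


fs_hz = 1000.0
error_estimate = [0.01] * 20               # flat error envelope over a 20 ms segment
weights_transient = [2.0] * 10 + [0.1] * 10  # high gain before a mid-segment transient
weights_steady = [0.1] * 20                  # fully adapted loops, low gain throughout

d_transient = sensitivity_value(weights_transient, error_estimate, fs_hz)
d_steady = sensitivity_value(weights_steady, error_estimate, fs_hz)
print(d_transient, d_steady)
```

- the larger value for the transient case reflects the high sensitivity to error energy that is spread into the interval before the transient.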
- the input signal consists of a transient signal that is placed in the middle of the segment under consideration.
- the gain that is determined will be high in the interval just before the transient, because the adaptation loops will not have adapted to the transient signal and the clipping is not active.
- the model will be highly sensitive to the presence of an error signal.
- the level of the error signal will be proportional to the level of the transient that is present within the segment.
- the gain function just before the transient will reflect a sensitivity corresponding to a much lower input signal level.
- the prediction is that the multiplication of the gain function and the estimated error signal just before the onset of the transient will lead to very high values, e.g.
- the gain factor will be reduced considerably, first by the clipping and, if the transient has a sufficiently long sustain, later by the steady-state response of the adaptation loops. Thus, for this part of the segment, much less sensitivity results, in line with insights from psycho-acoustics.
- the gain function will recover slowly. Even in the next encoding segment, the gain function can still be low as a result of the presence of a transient in the previous segment. Therefore, in the next segment, reduced sensitivity to the presence of an error signal can result, which predicts the effect of forward masking as discovered within the field of psycho-acoustics.
- frequency domain sensitivity characteristic depends on the temporal position of a transient of the audio signal within an encoding segment of the audio encoding.
- the perceptual model can adjust its masked threshold (and distortion predictions) depending on the position of the transient within a segment.
- the model provides for the temporal distortion processor 211 modifying the frequency domain sensitivity characteristic in response to a temporal onset delay of a transient of the audio signal relative to the onset of an encoding segment of the audio encoding.
- the frequency domain sensitivity characteristic is dependent on the time envelope of the audio signal within a given frequency interval.
- the adaptation loops will be fully adapted to the input signal and the gain factors will be low corresponding to less sensitivity to the presence of an error signal.
- the above described method is able to accurately predict the masking behavior of dynamically changing signals by constructing a sensitivity index (d) for each psycho-acoustic band using a predetermined estimated error signal.
- the approach can accurately predict how many bits are needed to avoid pre-echos and the method is also able to specify this for each frequency range separately such that extra bits are only allocated in frequency ranges where they are required.
- the temporal distortion processor 211 furthermore implements a modulation model.
- the temporal distortion processor 211 comprises a second weight processor 313 which receives the filtered signal from the adaptation loops and generates a second weight value for the predetermined error estimate.
- the output of the adaptation loops is also used to determine the nature of the envelope modulations that are observed in the input signal. From psycho-acoustical measurements, it is known that signals with very flat temporal envelopes (such as very tonal signals) create comparably less masking than signals with a moderate degree of modulation in the envelope (such as a noise signal).
- the second weight processor 313 evaluates the degree of modulation within each psycho-acoustic band and generates an output weight value which is applied to the sensitivity index from the first multiplier 311. In particular, the signal from the first multiplier 311 is fed to a second multiplier 315 which scales the distortion index in accordance with the second weight value determined by the second weight processor 313.
- $d_m$ is the sensitivity index adjusted to incorporate modulation effects
- $C$ is a calibration constant (a value of 2 is typically appropriate)
- $\sigma_E$ is the standard deviation of the adaptation loop output
- $\mu_E$ is the mean of the adaptation loop output
- $\sigma_N$ is the standard deviation of a band-pass Gaussian-noise envelope
- $\mu_N$ is the mean of a band-pass Gaussian-noise envelope.
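The quantities defined above are ratios of standard deviation to mean of an envelope. The expression that combines them into the adjusted sensitivity index is not reproduced in this text, so the sketch below only computes the normalized modulation measure itself. The helper name `modulation_ratio` and the sample envelopes are hypothetical. The constant `RAYLEIGH_RATIO` is the textbook value for the Rayleigh-distributed envelope of ideal narrow-band Gaussian noise; whether the patent's noise reference equals this idealized value or is measured after the adaptation loops is not stated here.

```python
import math
import statistics


def modulation_ratio(envelope):
    """Return standard deviation divided by mean of an envelope (hypothetical helper)."""
    mean = statistics.fmean(envelope)
    if mean == 0:
        raise ValueError("envelope mean must be non-zero")
    return statistics.pstdev(envelope) / mean


# Hypothetical envelopes: a tonal (flat) signal and a noise-like (fluctuating) signal.
flat = [1.0] * 8
modulated = [0.2, 1.8, 0.4, 1.6, 0.3, 1.7, 0.5, 1.5]

# Textbook reference: std/mean of a Rayleigh-distributed envelope is sqrt(4/pi - 1).
RAYLEIGH_RATIO = math.sqrt(4.0 / math.pi - 1.0)

flat_ratio = modulation_ratio(flat)
modulated_ratio = modulation_ratio(modulated)
```

A flat envelope yields a ratio of zero, which under the reasoning in the text corresponds to weak masking and therefore high sensitivity, while a strongly fluctuating envelope yields a ratio comparable to or above the noise reference.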
- An alternative method for deriving $d_m$ may be based on a band-pass filtering of the envelope modulations resulting from the input audio signal.
- the band-pass filters can be chosen such that they correspond with the modulations that are typically introduced by the quantizing operation in the encoder.
- the modulations are related to the prototypical window that is used. When strong modulations are inherently present in the envelope, sensitivity to the modulations introduced by the quantization step will be low. In contrast, when very few modulations are present, sensitivity will be very high. It will be appreciated that other means or algorithms for modifying the determined sensitivities may be used.
- the determined sensitivity values, $d_m$, are indicative of the human perceptual sensitivity to errors within the individual psycho-acoustic bands.
- the second multiplier 315 is coupled to a subband converter 317 which converts these sensitivities to corresponding sensitivities in the encoding subbands used by the encoder 109 and specifically to the subbands generated by the transform processor 203.
- the band-pass error signals that are introduced by an audio codec are not necessarily limited to a single psycho-acoustic band. Rather, due to the band-pass filtering that takes place in the auditory system, the error signal will not only have an effect on the auditory filter spectrally centered on the error signal, but also in adjacent auditory filters.
- a spreading matrix $S$ can be used as a mapping of the encoding subbands ($b$) to the psycho-acoustic bands and can take into account the spectral spreading that results from the particular window shape (or prototype filter) that is used by the transform processor 203.
- Given the sensitivity indices, $d_m$, which are determined for each psycho-acoustic band, the subband converter 317 determines the sensitivity in the subband domain of the encoder 109 and specifically transforms the sensitivity to a masking curve that is provided from the temporal distortion processor 211 to the frequency distortion processor 213.
- $m_b$ represents the masking curve in encoding subband $b$.
- This masking curve incorporates a number of features that take account of the temporal masking properties of the input signal and of the synthesis window that is used by the transform processor 203. Specifically, the matrix $S$ is determined in response to the transfer spectrum of the prototype filter used for the time domain to frequency domain transform.
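The conversion from psycho-acoustic bands to encoding subbands can be pictured as a matrix product. The exact expression for the masking curve is not reproduced in this text, so the sketch below rests on two loudly stated assumptions: the sensitivity seen in each encoding subband is a spreading-weighted sum of the per-band sensitivities, and the masking value is the reciprocal of that sum. The function name `masking_curve`, the matrix layout and all numbers are hypothetical.

```python
def masking_curve(d_m, spreading):
    """Map per-band sensitivities to per-subband masking values.

    spreading[b][i] is the assumed weight of psycho-acoustic band i as seen
    from encoding subband b. ASSUMPTION: masking = 1 / (spread sensitivity).
    """
    curve = []
    for row in spreading:
        s = sum(w * d for w, d in zip(row, d_m))
        curve.append(1.0 / s if s > 0 else float("inf"))
    return curve


# Hypothetical example: two psycho-acoustic bands, three encoding subbands.
d_m = [4.0, 1.0]
spreading = [
    [1.0, 0.0],  # subband 0 lies entirely in band 0
    [0.5, 0.5],  # subband 1 straddles both bands
    [0.0, 1.0],  # subband 2 lies entirely in band 1
]
m = masking_curve(d_m, spreading)
```

Under these assumptions a highly sensitive band produces a low masking value in the subbands it overlaps, so less quantization noise is tolerated there.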
- the processing performed by the temporal distortion processor 211 is only performed once for each encoding segment and is independent of the specific quantization option applied to the audio signal.
- the masking curve that results from this calculation is provided to the frequency distortion processor 213 and can be re-used for every individual encoding quantization option that is evaluated. Typically this is repeated for many different encoding options for each segment.
- the perceptual distortion calculations are performed for a particular segment based on the masking curve that is provided by the temporal distortion processor 211 and on frequency domain representations of the audio signal and the error signal for the specific quantization option being evaluated.
- the error signal and the audio signal can be provided in the encoding subband domain where the quantization is actually taking place and no requirement for any further conversion is introduced.
- the distortion measure can be determined as follows.
- the error signal energies in the encoding subbands, $n_b$, are divided by the masking curve values, $m_b$, and integrated across the complete spectrum. This result can be understood by considering that the total perceptual distortion, $D$, results from the addition of the perceptual distortions $D_b$ arising in the individual subbands.
- ⁇ is a frequency index referring to the frequency domain data (transform coefficients)
- $\omega_b$ is the starting frequency of a subband used by the proposed model
- A( ⁇ ) the error spectrum (in the transform domain) due to quantization.
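The distortion calculation described above, in which subband error energies are divided by the masking curve and accumulated, can be sketched as follows. It is assumed here that the error energy of a subband is the sum of squared magnitudes of the error spectrum from that subband's starting index up to the next subband's starting index; the function names, band layout and values are hypothetical. The example also illustrates that one masking curve, computed once per segment, can be reused for several quantization options.

```python
def band_energies(error_spectrum, band_starts):
    """Sum |A|^2 over each encoding subband; band_starts holds starting bin indices."""
    edges = list(band_starts) + [len(error_spectrum)]
    return [
        sum(abs(a) ** 2 for a in error_spectrum[lo:hi])
        for lo, hi in zip(edges[:-1], edges[1:])
    ]


def perceptual_distortion(error_spectrum, band_starts, masking):
    """Total distortion: sum over subbands of error energy divided by masking value."""
    energies = band_energies(error_spectrum, band_starts)
    return sum(n_b / m_b for n_b, m_b in zip(energies, masking))


# Hypothetical segment: the masking curve is computed once and then reused.
masking = [0.5, 2.0]
band_starts = [0, 2]
options = {
    "coarse": [1.0, 1.0, 2.0, 2.0],  # error spectrum for a coarse quantizer
    "fine": [0.5, 0.5, 1.0, 1.0],    # error spectrum for a finer quantizer
}
distortions = {
    name: perceptual_distortion(err, band_starts, masking)
    for name, err in options.items()
}
```

Only the error spectrum changes between options, so each additional option costs one pass over the subbands rather than a new time domain analysis.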
- the frequency distortion processor 213 is furthermore arranged to adjust the distortion measure in response to a correlation characteristic between the frequency domain error signal and the audio signal.
- x( ⁇ ) is the spectrum of the original (unquantized) signal
- ⁇ represents the number of subbands of the perceptual model that are taken together in the evaluation of the energy criterion.
- the energy of the difference between the audio signal and the error signal depends on the correlation between both signals.
- the subbands are chosen such that each is one quarter of a critical band, and the number of subbands taken together is chosen to be four, such that the total range that is taken together is one auditory critical band.
- the numerator of the right hand division is then the change in energy in one critical band; the max operator in the denominator is a normalization to ensure that relative energy changes are measured.
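The role of correlation in the energy criterion can be illustrated with a small sketch. The exact expression is not reproduced in this text, so the form used below (absolute energy change over four adjacent subbands, normalized by the larger of the two energies) is an assumption based on the verbal description. The function name `relative_energy_change` and the data, with one value per subband for simplicity, are hypothetical.

```python
def relative_energy_change(x, err, start, width=4):
    """Relative energy change across `width` adjacent subbands beginning at `start`.

    ASSUMED form: |E(x) - E(x - err)| / max(E(x), E(x - err)).
    """
    xs = x[start:start + width]
    es = err[start:start + width]
    orig = sum(abs(v) ** 2 for v in xs)
    coded = sum(abs(v - e) ** 2 for v, e in zip(xs, es))
    denom = max(orig, coded)
    return 0.0 if denom == 0 else abs(orig - coded) / denom


# Hypothetical data: both error signals have the same energy (1.0).
x = [1.0, 1.0, 1.0, 1.0]
correlated_err = [0.5, 0.5, 0.5, 0.5]      # follows the signal
uncorrelated_err = [0.5, -0.5, 0.5, -0.5]  # alternating sign

change_correlated = relative_energy_change(x, correlated_err, 0)
change_uncorrelated = relative_energy_change(x, uncorrelated_err, 0)
```

Although the two error signals carry identical energy, the correlated one changes the energy in the critical band far more, which is the effect the adjustment is meant to capture.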
- the described encoder provides a computationally efficient method for modeling the audibility of the quantization operation that is applied to an audio signal.
- the method takes into account knowledge that is included in the most advanced psycho-acoustical models.
- the model specifically processes the temporal properties of the input signal, such as inherent modulations in the input signal and transient behavior of the input signal, to more accurately determine the audibility of the quantization operation.
- the described approach takes into account correlations between quantization noise and the original signal.
- an accurate prediction of the audibility of pre-echos can be derived, and the proposed model can correctly inform the encoder of how many extra bits are needed in each subband to avoid pre-echos (quantization noise occurring before a transient) becoming audible.
- the masking effect can persist for some time (forward masking).
- the proposed model can inform the encoder of the level of masking that can be expected after the offset of a transient or sustained sound.
- By taking into account the modulation properties of the original signal, the weaker masking capabilities of a tonal signal, with very few inherent modulations, as compared to a noisy signal, with much stronger masking capabilities, can accurately be predicted.
- modulations are determined based on the presence of peaks in the spectrum of the input signal, which is an indirect and inaccurate approach in comparison to the direct observation of the time signal in the present model.
- the invention can be implemented in any suitable form including hardware, software, firmware or any combination of these.
- the invention may optionally be implemented at least partly as computer software running on one or more data processors and/or digital signal processors.
- the elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way. Indeed the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit or may be physically and functionally distributed between different units and processors.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
A distortion processor (207) is arranged to determine a distortion measure for audio encoding of an audio signal. The distortion processor (207) comprises a temporal distortion processor (211) which generates a frequency domain sensitivity characteristic for the audio signal in response to a predetermined error estimate and a time domain analysis of the audio signal by means of a time domain model having a time scale lower than the duration of an analysis segment of the audio encoding. The frequency domain sensitivity characteristic is fed to a frequency distortion processor (213) and may be in the form of a frequency domain masking curve. The frequency distortion processor (213) determines the distortion measure in response to a frequency domain representation of the audio signal, a frequency domain error signal and the frequency domain sensitivity characteristic. The frequency domain sensitivity characteristic is independent of the applied quantization, and different quantization options can be evaluated using frequency domain processing only.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP05108811 | 2005-09-23 | | |
| EP05108811.0 | 2005-09-23 | | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2007034375A2 (fr) | 2007-03-29 |
| WO2007034375A3 (fr) | 2008-12-04 |
Family
ID=37889199
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/IB2006/053261 (Ceased), WO2007034375A2 (fr) | 2005-09-23 | 2006-09-13 | Determination of a distortion measure for audio encoding |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2007034375A2 (fr) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10517701B2 (en) | 2015-01-13 | 2019-12-31 | Align Technology, Inc. | Mandibular advancement and retraction via bone anchoring devices |
| US10537463B2 (en) | 2015-01-13 | 2020-01-21 | Align Technology, Inc. | Systems and methods for positioning a patient's mandible in response to sleep apnea status |
| US10588776B2 (en) | 2015-01-13 | 2020-03-17 | Align Technology, Inc. | Systems, methods, and devices for applying distributed forces for mandibular advancement |
| US20210082447A1 (en) * | 2018-05-30 | 2021-03-18 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Audio Similarity Evaluator, Audio Encoder, Methods and Computer Program |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP1631954B1 (fr) * | 2003-05-27 | 2007-02-14 | Koninklijke Philips Electronics N.V. | Audio coding |
2006
- 2006-09-13 WO PCT/IB2006/053261 patent/WO2007034375A2/fr not_active Ceased
Cited By (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10517701B2 (en) | 2015-01-13 | 2019-12-31 | Align Technology, Inc. | Mandibular advancement and retraction via bone anchoring devices |
| US10537463B2 (en) | 2015-01-13 | 2020-01-21 | Align Technology, Inc. | Systems and methods for positioning a patient's mandible in response to sleep apnea status |
| US10588776B2 (en) | 2015-01-13 | 2020-03-17 | Align Technology, Inc. | Systems, methods, and devices for applying distributed forces for mandibular advancement |
| US11207208B2 (en) | 2015-01-13 | 2021-12-28 | Align Technology, Inc. | Systems and methods for positioning a patient's mandible in response to sleep apnea status |
| US11259901B2 (en) | 2015-01-13 | 2022-03-01 | Align Technology, Inc. | Mandibular advancement and retraction via bone anchoring devices |
| US11376153B2 (en) | 2015-01-13 | 2022-07-05 | Align Technology, Inc. | Systems, methods, and devices for applying distributed forces for mandibular advancement |
| US20210082447A1 (en) * | 2018-05-30 | 2021-03-18 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Audio Similarity Evaluator, Audio Encoder, Methods and Computer Program |
| US12051431B2 (en) * | 2018-05-30 | 2024-07-30 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Audio similarity evaluator, audio encoder, methods and computer program |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2007034375A3 (fr) | 2008-12-04 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US8812308B2 (en) | | Apparatus and method for modifying an input audio signal |
| RU2494477C2 (ru) | | Apparatus and method for generating bandwidth extension output data |
| RU2345506C2 (ru) | | Multi-channel synthesizer and method for generating a multi-channel output signal |
| US8805696B2 (en) | | Quality improvement techniques in an audio encoder |
| US6934677B2 (en) | | Quantization matrices based on critical band pattern information for digital audio wherein quantization bands differ from critical bands |
| JP4212591B2 (ja) | | Audio encoding device |
| EP2030199B1 (fr) | | Linear predictive coding of an audio signal |
| AU2011244268A1 (en) | | Apparatus and method for modifying an input audio signal |
| WO2018069900A1 (fr) | | Audio system and method for the hearing impaired |
| JP2021519949A (ja) | | Apparatus, method or computer program for estimating an inter-channel time difference |
| EP3826011A1 (fr) | | Method for estimating noise in an audio signal, noise estimator, audio encoder, audio decoder and system for transmitting audio signals |
| WO2007034375A2 (fr) | | Determination of a distortion measure for audio encoding |
| Norvell | | Gaussian mixture model based audio coding in a perceptual domain |
| HK1161443B (en) | | Apparatus and method for modifying an input audio signal |
| Bayer | | Mixing perceptual coded audio streams |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 06821083; Country of ref document: EP; Kind code of ref document: A2 |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 06821083; Country of ref document: EP; Kind code of ref document: A2 |