WO2009077950A1 - An adaptive time/frequency-based audio encoding method - Google Patents
An adaptive time/frequency-based audio encoding method
- Publication number
- WO2009077950A1 (application PCT/IB2008/055244)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- frequency
- peakedness
- domain signal
- time
- encoding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
- G10L19/18—Vocoders using multiple modes
- G10L19/22—Mode decision, i.e. based on audio signal content versus external parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
- G10L19/18—Vocoders using multiple modes
- G10L19/20—Vocoders using multiple modes using sound class specific coding, hybrid encoders or object based coding
Definitions
- the invention relates to an adaptive time/frequency-based audio encoding method for encoding an input signal that is divided into a plurality of frequency-domain signals into an output data stream, said encoding method comprising encoding each frequency-domain signal in one of a time-based encoding mode or a frequency-based encoding mode. Further the invention relates to an adaptive time/frequency-based audio encoder for encoding an input signal into an output data stream, and a computer program product.
- speech codecs work well for speech and audio codecs well for audio. From a coding efficiency point of view speech signals are characterized by the voiced speech parts that are generated by an excitatory signal that is filtered by the vocal tract. Speech coders effectively regenerate the excitation signal and the filtering. The parameters for this regeneration process form a very efficient representation of the speech signal. The signal is effectively represented as a time domain signal corresponding nicely to the speech production process. Therefore, speech coders are often termed as time domain coders (TDC). Audio signals on the other hand vary relatively slowly over time, in contrast to speech, and often consist of tonal components that are stable across longer temporal intervals.
- TDC time domain coders
- the excitatory nature of the speech signal is represented in the neural encoding of an audio signal and any modification, i.e. temporal smearing, is highly perceptible. This is likely to occur due to the quantization in the spectral domain that is performed by the audio coder.
- Speech coders are not very suitable for encoding audio because of the constant spectral lines that occur in tonal music signals. The spectral resolution of speech coders at low frequencies is too poor to capture these components well. In addition, the structure of spectral lines at higher frequencies can create a characteristic modulation pattern, due to beating between adjacent components, that cannot be modeled well by the excitation-based speech coder. To benefit from the advantages of both audio coders and speech coders, systems for joint audio and speech coding have been proposed. One such system has been disclosed in the patent application US2007/0106502.
- an adaptive time/frequency-based audio encoder which obtains high compression efficiency by making efficient use of encoding gains of two encoding methods in which a frequency-domain transform is performed on input audio data such that time-based encoding is performed on a band of the audio data suitable for voice compression and frequency-based encoding is performed on remaining bands of the audio data.
- the proposed encoder comprises a transformation and mode determination unit to divide an input audio signal into a plurality of frequency-domain signals and to select a time-based encoding mode or frequency-based encoding mode for each respective frequency-domain signal, an encoding unit to encode each frequency-domain signal in the respective encoding modes selected by the transformation and mode determination unit, and a bitstream output unit to output encoded data, division information, and encoding mode information for each encoded frequency-domain signal.
- the encoding mode to be used for the respective frequency-domain signal is determined based on at least one of a linear coding gain, a spectral change between linear prediction filters of adjacent frames, a predicted pitch delay, and a predicted long-term prediction gain.
- spectral measures comprise: a linear predictive coding gain, a spectral change between linear predictive filters of consecutive frames, or a spectral tilt (first reflection coefficient).
- energy measures comprise: signal energy, or a change in signal energy between sub frames.
- Said long-term prediction estimates comprise: an estimated pitch delay, or estimated prediction gains.
- This object is achieved by selecting the time-based encoding mode or the frequency-based encoding mode for the respective frequency-domain signal based on a peakedness of said respective frequency-domain signal. Said peakedness relates much better to the strengths of the time-based and frequency-based encoding than spectral measures, energy measures, or long-term prediction measures do. Therefore, the resulting speech/audio quality is improved. In other words, the choice of the encoding mode is better tailored to the actual content of the frequency-domain signal. In particular, if a signal has a high peakedness, these peaks are typically perceptually relevant and the quantized signal is associated with a low bit rate since the majority of values are quantized to the zero level.
- the proposed measure has a direct relation to the essential property of the coder and provides a high quality at a low bit rate.
- the measures used so far are more indirect, involve more parameters, require complex inference rules and are thus more prone to non-optimal decisions.
- the peakedness of the frequency-domain signal is a spectral peakedness.
- the frequency domain signal may be the absolute value of the Fourier transform of a signal, the discrete cosine transform DCT or related representations like the MDCT or MLT.
- the strong tonal components appear as large peaks in the representation and a high spectral peakedness is thus indicative of steady tonal music which can most efficiently be encoded by a frequency domain coder.
- the spectral peakedness can be established for different bands of the entire frequency spectrum.
- the frequency-based encoding mode is selected when the spectral peakedness exceeds a predetermined threshold. If the spectral peakedness of a particular frequency band is sufficiently high, the decision can be made to use a frequency domain encoding method for this band.
- the advantage of using spectral peakedness is its direct coupling to the efficiency of a frequency-domain encoding method.
- the predetermined threshold takes on a value of a temporal peakedness corresponding to a time-representation of said respective frequency-domain signal.
- the decision about using one or another encoding mode is made according to most dominant peakedness measure. Thus if the spectral peakedness is larger than the temporal peakedness the frequency-based encoding mode is chosen, otherwise the time-based encoding is chosen.
- the peakedness of the frequency-domain signal is a temporal peakedness corresponding to a time-representation of said frequency-domain signal.
- the temporal peakedness measure is determined from the time-domain representation of the frequency domain signal corresponding to a frequency band.
- This time-domain representation can be further pre-processed.
- spectral flattening yields a good signal on which the temporal peakedness can best be established.
- the flattening stage may be omitted.
- the time-based encoding mode is selected when the temporal peakedness exceeds a predetermined threshold.
- the predetermined threshold takes on a value of a spectral peakedness of said respective frequency-domain signal. In such a case, one can consider whether a time-domain peakedness or the frequency-domain peakedness is highest and apply the appropriate mode of encoding to the components of this frequency band. In doing so, the information concerning both measures is balanced in the final mode decision thus arriving at the optimal decision.
- selecting a time based encoding mode or a frequency-based encoding mode for each respective frequency-domain signal is further based on at least one of spectral measures, energy measures, or long-term prediction estimates. Since there are inevitably signal pieces where the temporal and spectral peakedness are almost equal, it is of advantage to use other sources of information to arrive at a better decision concerning the mode. This may be done by using in these cases the more indirect decision criteria as have been disclosed in S. A. Ramprashad, The multimode transform predictive coding paradigm, IEEE Trans. Speech Audio Process, 11 (2): 117-129, March 2003.
- the division information is comprised in the output data stream. This is especially advantageous when the division into the plurality of frequency-domain signals varies over time and needs to be communicated to the decoder in order to allow an appropriate decoding.
- the invention further provides an encoder as well as a computer program product enabling a programmable device to perform the encoding and/or decoding method according to the invention.
- Fig. 1 shows a flow chart for an adaptive time/frequency-based audio encoding method for encoding an input signal into an output data stream in accordance with the invention
- Fig. 2 shows a representation of a plurality of frequency-domain signals and their corresponding time-domain representations, together with the peakedness measure corresponding to each of the respective frequency-domain signals;
- Fig. 3 shows example architecture of an adaptive time/frequency-based audio encoder for encoding an input signal into an output data stream in accordance with the invention
- Fig. 4 shows an example block diagram illustrating transformation and mode determination unit of the adaptive time/frequency-based audio encoder
- Fig. 5 schematically shows an example of an encoding mode determination unit that determines whether time-based encoding mode or a frequency-based encoding mode is to be used for the respective frequency-domain signal.
- Fig. 1 shows a flow chart for an adaptive time/frequency-based audio encoding method for encoding an input signal into an output data stream in accordance with the invention.
- the proposed encoding method can be summarized to comprise the following steps.
- First step 110 comprises dividing an input audio signal into a plurality of frequency-domain signals and selecting a time-based encoding mode or a frequency-based encoding mode for each respective frequency-domain signal.
- Second step 120 comprises encoding each frequency-domain signal in the respective encoding mode.
- Third step 130 comprises combining encoded data and encoding mode information of each respective frequency domain signal into an output data stream.
- selecting a time-based encoding mode or a frequency-based encoding mode for each respective frequency-domain signal is based on peakedness of said respective domain signal.
- the input signal is transformed into frequency domain. Subsequently, it is divided into a plurality of the frequency domain signals.
- Said frequency-domain signals correspond to e.g. frequency bands.
- Said frequency bands can be fixed, i.e. the thresholds separating the bands are fixed.
- the division in bands is preferably logarithmic or linear.
- the band sizes can vary from band to band, or they can be of the same size.
- the number of bins can be arbitrary.
- the number of bands should be determined depending on the actual audio content.
- Coding of audio or speech data is typically performed in frames of input data in order to be able to track or adapt to the possibly time-varying character of the input signal.
- Said frame sizes can be fixed or they can vary over time.
- the invention focuses on the issue of selecting the encoding mode that is suitable for the frequency-domain signal.
- Said frequency domain signal is defined as the frequency signal corresponding to a frequency band of e.g. a frame.
- the peakedness of an audio signal relates much better to the strengths of the time-based and frequency-based encoding than do the spectral measures, energy measures, or long-term prediction measures indicated in S. A. Ramprashad, The multimode transform predictive coding paradigm, IEEE Trans. Speech Audio Process, 11 (2):117-129, March 2003, or in the patent application US2007/0106502. Therefore, the resulting speech/audio quality and the corresponding encoding efficiency are improved. In other words, the choice of the encoding mode is better tailored to the actual content of the frequency-domain signal.
- each of the plurality of frequency-domain signals is encoded according to the encoding mode selected for it.
- the encoded data for each of the plurality of frequency-domain signals is combined with its respective encoding mode into the output data stream.
- the peakedness of the frequency-domain signal is a spectral peakedness.
- the spectral measure essentially determines the amount of samples largely deviating from a zero or a mean value relative to the amount of samples close to the zero or mean value for a frequency-domain representation of the input signal or a preprocessed version thereof. If only a small number of large values are present in the signal, these are perceptually very relevant and the remaining large number of small values can be efficiently compressed thus arriving at a high-quality efficient code. It is then advantageous to encode such frequency-domain signal with the frequency-based encoding.
- the constituent encoders i.e. time-based encoding and frequency-based encoding, operate by segmenting the input signal into frames and encoding the resulting signal frames. It is efficient to determine the peakedness by considering intervals defined within the frame of the input signal and calculating the spectral peakedness for the entire frame from the results obtained for said intervals.
- the regular intervals allow the dynamics of the input signal to be followed closely. From the point of view of the implementation it is preferable that the regular intervals coincide with the shortest update that occurs in the frequency and time domain encoding. However, the intervals over which the spectral peakedness is to be calculated could be even smaller than the shortest frames used by the time and frequency encoding.
- the spectral peakedness is expressed as: whereby X(k) is the respective frequency-domain signal for a frequency interval with a width F.
- Said spectral-domain representation is real valued and comprises e.g. MDCT coefficients or absolute values of complex amplitude values resulting from a Discrete Fourier Transform or any other frequency-domain representation like MDCT or MLT. The advantage of this particular measure is its simplicity of calculation.
- Peakedness refers here to any measure that correlates with the degree of presence of peaks in a signal (spectral or temporal).
- Various measures are known to be used for this purpose.
- the normalized fourth moment that is used to measure the degree of fluctuations in an envelope signal (cf. Hartmann and Pumplin, "Noise power fluctuations and masking of sine signals," 1988, J. Acoust. Soc. Am., Vol. 83, pp. 2277-2289) can be used.
- This is also the basis for the formula for the spectral peakedness given above.
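The expression for the spectral peakedness is not reproduced in this text. As a hedged sketch only, and assuming the measure is the plain normalized fourth moment mentioned above, it could be written as follows. The symbol $P_s$ and the summation limits are assumptions introduced for illustration, and the published formula may use a different normalization, so the example threshold values quoted in this document need not carry over to this form.

```latex
P_s \;=\; \frac{\dfrac{1}{F}\sum_{k=1}^{F} X(k)^{4}}
               {\left(\dfrac{1}{F}\sum_{k=1}^{F} X(k)^{2}\right)^{2}}
```

With this form a band whose values are all equal in magnitude gives $P_s = 1$, while a band with a single non-zero value among $F$ gives $P_s = F$.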
- the kurtosis measure can be used as a measure of spectral peakedness.
- a common spectral flatness measure is the ratio of the geometric mean of the bins of the magnitude spectrum to the arithmetic mean of the same bins.
- Various other statistical methods are feasible to define a measure for spectral peakedness.
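Two of the measures named above can be sketched in a few lines. This is an illustrative sketch, not the method as claimed: the function names, the `eps` guard, and the convention of returning 0 for an all-zero input are assumptions, and the normalized fourth moment is taken in its plain textbook form.

```python
import numpy as np

def normalized_fourth_moment(x):
    """Mean of x**4 divided by the squared mean of x**2 (assumed form)."""
    x = np.asarray(x, dtype=float)
    power = np.mean(x ** 2)
    if power == 0.0:
        return 0.0  # convention chosen here for an all-zero input
    return float(np.mean(x ** 4) / power ** 2)

def spectral_flatness(magnitudes, eps=1e-12):
    """Geometric mean over arithmetic mean of the magnitude bins."""
    m = np.asarray(magnitudes, dtype=float) + eps  # eps avoids log(0)
    return float(np.exp(np.mean(np.log(m))) / np.mean(m))
```

A flat band scores low on the fourth-moment measure and high on flatness; a band with one dominant bin does the opposite, which is why flatness would have to be inverted before it could serve as a peakedness measure.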
- the frequency-based encoding mode is selected when the spectral peakedness exceeds a predetermined threshold. If the spectral peakedness of a particular frequency band is sufficiently high, the decision is made to use a frequency domain encoding method for this band as it results in a high audio/speech quality at the high encoding efficiency.
- a value of 1.65 is an example for the predetermined threshold. This value has been found to be giving good results in tests, however other values are also possible.
- the predetermined threshold takes on a value of a temporal peakedness corresponding to a time-representation of said respective frequency-domain signal.
- the decision about using one or another encoding mode is made according to most dominant peakedness measure. Thus if the spectral peakedness is larger than the temporal peakedness the frequency-based encoding mode is chosen, otherwise the time-based encoding is chosen.
- the peakedness of the frequency-domain signal is a temporal peakedness corresponding to a time-representation of said frequency-domain signal.
- said frequency-domain signal is transformed to the time domain.
- the temporal peakedness is determined.
- the temporal measure essentially determines the amount of samples largely deviating from a zero or a mean value relative to the amount of samples close to the zero or mean value for said time-domain representation of the frequency-domain signal.
- the temporal peakedness is expressed as:
- x_w(n) is a temporal signal representation of the respective frequency-domain signal across the interval, where N is the number of samples of the associated time-domain signal.
- the size of the interval can vary. However, it is advantageous to determine the peakedness measure across a number of short, overlapping intervals within a frame rather than calculating the peakedness measure across the frame. It is also possible to divide the temporal signal representation into subintervals, and subsequently to determine the peakedness per subinterval and combine (e.g. a max operator) these peakedness values into a single peakedness value for the entire temporal signal representation.
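The interval-based evaluation described above can be sketched as follows. The window length, hop size, and function names are hypothetical, and the per-interval measure is assumed to be a plain normalized fourth moment; only the use of short overlapping intervals and the max combination come from the text.

```python
import numpy as np

def normalized_fourth_moment(x):
    """Assumed per-interval peakedness: mean(x^4) / mean(x^2)^2."""
    x = np.asarray(x, dtype=float)
    power = np.mean(x ** 2)
    return 0.0 if power == 0.0 else float(np.mean(x ** 4) / power ** 2)

def frame_temporal_peakedness(x, win=8, hop=4):
    """Peakedness of a frame as the max over short overlapping intervals."""
    x = np.asarray(x, dtype=float)
    if x.size < win:
        # Frame shorter than one interval: fall back to the whole frame.
        return normalized_fourth_moment(x)
    starts = range(0, x.size - win + 1, hop)
    return max(normalized_fourth_moment(x[s:s + win]) for s in starts)
```

Using a max means that a single peaked interval is enough to mark the whole frame as peaked; a mean or median would be a more conservative combination.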
- the temporal peakedness measure is determined from the time-domain representation of the frequency-domain signal.
- Said frequency-domain signal preferably corresponds to a frequency band of the input signal.
- This time-domain representation can be a pre-processed signal obtained by transforming the respective frequency-domain signal to the time domain.
- spectral flattening yields a good signal on which the temporal peakedness can best be established.
- the flattening stage may be omitted.
- the time-based encoding mode is selected when the temporal peakedness exceeds a predetermined threshold. If the temporal peakedness of a particular frequency band is sufficiently high, the decision is made to use a time-based encoding mode for this band as it results in a high audio/speech quality at the high encoding efficiency.
- a value of 1.7 is an example for the predetermined threshold. This value has been found to give good results in tests; however, other values are also possible.
- the predetermined threshold takes on a value of a spectral peakedness of said respective frequency-domain signal.
- the decision about using one or another encoding mode is made according to most dominant peakedness measure.
- Thus, if the temporal peakedness is larger than the spectral peakedness, the time-based encoding mode is chosen; otherwise the frequency-based encoding is chosen.
- Fig. 2 shows a representation of a plurality of frequency-domain signals and their corresponding time-domain representations, together with the peakedness measure corresponding to each of the respective frequency-domain signals.
- the top plot 210 in Fig. 2 depicts the frequency domain representation of the input signal.
- Said frequency domain representation is obtained by e.g. Fourier-based transform or filter bank applied on the input signal.
- a chunk of 512 samples of the input signal has been windowed and transformed using a critically sampled filter bank.
- Said frequency domain representation of said chunk of the input signal has been divided into 14 frequency-domain signals each of which corresponds to a respective frequency band.
- Said frequency bands are determined in a logarithmic manner, resulting in a small band size for lower frequencies and a large band size for higher frequencies.
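A logarithmic division of this kind can be sketched as below. The text does not say how the band edges are computed, so the geometric spacing, the rounding, and the choice of 256 coefficients in the usage note are all assumptions made for illustration.

```python
import numpy as np

def log_band_edges(n_bins, n_bands):
    """Return n_bands + 1 strictly increasing bin indices from 0 to n_bins."""
    if not 1 <= n_bands <= n_bins:
        raise ValueError("need 1 <= n_bands <= n_bins")
    # Geometric spacing between bin 1 and the last bin, rounded to integers.
    edges = np.round(np.geomspace(1.0, n_bins, n_bands + 1)).astype(int)
    edges[0] = 0  # the first band starts at the lowest bin
    # Rounding can collapse the narrow low bands; keep every band non-empty.
    for i in range(1, n_bands + 1):
        edges[i] = max(edges[i], edges[i - 1] + 1)
    if edges[-1] != n_bins:
        raise ValueError("too many bands for this number of bins")
    return edges
```

For example, `log_band_edges(256, 14)` yields 14 bands that are a single bin wide at the low end and several tens of bins wide at the high end; band `i` then covers coefficients `edges[i]` up to but not including `edges[i + 1]`.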
- RMS Root-Mean- Square
- the normalized fourth moment correlation is calculated as a spectral measure.
- the values of the spectral measure per frequency band are multiplied by two and are indicated in the top plot 210 of Fig. 2 by a circle or a star.
- the bottom plot 220 in Fig. 2 depicts a time domain representation of the respective frequency-domain signals.
- the real valued data within each band i.e. for each frequency-domain signal
- the first component of the vector corresponds to real valued first data point in said frequency-domain signal.
- the real part of the second vector component is the second data point, and the imaginary part is the third data point.
- the real part of the third vector component is the fourth data point, and the imaginary part is fifth data point, etc.
- the complex vector created in this way is assumed to represent the positive frequency half of a real valued signal. Alternatively other ways of constructing said vector could be used.
- said vector is transformed to the time domain using an inverse Fourier transform.
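The construction described above can be sketched with NumPy's real-input inverse FFT. The function name is hypothetical, and two details are assumptions not stated in the text: a missing final imaginary part is padded with zero, and the output length is set equal to the number of data points in the band.

```python
import numpy as np

def band_to_time(band):
    """Map real-valued band data to a time-domain signal.

    The first value becomes the real-only first component; the remaining
    values are paired as (real, imaginary) parts of the later components.
    """
    band = np.asarray(band, dtype=float)
    if band.size == 0:
        raise ValueError("band must contain at least one value")
    rest = band[1:]
    if rest.size % 2:
        rest = np.append(rest, 0.0)  # assumed: pad the last imaginary part
    spec = np.concatenate(([band[0] + 0j], rest[0::2] + 1j * rest[1::2]))
    # spec is treated as the positive-frequency half of a real signal.
    return np.fft.irfft(spec, n=band.size)
```

Choosing `n=band.size` keeps the number of time samples equal to the number of band values, so for an odd-length band the mapping can be inverted with `np.fft.rfft`.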
- the time domain representation is more peaked than its corresponding frequency domain representation.
- a star is used to indicate the temporal peakedness in the bottom plot and the circle is used to indicate the spectral peakedness in the corresponding band in the top plot.
- the decision about which of the encoding modes should be used is based on the relation between the spectral and temporal peakedness for said frequency-domain signal.
- if the spectral peakedness is larger than the temporal peakedness for the respective frequency-domain signal, the frequency-based encoding is used.
- if the temporal peakedness is larger than the spectral peakedness for the respective frequency-domain signal, the time-based encoding is used.
- the frequency-domain signal has a single pronounced peak in this band, while the time domain representation of this frequency-domain signal has a rather balanced variation of values approximately around 0.8. Since the spectral peakedness is clearly larger than the temporal peakedness for said band, the frequency-based encoding is used to encode the frequency-domain signal corresponding to this band.
- the time-based encoding is used to encode the frequency-domain signal when the temporal peakedness is larger than the spectral peakedness for said frequency-domain signal and when additionally the temporal peakedness is larger than the predetermined threshold taking a value of e.g. 1.7. Otherwise, the frequency-based encoding is used to encode the frequency-domain signal.
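The rule above reduces to a two-condition test. In this minimal sketch the function name and the string labels are hypothetical; the default threshold of 1.7 is the example value quoted in the text.

```python
def select_mode(spectral_peakedness, temporal_peakedness, threshold=1.7):
    """Return "time" or "frequency" for one frequency-domain signal."""
    time_dominates = temporal_peakedness > spectral_peakedness
    peaked_enough = temporal_peakedness > threshold
    # Time-based encoding needs both conditions; everything else
    # falls back to frequency-based encoding.
    return "time" if time_dominates and peaked_enough else "frequency"
```

Note the asymmetry: a band whose temporal peakedness exceeds its spectral peakedness but stays below the threshold is still encoded in the frequency-based mode.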
- selecting a time based encoding mode or a frequency-based encoding mode for each respective frequency-domain signal is further based on at least one of spectral measures, energy measures, or long-term prediction estimates.
- spectral measures comprise: LP coding gain, the spectral change between LP filters of consecutive frame, or the like.
- the energy measures comprise: the signal energy, the change in signal energy between subframes, or the like.
- the long-term prediction estimates comprise: estimated pitch delay, estimated long-term prediction gains, or the like. Said measures are extensively discussed in S. A. Ramprashad, The multimode transform predictive coding paradigm, IEEE Trans. Speech Audio Process, 11 (2): 117-129, March 2003.
- the decision about the use of a specific encoding mode is then made as follows.
- the time-based encoding is used to encode the frequency-domain signal when a combination (e.g. a weighted sum) of the temporal peakedness and the pitch corresponding to a frequency-domain signal is larger than the spectral peakedness for said frequency-domain signal and when additionally the temporal peakedness is larger than the predetermined threshold taking a value of e.g. 1.7. Otherwise, the frequency-based encoding is used to encode the frequency-domain signal.
- a combination e.g. a weighted sum
- the frequency-based encoding is used to encode the frequency-domain signal.
- Various other options can be used to improve the encoding mode selection.
- constraints are imposed on e.g. tilt, or energy of frequency-domain signal.
- Said constraint takes a form of e.g. a threshold limitation or other more sophisticated form.
- Said mode determining means can also be used to determine the optimal division of the input signal into the plurality of the frequency-domain signals. For instance, considering the method where the encoding method is determined by the absolute or relative difference between the spectral and temporal peakedness of the divided frequency-domain signal, the division can be determined as the one which maximizes said difference in some sense.
- an indicator for the division information is comprised in the output data stream. Said indicator is for example a code for specific division information of the input signal, or an address of a device, e.g. a server on the Internet, wherefrom said division information can be retrieved.
- a decoder based on said indicator can be configured to operate according to the division information used to produce the data stream to be decoded.
- Fig. 3 shows example architecture of an adaptive time/frequency-based audio encoder 300 for encoding an input signal into an output data stream in accordance with the invention.
- Said encoder comprises a transformation and mode determination unit 310, an encoding unit 320, and a merger 330.
- the transformation and mode determination unit 310 divides the input audio signal into a plurality of frequency-domain signals and selects the time-based encoding mode or the frequency-based encoding mode for each respective frequency-domain signal. Then, the transformation and mode determination unit 310 outputs a frequency-domain signal 321 determined to be encoded in the time-based encoding mode, a frequency-domain signal 322 determined to be encoded in the frequency-based encoding mode, and encoding mode information 331 for each frequency-domain signal.
- dividing the input signal and the encoding mode selection is performed in a single unit 310, however, a separate functional unit (implemented in hardware or software) could be assigned to perform each of these functions.
- the encoding unit 320 encodes each frequency-domain signal in the respective encoding modes selected by the transformation and mode determination unit 310.
- the unit 320 performs time-based encoding on the frequency-domain signal 321 and performs frequency-based encoding on the frequency-domain signal 322.
- the encoding unit 320 outputs encoded data 333 on which the time-based encoding has been performed and encoded data 334 on which the frequency-based encoding has been performed.
- the merger 330 combines encoded data 333 and 334, and encoding mode information 331 for each respective encoded frequency-domain signal to produce the output data stream 341.
- Fig. 4 shows an example block diagram illustrating transformation and mode determination unit 310 of the adaptive time/frequency-based audio encoder 300.
- the transformation and mode determination unit 310 comprises a frequency-domain transform unit 400 and an encoding mode determination unit 410.
- the frequency-domain transform unit 400 transforms the input audio signal 311 into a full frequency-domain signal 421.
- Said full frequency-domain signal 421 has a frequency spectrum such as e.g. the one illustrated in the top plot 210 of Fig. 2.
- Said frequency-domain representation is obtained by using e.g. Fourier-based transform or filter bank applied on the input signal.
- the encoding mode determination unit 410 divides the full frequency-domain signal 421 into a plurality of frequency-domain signals according to a preset standard and selects either the time-based encoding mode or the frequency-based encoding mode for each frequency-domain signal based on peakedness of said frequency-domain signal.
- the encoding mode determination unit 410 outputs the frequency-domain signal 321 determined to be encoded in the time-based encoding mode, the frequency-domain signal 322 determined to be encoded in the frequency-based encoding mode, the encoding mode information 331, and when required the division information 332 for each frequency-domain signal.
- Fig. 5 schematically shows an example of an encoding mode determination unit 410 that determines whether time-based encoding mode or a frequency-based encoding mode is to be used for the respective frequency-domain signal. Said unit 410 determines an encoding mode based on a spectral peakedness and temporal peakedness of the input signal 421.
- the input signal 421 is fed into a signal selector 511, which outputs the frequency-domain signal 422 corresponding to e.g. the selected frequency band.
- Said frequency-domain signal 422 is further fed into the unit 514 which calculates the spectral peakedness of said frequency-domain signal 422 based e.g. on the normalized fourth moment correlation (as discussed before).
- the frequency-domain signal 422 is fed into an inverse transform unit 512 that transforms said frequency-domain signal into time-domain signal.
- Unit 513 derives a temporal peakedness measure for the signals provided from the unit 512.
- Said temporal peakedness is calculated using e.g. the normalized fourth moment correlation (as discussed before).
- said temporal peakedness is determined across a number of short, overlapping intervals within a frame rather than calculating the peakedness measure across the frame at once.
- the temporal and spectral peakedness measures obtained from units 513 and 514 are fed into a unit 515, which combines these two measures to make a decision about the value of the predetermined frequency.
- the predetermined frequency can be determined in many ways by means of a formula or by means of heuristics. Alternatively other ways of assessing the dominance of the spectral or temporal peakedness can be used that utilize a formula or heuristics. Below one of the heuristics which can be used to determine the encoding mode is described.
- the time-based encoding is used to encode the frequency-domain signal when the temporal peakedness is larger than the spectral peakedness for said frequency-domain signal and when additionally the temporal peakedness is larger than the predetermined threshold taking a value of e.g. 1.7. Otherwise, the frequency-based encoding is used to encode the frequency-domain signal.
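The data flow through units 511 to 515 can be sketched end to end as below. This is a sketch under stated assumptions rather than the encoder itself: the peakedness is taken as a plain normalized fourth moment, the inverse transform follows the real/imaginary pairing described for Fig. 2 with zero padding, and all function names are hypothetical.

```python
import numpy as np

def peakedness(x):
    """Assumed measure: normalized fourth moment, 0 for an all-zero input."""
    x = np.asarray(x, dtype=float)
    power = np.mean(x ** 2)
    return 0.0 if power == 0.0 else float(np.mean(x ** 4) / power ** 2)

def band_to_time(band):
    """Inverse transform of one band (role of unit 512), assumed pairing."""
    band = np.asarray(band, dtype=float)
    rest = band[1:]
    if rest.size % 2:
        rest = np.append(rest, 0.0)
    spec = np.concatenate(([band[0] + 0j], rest[0::2] + 1j * rest[1::2]))
    return np.fft.irfft(spec, n=band.size)

def decide_modes(spectrum, edges, threshold=1.7):
    """Return one mode label per band of a real-valued spectrum."""
    spectrum = np.asarray(spectrum, dtype=float)
    modes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = spectrum[lo:hi]                    # signal selector 511
        spectral = peakedness(np.abs(band))       # unit 514
        temporal = peakedness(band_to_time(band)) # units 512 and 513
        is_time = temporal > spectral and temporal > threshold  # unit 515
        modes.append("time" if is_time else "frequency")
    return modes
```

As a quick check of the intended behaviour, a band with a flat spectrum corresponds to an impulsive time signal and is assigned the time-based mode, whereas a band containing a single strong coefficient corresponds to a steady tone and is assigned the frequency-based mode.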
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The invention proposes an enhanced adaptive time/frequency-based audio encoding method for encoding an input signal into an output data stream that allows a high efficiency encoding at a high audio or speech quality. The invention proposes selecting the time-based encoding mode or the frequency-based encoding mode for the respective frequency-domain signal of the plurality of the frequency-domain signals pertaining to the input signal based on a peakedness of said respective frequency-domain signal.
Description
An adaptive time/frequency-based audio encoding method
TECHNICAL FIELD
The invention relates to an adaptive time/frequency-based audio encoding method for encoding an input signal that is divided into a plurality of frequency-domain signals into an output data stream, said encoding method comprising encoding each frequency-domain signal in one of a time-based encoding mode or a frequency-based encoding mode. Further the invention relates to an adaptive time/frequency-based audio encoder for encoding an input signal into an output data stream, and a computer program product.
TECHNICAL BACKGROUND
It is known that speech codecs work well for speech and audio codecs work well for audio. From a coding efficiency point of view, speech signals are characterized by the voiced speech parts that are generated by an excitatory signal that is filtered by the vocal tract. Speech coders effectively regenerate the excitation signal and the filtering. The parameters for this regeneration process form a very efficient representation of the speech signal. The signal is effectively represented as a time domain signal corresponding nicely to the speech production process. Therefore, speech coders are often termed time domain coders (TDC). Audio signals on the other hand vary relatively slowly over time, in contrast to speech, and often consist of tonal components that are stable across longer temporal intervals. Representing such an audio signal as a frequency domain signal (via transform coding) results in a very sparse signal, where most signal (transform) components are not relevant and only the most important information needs to be stored. This is also highly efficient in terms of bit rate. Therefore, such an audio coder is often termed a frequency domain coder (FDC).
From an audio quality point of view, speech signals pose a challenge for audio coders because of the fast rate of change of speech signals. There can be substantial changes within the course of a single analysis frame of the audio coder. These changes are tracked well by the auditory system, especially at mid and high frequencies (> 500 Hz), and modifications of these dynamics are highly audible. In addition, the excitatory nature of the speech signal is represented in the neural encoding of an audio signal and any modification, i.e. temporal smearing, is highly perceptible. This is likely to occur due to the quantization in the spectral domain that is performed by the audio coder.
Speech coders are not very suitable for encoding audio because of the constant spectral lines that occur in tonal music signals. Spectral resolution of speech coders at low frequencies is too poor to capture these components well. In addition, the structure of spectral lines at higher frequencies can create a characteristic modulation pattern due to beating patterns between adjacent components that cannot be modeled well by the excitatory-based speech coder.
To benefit from the advantages of both audio coders and speech coders, systems for joint audio and speech coding have been proposed. One such system has been disclosed in the patent application US2007/0106502. In this application an adaptive time/frequency-based audio encoder has been described, which obtains high compression efficiency by making efficient use of encoding gains of two encoding methods in which a frequency-domain transform is performed on input audio data such that time-based encoding is performed on a band of the audio data suitable for voice compression and frequency-based encoding is performed on remaining bands of the audio data. The proposed encoder comprises a transformation and mode determination unit to divide an input audio signal into a plurality of frequency-domain signals and to select a time-based encoding mode or frequency-based encoding mode for each respective frequency-domain signal, an encoding unit to encode each frequency-domain signal in the respective encoding modes selected by the transformation and mode determination unit, and a bitstream output unit to output encoded data, division information, and encoding mode information for each encoded frequency-domain signal. The encoding mode to be used for the respective frequency-domain signal is determined based on at least one of a linear coding gain, a spectral change between linear prediction filters of adjacent frames, a predicted pitch delay, and a predicted long-term prediction gain.
These signal parameters have originally been disclosed in S. A. Ramprashad, The multimode transform predictive coding paradigm, IEEE Trans. Speech Audio Process, 11 (2):117-129, March 2003. In this publication a multimode transform predictive coding has been proposed which uses three types of measures, namely, spectral measures, energy measures, and long-term prediction estimates in order to choose the most appropriate encoding mode. Said spectral measures comprise: a linear predictive coding gain, a spectral change between linear predictive filters of consecutive frames, or a spectral tilt (first reflection coefficient). Said energy measures comprise: signal energy, or a change in signal energy between subframes. Said long-term prediction estimates comprise: an estimated pitch delay, or estimated prediction gains.
The disadvantage of the measures used to determine the encoding mode as discussed above is that they might result in inappropriate choices of an encoding mode for respective frequency-domain signals. This in turn results in lower encoding efficiency and a lower perceived audio or speech quality.
SUMMARY OF THE INVENTION
It is an object of the invention to provide an enhanced adaptive time/frequency-based audio encoding method for encoding an input signal into an output data stream that allows a high efficiency encoding at a high audio or speech quality.
This object is achieved by selecting the time-based encoding mode or the frequency-based encoding mode for the respective frequency-domain signal based on a peakedness of said respective frequency-domain signal. Said peakedness relates much better to the strengths of the time-based and frequency-based encoding than spectral measures, energy measures, or long-term prediction measures do. Therefore, the resulting speech/audio quality is improved. In other words, the choice of the encoding mode is better tailored to the actual content of the frequency-domain signal. In particular, if a signal has a high peakedness, these peaks are typically perceptually relevant and the quantized signal is associated with a low bit rate since the majority of values are quantized to the zero level. In this way, the proposed measure has a direct relation to the essential property of the coder and provides a high quality at a low bit rate. The measures used so far are more indirect, involve more parameters, require complex inference rules and are thus more prone to non-optimal decisions.
In an embodiment, the peakedness of the frequency-domain signal is a spectral peakedness. In this case, the frequency domain signal may be the absolute value of the Fourier transform of a signal, the discrete cosine transform (DCT), or related representations such as the MDCT or MLT. In this domain, the strong tonal components appear as large peaks in the representation and a high spectral peakedness is thus indicative of steady tonal music which can most efficiently be encoded by a frequency domain coder. The spectral peakedness can be established for different bands of the entire frequency spectrum.
In a further embodiment, the frequency-based encoding mode is selected when the spectral peakedness exceeds a predetermined threshold. If the spectral peakedness of a particular frequency band is sufficiently high, the decision can be made to use a frequency domain encoding method for this band. The advantage of using spectral peakedness is its direct coupling to the efficiency of a frequency-domain encoding method.
In a further embodiment, the predetermined threshold takes on a value of a temporal peakedness corresponding to a time-representation of said respective frequency-domain signal. In case both frequency-based encoding and time-based encoding can be used due to high values of the peakedness in the corresponding domains, i.e. spectral and temporal domain, the decision about using one or another encoding mode is made according to the most dominant peakedness measure. Thus if the spectral peakedness is larger than the temporal peakedness the frequency-based encoding mode is chosen, otherwise the time-based encoding is chosen.
In a further embodiment, the peakedness of the frequency-domain signal is a temporal peakedness corresponding to a time-representation of said frequency-domain signal. The temporal peakedness measure is determined from the time-domain representation of the frequency domain signal corresponding to a frequency band. This time-domain representation can be further pre-processed. In particular, it is known that spectral flattening yields a good signal on which the temporal peakedness can best be established. However, considering sufficiently small frequency bands, the flattening stage may be omitted. If only a small number of large values is present in the time-domain representation, then these are perceptually very relevant and the remaining large number of small values can be efficiently compressed thus arriving at a high-quality efficient code by using a time-domain encoding method. The advantage of using temporal peakedness is its direct coupling to the efficiency of a time-domain encoding method.
In a further embodiment, the time-based encoding mode is selected when the temporal peakedness exceeds a predetermined threshold.
In a further embodiment, the predetermined threshold takes on a value of a spectral peakedness of said respective frequency-domain signal. In such a case, one can consider whether a time-domain peakedness or the frequency-domain peakedness is highest and apply the appropriate mode of encoding to the components of this frequency band. In doing so, the information concerning both measures is balanced in the final mode decision thus arriving at the optimal decision.
In a further embodiment, selecting a time based encoding mode or a frequency-based encoding mode for each respective frequency-domain signal is further based on at least one of spectral measures, energy measures, or long-term prediction estimates.
Since there are inevitably signal pieces where the temporal and spectral peakedness are almost equal, it is of advantage to use other sources of information to arrive at a better decision concerning the mode. This may be done by using in these cases the more indirect decision criteria as have been disclosed in S. A. Ramprashad, The multimode transform predictive coding paradigm, IEEE Trans. Speech Audio Process, 11 (2): 117-129, March 2003.
In a further embodiment, the division information is comprised in the output data stream. This is especially advantageous when the division into the plurality of frequency-domain signals varies over time and needs to be communicated to the decoder in order to allow an appropriate decoding.
The invention further provides an encoder as well as a computer program product enabling a programmable device to perform the encoding and/or decoding method according to the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other aspects of the invention will be apparent from and elucidated with reference to the embodiments shown in the drawings, in which:
Fig. 1 shows a flow chart for an adaptive time/frequency-based audio encoding method for encoding an input signal into an output data stream in accordance with the invention;
Fig. 2 shows a representation of a plurality of frequency-domain signals and their corresponding time-domain representations together with the peakedness measure corresponding to each of the respective frequency-domain signals;
Fig. 3 shows an example architecture of an adaptive time/frequency-based audio encoder for encoding an input signal into an output data stream in accordance with the invention;
Fig. 4 shows an example block diagram illustrating the transformation and mode determination unit of the adaptive time/frequency-based audio encoder;
Fig. 5 schematically shows an example of an encoding mode determination unit that determines whether a time-based encoding mode or a frequency-based encoding mode is to be used for the respective frequency-domain signal.
Throughout the figures, same reference numerals indicate similar or corresponding features. Some of the features indicated in the drawings are typically implemented in software, and as such represent software entities, such as software modules or objects.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Fig. 1 shows a flow chart for an adaptive time/frequency-based audio encoding method for encoding an input signal into an output data stream in accordance with the invention. The proposed encoding method can be summarized as comprising the following steps. The first step 110 comprises dividing an input audio signal into a plurality of frequency-domain signals and selecting a time-based encoding mode or a frequency-based encoding mode for each respective frequency-domain signal. The second step 120 comprises encoding each frequency-domain signal in the respective encoding mode. The third step 130 comprises combining encoded data and encoding mode information of each respective frequency-domain signal into an output data stream. In the proposed solution selecting a time-based encoding mode or a frequency-based encoding mode for each respective frequency-domain signal is based on a peakedness of said respective frequency-domain signal.
In the first step the input signal is transformed into the frequency domain. Subsequently, it is divided into a plurality of frequency-domain signals. Said frequency-domain signals correspond to e.g. frequency bands. Said frequency bands can be fixed, i.e. the thresholds separating the bands are fixed. The division in bands is preferably logarithmic or linear. Thus, the band sizes can vary from band to band, or they can be of the same size. The number of bins can be arbitrary. Preferably, the number of bands should be determined depending on the actual audio content.
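As a rough illustration of the band division described above, the following sketch splits the bins of one transformed frame into linearly or logarithmically spaced bands. The names `band_edges` and `split_into_bands`, their parameters, and the rounding rule are assumptions made for illustration only; the description does not prescribe a particular division rule.

```python
import numpy as np

def band_edges(num_bins, num_bands, spacing="log"):
    """Return num_bands + 1 increasing bin indices that delimit the bands."""
    if spacing == "linear":
        edges = np.linspace(0, num_bins, num_bands + 1)
    elif spacing == "log":
        # Log-spaced edges from bin 1 to num_bins; bin 0 joins the first band.
        edges = np.geomspace(1, num_bins, num_bands + 1)
        edges[0] = 0
    else:
        raise ValueError("spacing must be 'linear' or 'log'")
    edges = np.round(edges).astype(int)
    # Force strictly increasing edges so that no band is empty.
    for i in range(1, len(edges)):
        edges[i] = max(edges[i], edges[i - 1] + 1)
    if edges[-1] > num_bins:
        raise ValueError("too many bands for the given number of bins")
    return edges

def split_into_bands(spectrum, edges):
    """Cut a 1-D array of frequency-domain coefficients into bands."""
    return [spectrum[lo:hi] for lo, hi in zip(edges[:-1], edges[1:])]
```

For instance, `band_edges(512, 14, "log")` yields narrow bands at low frequencies and wide bands at high frequencies, while `"linear"` yields bands of nearly equal size.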
Coding of audio or speech data is typically performed in frames of input data in order to be able to track or adapt to the possibly time-varying character of the input signal. Said frame sizes can be fixed or they can vary over time.
The invention focuses on the issue of selecting the encoding mode that is suitable for the frequency-domain signal. Said frequency-domain signal is defined as the frequency signal corresponding to a frequency band of e.g. a frame. The peakedness of an audio signal relates much better to the strengths of the time-based and frequency-based encoding than do the spectral measures, energy measures, or long-term prediction measures indicated in S. A. Ramprashad, The multimode transform predictive coding paradigm, IEEE Trans. Speech Audio Process, 11 (2):117-129, March 2003 or in the patent application US2007/0106502. Therefore, the resulting speech/audio quality and the corresponding encoding efficiency are improved. In other words, the choice of the encoding mode is better tailored to the actual content of the frequency-domain signal.
In particular, if a signal has a high peakedness, these peaks are typically perceptually relevant and the quantized signal is associated with a low bit rate since the majority of values are quantized to the zero level. In this way, the proposed measure has a direct relation to the essential property of the coder and provides a high quality at a low bit rate. The measures used so far are more indirect, involve more parameters, require complex inference rules and are thus more prone to non-optimal decisions. The way the encoding mode is determined is explained later.
In the second step each of the plurality of frequency-domain signals is encoded according to the encoding mode selected for it. In the third step the encoded data for each of the plurality of frequency-domain signals is combined with its respective encoding mode information into the output data stream.
In an embodiment, the peakedness of the frequency-domain signal is a spectral peakedness. The spectral measure essentially determines the amount of samples largely deviating from a zero or a mean value relative to the amount of samples close to the zero or mean value for a frequency-domain representation of the input signal or a preprocessed version thereof. If only a small number of large values are present in the signal, these are perceptually very relevant and the remaining large number of small values can be efficiently compressed thus arriving at a high-quality efficient code. It is then advantageous to encode such frequency-domain signal with the frequency-based encoding.
The constituent encoders, i.e. time-based encoding and frequency-based encoding, operate by segmenting the input signal into frames and encoding the resulting signal frames. It is efficient to determine the peakedness by considering intervals defined within the frame of the input signal and calculating the spectral peakedness for the entire frame from the results obtained for said intervals.
It is advantageous to determine the value of the spectral peakedness from the input signal at regular intervals. The regular intervals make it possible to follow the dynamics of the input signal closely. From the point of view of the implementation it is preferable that the regular intervals coincide with the shortest update that occurs in the frequency and time domain encoding. However, the intervals over which the spectral peakedness is to be calculated could be even smaller than the shortest frames used by the time and frequency encoding.
In a further embodiment, the spectral peakedness is expressed as:
whereby X(k) is the respective frequency-domain signal for a frequency interval with a width F. Said spectral-domain representation is real valued and comprises e.g. MDCT coefficients or absolute values of complex amplitude values resulting from a Discrete Fourier Transform or any other frequency-domain representation like MDCT or MLT. The advantage of this particular measure is its simplicity of calculation.
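The expression itself is not reproduced in this text. Since the description states further on that the normalized fourth moment is the basis for this formula, one plausible form, given here only as an assumption and with the symbol P_spectral introduced for illustration, is the following, where the sums run over the F coefficients X(k) of the band:

```latex
P_{\mathrm{spectral}} \;=\; \frac{\dfrac{1}{F}\displaystyle\sum_{k=1}^{F} X(k)^{4}}{\left(\dfrac{1}{F}\displaystyle\sum_{k=1}^{F} X(k)^{2}\right)^{2}}
```

Under this assumed form, a band whose coefficients all have the same magnitude gives a value of 1, and a band dominated by a single coefficient gives a value close to F.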
Peakedness refers here to any measure that correlates with the degree of presence of peaks in a signal (spectral or temporal). Various measures are known to be used for this purpose. As example the normalized fourth moment that is used to measure the degree of fluctuations in an envelope signal (cf. Hartmann and Pumplin, "Noise power fluctuations and masking of sine signals," 1988, J. Acoust. Soc. Am., Vol. 83, pp. 2277- 2289) can be used. This is also the basis for the formula for the spectral peakedness given above. Alternatively, the kurtosis measure can be used as a measure of spectral peakedness. Since peakedness can be regarded as the opposite of flatness, also flatness measures may be used as a criterion. A common spectral flatness measure is the ratio of the geometric mean of bins of the magnitude spectrum divided by the arithmetic mean of the same bins. Various other statistical methods are feasible to define a measure for spectral peakedness.
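The measures named above can be sketched in a few lines. The code below shows a normalized fourth moment and the spectral flatness measure (geometric mean over arithmetic mean). The function names, the handling of an all-zero band, and the small `eps` offset are assumptions for illustration; the normalization shown is the common textbook form and is not confirmed to be the exact one used for the formulas of this description.

```python
import numpy as np

def normalized_fourth_moment(x):
    """mean(x^4) / mean(x^2)^2; equals 1 for a constant-magnitude signal."""
    x = np.asarray(x, dtype=float)
    power = np.mean(x ** 2)
    if power == 0.0:
        return 1.0  # assumption: treat an all-zero band as perfectly flat
    return float(np.mean(x ** 4) / power ** 2)

def spectral_flatness(magnitudes, eps=1e-12):
    """Geometric mean divided by arithmetic mean of the magnitude bins."""
    # eps (an assumption) keeps the logarithm finite for zero-valued bins.
    m = np.abs(np.asarray(magnitudes, dtype=float)) + eps
    return float(np.exp(np.mean(np.log(m))) / np.mean(m))
```

A flat band gives a fourth-moment value of 1 and a flatness close to 1, whereas a band with a single dominant bin gives a large fourth-moment value and a flatness close to 0, which reflects the statement that peakedness is the opposite of flatness.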
In a further embodiment, the frequency-based encoding mode is selected when the spectral peakedness exceeds a predetermined threshold. If the spectral peakedness of a particular frequency band is sufficiently high, the decision is made to use a frequency domain encoding method for this band as it results in a high audio/speech quality at the high encoding efficiency. A value of 1.65 is an example for the predetermined threshold. This value has been found to be giving good results in tests, however other values are also possible.
In a further embodiment, the predetermined threshold takes on a value of a temporal peakedness corresponding to a time-representation of said respective frequency-domain signal. In case both frequency-based encoding and time-based encoding can be used due to high values of the peakedness in the corresponding domains, i.e. spectral and temporal domain, the decision about using one or another encoding mode is made according to the most dominant peakedness measure. Thus if the spectral peakedness is larger than the temporal peakedness the frequency-based encoding mode is chosen, otherwise the time-based encoding is chosen.
In a further embodiment, the peakedness of the frequency-domain signal is a temporal peakedness corresponding to a time-representation of said frequency-domain signal. In order to determine the time peakedness corresponding to the respective frequency-domain signal said frequency-domain signal is transformed to the time domain. In said time domain for said time domain representation of the frequency-domain signal the temporal peakedness is determined.
The temporal measure essentially determines the amount of samples largely deviating from a zero or a mean value relative to the amount of samples close to the zero or mean value for said time-domain representation of the frequency-domain signal.
If only a small number of large values are present in the signal, these are perceptually very relevant and the remaining large number of small values can be efficiently compressed thus arriving at a high-quality efficient code. It is then advantageous to encode such frequency-domain signal with the time-based encoding.
In a further embodiment, the temporal peakedness is expressed as:
whereby xw(n) is a temporal signal representation of the respective frequency-domain signal across the interval where N is the number of samples of the associated time-domain signal. The size of the interval can vary. However, it is advantageous to determine the peakedness measure across a number of short, overlapping intervals within a frame rather than calculating the peakedness measure across the frame. It is also possible to divide the temporal signal representation into subintervals, and subsequently to determine the peakedness per subinterval and combine (e.g. a max operator) these peakedness values into a single peakedness value for the entire temporal signal representation.
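The interval-based evaluation described above can be sketched as follows. The peakedness of each short, overlapping interval is computed and the per-interval values are combined with a max operator. The function name, the interval length of 64 samples, the hop of 32 samples, and the use of a plain normalized fourth moment without a window are all assumptions for illustration.

```python
import numpy as np

def temporal_peakedness(x, interval=64, hop=32):
    """Max over overlapping intervals of mean(x^4) / mean(x^2)^2."""
    x = np.asarray(x, dtype=float)
    if x.size == 0:
        raise ValueError("signal must not be empty")
    interval = min(interval, x.size)
    starts = list(range(0, x.size - interval + 1, hop))
    if starts[-1] != x.size - interval:
        starts.append(x.size - interval)  # cover the tail of the frame
    values = []
    for start in starts:
        segment = x[start:start + interval]
        power = np.mean(segment ** 2)
        if power > 0.0:  # silent intervals carry no peakedness information
            values.append(np.mean(segment ** 4) / power ** 2)
    # Assumption: an entirely silent frame is reported as flat (value 1).
    return float(max(values)) if values else 1.0
```

Setting `interval` to the frame length reduces this to a single evaluation across the whole frame, which is the alternative the text mentions.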
The temporal peakedness measure is determined from the time-domain representation of the frequency-domain signal. Said frequency-domain signal preferably corresponds to a frequency band of the input signal. This time-domain representation can be a pre-processed signal obtained by transforming the respective frequency-domain signal to the time domain. In particular, spectral flattening yields a good signal on which the temporal peakedness can best be established. However, considering sufficiently small frequency bands, the flattening stage may be omitted.
If only a small number of large values is present in the time-domain representation, then these are perceptually very relevant and the remaining large number of small values can be efficiently compressed thus arriving at a high-quality efficient code by using a time-domain encoding method.
In a further embodiment, the time-based encoding mode is selected when the temporal peakedness exceeds a predetermined threshold. If the temporal peakedness of a particular frequency band is sufficiently high, the decision is made to use a time-based encoding mode for this band as it results in a high audio/speech quality at the high encoding efficiency. A value of 1.7 is an example for the predetermined threshold. This value has been found to be giving good results in tests, however other values are also possible.
In a further embodiment, the predetermined threshold takes on a value of a spectral peakedness of said respective frequency-domain signal. In case both frequency-based encoding and time-based encoding can be used due to high values of the peakedness in the corresponding domains, i.e. spectral and temporal domain, the decision about using one or another encoding mode is made according to the most dominant peakedness measure. Thus if the temporal peakedness is larger than the spectral peakedness the time-based encoding mode is chosen, otherwise the frequency-based encoding is chosen.
Fig. 2 shows a representation of a plurality of frequency-domain signals and their corresponding time-domain representations together with the peakedness measure corresponding to each of the respective frequency-domain signals. The top plot 210 in Fig. 2 depicts the frequency domain representation of the input signal. Said frequency domain representation is obtained by e.g. a Fourier-based transform or a filter bank applied to the input signal. For the case shown in Fig. 2 a chunk of 512 samples of the input signal has been windowed and transformed using a critically sampled filter bank. Said frequency domain representation of said chunk of the input signal has been divided into 14 frequency-domain signals each of which corresponds to a respective frequency band. Said frequency bands are determined in a logarithmic manner, resulting in a small band size for lower frequencies and a large band size for higher frequencies. For visualization purposes the data in each band is scaled back to a Root-Mean-Square (RMS) of 1. For each of these bands, i.e. each of these frequency-domain signals, the normalized fourth moment correlation is calculated as a spectral measure. For visualization purposes the values of the spectral measure per frequency band are multiplied by two, and they are indicated in the top plot 210 of Fig. 2 by a circle or a star.
The bottom plot 220 in Fig. 2 depicts a time domain representation of the respective frequency-domain signals. For the case shown in the bottom plot the real valued data within each band, i.e. for each frequency-domain signal, is transformed to a complex valued vector. The first component of the vector corresponds to real valued first data point in said frequency-domain signal. The real part of the second vector component is the second data point, and the imaginary part is the third data point. The real part of the third vector component is the fourth data point, and the imaginary part is fifth data point, etc. The complex vector created in this way is assumed to represent the positive frequency half of a real valued signal. Alternatively other ways of constructing said vector could be used. Subsequently, said vector is transformed to the time domain using an inverse Fourier transform. This process is performed separately for each band, i.e. for each frequency-domain signal. For visualization purposes within each band the time domain signal is normalized and plotted in the same way as the corresponding frequency-domain signal. Therefore, the time domain signals in the bands as illustrated in the bottom plot of Fig. 2 correspond one-to-one with the bands determined in the frequency domain. For time domain signal representation in each band a temporal peakedness is calculated using the normalized fourth moment correlation. For visualization purposes said value is multiplied by two and plotted within each band using the circle or the star.
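The packing of real-valued band data into a complex vector, followed by the inverse Fourier transform, can be sketched as below. The function name `band_to_time`, the zero padding for a missing imaginary part, and the choice of an odd output length (so that no component is treated as a Nyquist term and its imaginary part discarded) are assumptions for illustration; the description only specifies the order in which the data points are packed.

```python
import numpy as np

def band_to_time(band):
    """Pack real band coefficients into a half spectrum and invert it."""
    band = np.asarray(band, dtype=float)
    if band.size == 0:
        raise ValueError("band must contain at least one coefficient")
    rest = band[1:]
    if rest.size % 2:
        # Assumption: a missing final imaginary part is taken to be zero.
        rest = np.append(rest, 0.0)
    half = np.empty(1 + rest.size // 2, dtype=complex)
    half[0] = band[0]                         # first data point, purely real
    half[1:] = rest[0::2] + 1j * rest[1::2]   # then (real, imaginary) pairs
    # irfft treats `half` as the non-negative-frequency half of a real signal.
    # An odd output length keeps every component, including the last one.
    return np.fft.irfft(half, n=2 * half.size - 1)
```

With this choice a band holding an odd number of coefficients maps to a real time-domain signal with the same number of samples, which can then be passed to a temporal peakedness measure.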
As can be seen, for certain bands the time domain representation is more peaked than its corresponding frequency domain representation. In such a case a star is used to indicate the temporal peakedness in the bottom plot and the circle is used to indicate the spectral peakedness in the corresponding band in the top plot. When both peakedness measures are available, the decision about which of the encoding modes should be used is based on the relation between the spectral and temporal peakedness for said frequency-domain signal. When the spectral peakedness is larger than the temporal peakedness for the respective frequency-domain signal the frequency-based encoding is used. When the temporal peakedness is larger than the spectral peakedness for the respective frequency-domain signal the time-based encoding is used.
To illustrate the above strategy for encoding mode selection consider the 9th band 212 in the top plot 210. The values of the representation of the frequency-domain signal in this band oscillate approximately about the value 0.8. There are as many peaks as valleys, the values corresponding to said peaks are roughly the same, and the values corresponding to said valleys are also roughly the same. The time representation of this frequency-domain signal as represented in band 222 of the bottom plot 220 shows a pronounced single peak toward the end of the band. The rest of the values in this band are rather small and close to zero. The temporal peakedness exceeds the spectral peakedness and therefore the time-based encoding is used for this frequency-domain signal. For the 8th band 211 in the top plot 210 and the corresponding band 221 in the bottom plot 220 the situation is reversed. The frequency-domain signal has a single pronounced peak in this band, while the time domain representation of this frequency-domain signal has a rather balanced variation of values approximately around 0.8. Since the spectral peakedness is clearly larger than the temporal peakedness for said band, the frequency-based encoding is used to encode the frequency-domain signal corresponding to this band.
Alternatively, other ways of assessing the dominance of the spectral or temporal peakedness can be used that utilize a formula or heuristics. One of the heuristics which can be used to determine the encoding mode is described below. For example, the time-based encoding is used to encode the frequency-domain signal when the temporal peakedness is larger than the spectral peakedness for said frequency-domain signal and when additionally the temporal peakedness is larger than the predetermined threshold taking a value of e.g. 1.7. Otherwise, the frequency-based encoding is used to encode the frequency-domain signal.
In a further embodiment, selecting a time-based encoding mode or a frequency-based encoding mode for each respective frequency-domain signal is further based on at least one of spectral measures, energy measures, or long-term prediction estimates. These additional measures serve a correction purpose to accommodate the features of the input signal which are strongly pronounced by these measures. The spectral measures comprise: LP coding gain, the spectral change between LP filters of consecutive frames, or the like. The energy measures comprise: the signal energy, the change in signal energy between subframes, or the like. The long-term prediction estimates comprise: estimated pitch delay, estimated long-term prediction gains, or the like. Said measures are extensively discussed in S. A. Ramprashad, The multimode transform predictive coding paradigm, IEEE Trans. Speech Audio Process, 11 (2): 117-129, March 2003.
For the heuristics presented above, e.g. a pitch is used as the correcting factor. The decision about the use of a specific encoding mode is then made as follows. The time-based encoding is used to encode the frequency-domain signal when a combination (e.g. a weighted sum) of the temporal peakedness and the pitch corresponding to a frequency-domain signal is larger than the spectral peakedness for said frequency-domain signal and when, additionally, the temporal peakedness is larger than the predetermined threshold taking a value of e.g. 1.7. Otherwise, the frequency-based encoding is used to encode the frequency-domain signal. Various other options can be used to improve the encoding mode selection. For example, next to the heuristic approaches discussed above, additional constraints are imposed on e.g. the tilt or the energy of the frequency-domain signal. Said constraints take the form of e.g. a threshold limitation or another, more sophisticated form.
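The pitch-corrected variant of the heuristic can be sketched in the same way. The weights `w_temporal` and `w_pitch`, and the choice of a scalar `pitch_measure` (for example a long-term prediction gain), are assumptions made for illustration; the text only says that a combination such as a weighted sum may be used.

```python
def select_mode_with_pitch(temporal_peakedness: float,
                           spectral_peakedness: float,
                           pitch_measure: float,
                           w_temporal: float = 1.0,
                           w_pitch: float = 0.5,
                           threshold: float = 1.7) -> str:
    """Return "time" or "frequency" for one frequency-domain signal.

    The weighted sum of temporal peakedness and the pitch measure is
    compared against the spectral peakedness; the temporal peakedness
    itself must still exceed the threshold. Weights are hypothetical.
    """
    combined = w_temporal * temporal_peakedness + w_pitch * pitch_measure
    if combined > spectral_peakedness and temporal_peakedness > threshold:
        return "time"
    return "frequency"
```

With these illustrative weights, a band whose temporal peakedness (2.0) is slightly below its spectral peakedness (2.2) can still be routed to the time-based mode when the pitch measure is large enough to tip the weighted sum.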
Said mode determining means can also be used to determine the optimal division of the input signal into the plurality of frequency-domain signals. For instance, considering the method where the encoding mode is determined by the absolute or relative difference between the spectral and temporal peakedness of the divided frequency-domain signal, the division can be determined as the one which maximizes, in some sense, said difference. In an embodiment, an indicator for the division information is comprised in the output data stream. Said indicator is for example a code for specific division information of the input signal, or an address of a device, e.g. a server on the Internet, from which said division information can be retrieved. A decoder can be configured, based on said indicator, to operate according to the division information used to produce the data stream to be decoded. When the division of the input audio signal into a plurality of frequency-domain signals varies in time, it is especially desirable to include division information 332 in the output data stream.
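One possible reading of "maximizes in some sense" is to score each candidate division by the summed absolute difference between temporal and spectral peakedness over its bands and to keep the highest-scoring candidate. The sketch below assumes exactly that; the function names and the `peakedness_pair` callback (which returns a `(temporal, spectral)` pair for a band) are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Band = Sequence[float]
PeakednessPair = Callable[[Band], Tuple[float, float]]


def division_score(bands: Sequence[Band],
                   peakedness_pair: PeakednessPair) -> float:
    """Sum of |temporal - spectral| peakedness over all bands."""
    return sum(abs(t - s) for t, s in map(peakedness_pair, bands))


def best_division(spectrum: Sequence[float],
                  candidate_edges: Sequence[Sequence[int]],
                  peakedness_pair: PeakednessPair) -> Sequence[int]:
    """Return the candidate band-edge list with the highest score.

    Each candidate is a list of band edges, e.g. (0, 4, 8) splits an
    8-bin spectrum into bins 0..3 and 4..7.
    """
    def split(edges: Sequence[int]):
        return [spectrum[a:b] for a, b in zip(edges[:-1], edges[1:])]

    return max(candidate_edges,
               key=lambda edges: division_score(split(edges),
                                                peakedness_pair))
```

Other criteria (for example a relative difference, or a per-band minimum) would fit the same structure by replacing `division_score`.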
Fig. 3 shows an example architecture of an adaptive time/frequency-based audio encoder 300 for encoding an input signal into an output data stream in accordance with the invention. Said encoder comprises a transformation and mode determination unit 310, an encoding unit 320, and a merger 330.
The transformation and mode determination unit 310 divides the input audio signal into a plurality of frequency-domain signals and selects the time-based encoding mode or the frequency-based encoding mode for each respective frequency-domain signal. Then, the transformation and mode determination unit 310 outputs a frequency-domain signal 321 determined to be encoded in the time-based encoding mode, a frequency-domain signal 322 determined to be encoded in the frequency-based encoding mode, and encoding mode information 331 for each frequency-domain signal. In said example, dividing the input signal and selecting the encoding mode are performed in a single unit 310; however, a separate functional unit (implemented in hardware or software) could be assigned to perform each of these functions.
The encoding unit 320 encodes each frequency-domain signal in the respective encoding modes selected by the transformation and mode determination unit 310. The unit 320 performs time-based encoding on the frequency-domain signal 321 and performs frequency-based encoding on the frequency-domain signal 322. The encoding unit 320 outputs encoded data 333 on which the time-based encoding has been performed and encoded data 334 on which the frequency-based encoding has been performed.
The merger 330 combines encoded data 333 and 334, and encoding mode information 331 for each respective encoded frequency-domain signal to produce the output data stream 341.
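The data flow through units 310, 320 and 330 can be illustrated with a small sketch. Everything here is hypothetical scaffolding: `EncodedBand`, `encode_frame` and the injected callbacks stand in for the mode determination, the two encoders and the merger, and no actual bitstream format is implied.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Band = Sequence[float]


@dataclass
class EncodedBand:
    mode: str       # encoding mode information (cf. 331)
    payload: bytes  # encoded data (cf. 333 or 334)


def encode_frame(bands: Sequence[Band],
                 select_mode: Callable[[Band], str],
                 encode_time: Callable[[Band], bytes],
                 encode_freq: Callable[[Band], bytes]) -> List[EncodedBand]:
    """Encode each band in its selected mode and merge the results.

    The returned list plays the role of the output data stream: for
    every band it carries the mode information next to the payload.
    """
    stream = []
    for band in bands:
        mode = select_mode(band)  # mode decision (cf. unit 310)
        if mode == "time":
            payload = encode_time(band)  # time-based path (cf. unit 320)
        else:
            payload = encode_freq(band)  # frequency-based path (cf. unit 320)
        stream.append(EncodedBand(mode, payload))  # merging (cf. merger 330)
    return stream
```

Passing the encoders and the mode decision as callables mirrors the remark that each function could be assigned to a separate functional unit.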
Fig. 4 shows an example block diagram illustrating the transformation and mode determination unit 310 of the adaptive time/frequency-based audio encoder 300. The transformation and mode determination unit 310 comprises a frequency-domain transform unit 400 and an encoding mode determination unit 410. The frequency-domain transform unit 400 transforms the input audio signal 311 into a full frequency-domain signal 421. Said full frequency-domain signal 421 has a frequency spectrum such as e.g. the one illustrated in the top plot 210 of Fig. 2. Said frequency-domain representation is obtained by applying e.g. a Fourier-based transform or a filter bank to the input signal. The encoding mode determination unit 410 divides the full frequency-domain signal 421 into a plurality of frequency-domain signals according to a preset standard and selects either the time-based encoding mode or the frequency-based encoding mode for each frequency-domain signal based on the peakedness of said frequency-domain signal. The encoding mode determination unit 410 outputs the frequency-domain signal 321 determined to be encoded in the time-based encoding mode, the frequency-domain signal 322 determined to be encoded in the frequency-based encoding mode, the encoding mode information 331, and, when required, the division information 332 for each frequency-domain signal.
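As a rough illustration of units 400 and 410, the sketch below computes a naive discrete Fourier transform and splits the resulting spectrum into contiguous bands. The equal-width split is an assumption standing in for the unspecified "preset standard", and the function names are hypothetical; a practical encoder would use a fast transform or a filter bank.

```python
import cmath
from typing import List, Sequence


def dft(samples: Sequence[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform of a real input frame."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(samples))
        for k in range(n)
    ]


def split_into_bands(spectrum: Sequence, num_bands: int) -> List[list]:
    """Split a spectrum into num_bands contiguous bands.

    Bands have equal width; any leftover bins go to the last band.
    """
    size = len(spectrum) // num_bands
    bands = [list(spectrum[i * size:(i + 1) * size])
             for i in range(num_bands - 1)]
    bands.append(list(spectrum[(num_bands - 1) * size:]))
    return bands
```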
Fig. 5 schematically shows an example of an encoding mode determination unit 410 that determines whether a time-based encoding mode or a frequency-based encoding mode is to be used for the respective frequency-domain signal. Said unit 410 determines an encoding mode based on a spectral peakedness and a temporal peakedness of the input signal 421.
The input signal 421 is fed into a signal selector 511, which outputs the frequency-domain signal 422 corresponding to e.g. the selected frequency band. Said frequency-domain signal 422 is further fed into the unit 514, which calculates the spectral peakedness of said frequency-domain signal 422 based e.g. on the normalized fourth moment correlation (as discussed before).
In parallel, the frequency-domain signal 422 is fed into an inverse transform unit 512 that transforms said frequency-domain signal into a time-domain signal. Unit 513 derives a temporal peakedness measure for the signal provided by the unit 512. Said temporal peakedness is calculated using e.g. the normalized fourth moment correlation (as discussed before). Advantageously, said temporal peakedness is determined across a number of short, overlapping intervals within a frame rather than by calculating the peakedness measure across the whole frame at once.
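The peakedness computation of units 513 and 514 might look like the sketch below. It assumes one common form of a normalized fourth moment, mean(|x|^4) / mean(|x|^2)^2, which may differ from the formula defined earlier in the description. Taking the maximum over the overlapping intervals is likewise an assumption, since the text does not state how the per-interval values are combined; the window and hop sizes are hypothetical.

```python
from typing import Sequence


def peakedness(values: Sequence[complex]) -> float:
    """Normalized fourth moment: mean(|x|^4) / mean(|x|^2)^2.

    Returns 1.0 for a constant-magnitude signal and grows as the
    energy concentrates in fewer samples. Returns 0.0 for all zeros.
    """
    n = len(values)
    m2 = sum(abs(v) ** 2 for v in values) / n
    m4 = sum(abs(v) ** 4 for v in values) / n
    if m2 == 0:
        return 0.0
    return m4 / (m2 * m2)


def windowed_peakedness(samples: Sequence[float],
                        window: int = 8,
                        hop: int = 4) -> float:
    """Largest peakedness over short, overlapping intervals of a frame.

    Frames no longer than one window are measured as a whole.
    """
    if len(samples) <= window:
        return peakedness(samples)
    starts = range(0, len(samples) - window + 1, hop)
    return max(peakedness(samples[s:s + window]) for s in starts)
```

The same `peakedness` function can be applied to the bins of a band (spectral peakedness) or to the samples of its inverse transform (temporal peakedness).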
The temporal and spectral peakedness measures obtained from units 513 and 514 are fed into a unit 515, which combines these two measures to make a decision about the value of the predetermined frequency. The predetermined frequency can be determined in many ways by means of a formula or by means of heuristics. Alternatively other ways of assessing the dominance of the spectral or temporal peakedness can be used that utilize a formula or heuristics. Below one of the heuristics which can be used to determine the encoding mode is described.
For example, the time-based encoding is used to encode the frequency-domain signal when the temporal peakedness is larger than the spectral peakedness for said frequency-domain signal and when additionally the temporal peakedness is larger than the predetermined threshold taking a value of e.g. 1.7. Otherwise, the frequency-based encoding is used to encode the frequency-domain signal.
Furthermore, only one of the peakedness measures, namely spectral or temporal can be used, which means that only one of the paths comprising the unit 514 or the units 512 and 513 is used to determine the encoding mode.
Furthermore, although individually listed, a plurality of means, elements or method steps may be implemented by e.g. a single unit or processor. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. Also, the inclusion of a feature in one category of claims does not imply a limitation to this category but rather indicates that the feature is equally applicable to other claim categories as appropriate. Furthermore, the order of features in the claims does not imply any specific order in which the features must be worked, and in particular the order of individual steps in a method claim does not imply that the steps must be performed in this order. Rather, the steps may be performed in any suitable order. In addition, singular references do not exclude a plurality. Thus references to "a", "an", "first", etc. do not preclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.
Claims
1. An adaptive time/frequency-based audio encoding method for encoding an input signal (311) that is divided into a plurality of frequency-domain signals into an output data stream, said encoding method comprising encoding each frequency-domain signal in one of a time-based encoding mode or a frequency-based encoding mode, said method characterized in that the selection of the time-based encoding mode or the frequency-based encoding mode for the respective frequency-domain signal is based on a peakedness of said respective frequency-domain signal.
2. An encoding method according to claim 1, wherein the peakedness of the frequency-domain signal is a spectral peakedness.
3. An encoding method according to claim 2, wherein the spectral peakedness is expressed as:
4. An encoding method according to claim 2, wherein the frequency-based encoding mode is selected when the spectral peakedness exceeds a predetermined threshold.
5. An encoding method according to claim 4, wherein the predetermined threshold takes on a value of a temporal peakedness corresponding to a time-representation of said respective frequency-domain signal.
6. An encoding method according to claim 1, wherein the peakedness of the frequency-domain signal is a temporal peakedness corresponding to a time-representation of said frequency-domain signal.
7. An encoding method according to claim 6, wherein the temporal peakedness is expressed as:
8. An encoding method according to claim 6, wherein the time-based encoding mode is selected when the temporal peakedness exceeds a predetermined threshold.
9. An encoding method according to claim 8, wherein the predetermined threshold takes on a value of a spectral peakedness of said respective frequency-domain signal.
10. An encoding method according to claims 1-9, wherein selecting a time based encoding mode or a frequency-based encoding mode for each respective frequency-domain signal is further based on at least one of spectral measures, energy measures, or long-term prediction estimates.
11. An encoding method according to claim 1, wherein the division information is comprised in the output data stream.
12. An adaptive time/frequency-based audio encoder (300) for encoding an input signal (311) into an output data stream (341), said encoder comprising: a transformation and mode determination unit (310) to divide an input audio signal into a plurality of frequency-domain signals and to select a time-based encoding mode or a frequency-based encoding mode for each respective frequency-domain signal, an encoding unit (320) to encode each frequency-domain signal in the respective encoding modes selected by the transformation and mode determination unit, and a merger (330) to combine encoded data and encoding mode information for each respective encoded frequency-domain signal to produce the output data stream, said adaptive time/frequency-based encoder characterized by said transformation and mode determination unit selecting the time-based encoding mode or the frequency-based encoding mode for each respective frequency-domain signal based on peakedness of said respective frequency-domain signal.
13. An encoder according to claim 12, wherein the transformation and mode determination unit (310) is configured to select the time-based encoding mode or frequency-based encoding mode based on a spectral peakedness.
14. An encoder according to claim 13, wherein the spectral peakedness is
15. An encoder according to claim 13, wherein the transformation and mode determination unit is configured to select the frequency-based encoding mode when the spectral peakedness exceeds a predetermined threshold.
16. An encoder according to claim 15, wherein the predetermined threshold takes on a value of a temporal peakedness corresponding to a time-representation of said respective frequency-domain signal.
17. An encoder according to claim 12, wherein the transformation and mode determination unit is configured to select the time-based encoding mode or frequency-based encoding mode based on a temporal peakedness.
18. An encoder according to claim 17, wherein the temporal peakedness is
19. An encoder according to claim 17, wherein the transformation and mode determination unit is configured to select the time-based encoding mode when the temporal peakedness exceeds a predetermined threshold.
20. An encoder according to claim 19, wherein the predetermined threshold takes on a value of spectral peakedness of said respective frequency-domain signal.
21. An encoder according to claims 12-20, wherein the transformation and mode determination unit is configured to select the time-based encoding mode or frequency-based encoding mode further based on at least one of spectral measures, energy measures, or long-term prediction estimates of the frequency-domain signal.
22. An encoder according to claims 12-21, wherein the merger is arranged to include division information in the output data stream when division of the input audio signal into a plurality of frequency-domain signals varies in time.
23. A computer program product for executing the method of any of claims 1 to 11.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP07123456.1 | 2007-12-18 | ||
| EP07123456 | 2007-12-18 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2009077950A1 (en) | 2009-06-25 |
Family
ID=40316955
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/IB2008/055244 Ceased WO2009077950A1 (en) | 2007-12-18 | 2008-12-12 | An adaptive time/frequency-based audio encoding method |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2009077950A1 (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6233550B1 (en) * | 1997-08-29 | 2001-05-15 | The Regents Of The University Of California | Method and apparatus for hybrid coding of speech at 4kbps |
| US7039581B1 (en) * | 1999-09-22 | 2006-05-02 | Texas Instruments Incorporated | Hybrid speed coding and system |
| US20070106502A1 (en) * | 2005-11-08 | 2007-05-10 | Junghoe Kim | Adaptive time/frequency-based audio encoding and decoding apparatuses and methods |
| WO2007120316A2 (en) * | 2005-12-05 | 2007-10-25 | Qualcomm Incorporated | Systems, methods, and apparatus for detection of tonal components |
Non-Patent Citations (1)
| Title |
|---|
| SEAN A RAMPRASHAD: "The Multimode Transform Predictive Coding Paradigm", IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, US, vol. 11, no. 2, 1 March 2003 (2003-03-01), XP011079700, ISSN: 1063-6676 * |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103915100A (en) * | 2013-01-07 | 2014-07-09 | 中兴通讯股份有限公司 | A coding mode switching method and device, a decoding mode switching method and device |
| CN103915100B (en) * | 2013-01-07 | 2019-02-15 | 中兴通讯股份有限公司 | A coding mode switching method and device, and a decoding mode switching method and device |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP6682683B2 (en) | Decoding method, computer program, and decoding system | |
| RU2485606C2 (en) | Low bitrate audio encoding/decoding scheme using cascaded switches | |
| JP5688852B2 (en) | Audio codec post filter | |
| EP2491555B1 (en) | Multi-mode audio codec | |
| CA2833868C (en) | Apparatus for quantizing linear predictive coding coefficients, sound encoding apparatus, apparatus for de-quantizing linear predictive coding coefficients, sound decoding apparatus, and electronic device therefor | |
| CN107077858B (en) | Audio encoder and decoder using frequency domain processor with full bandgap padding and time domain processor | |
| JP5722437B2 (en) | Method, apparatus, and computer readable storage medium for wideband speech coding | |
| CA2833874C (en) | Method of quantizing linear predictive coding coefficients, sound encoding method, method of de-quantizing linear predictive coding coefficients, sound decoding method, and recording medium | |
| CN106796800B (en) | Audio encoder, audio decoder, audio encoding method, and audio decoding method | |
| US10706865B2 (en) | Apparatus and method for selecting one of a first encoding algorithm and a second encoding algorithm using harmonics reduction | |
| EP2502230B1 (en) | Improved excitation signal bandwidth extension | |
| CN105122357B (en) | LPC-based low-frequency enhancement in frequency domain | |
| WO2003091989A1 (en) | Coding device, decoding device, coding method, and decoding method | |
| KR20090025304A (en) | Audio encoders, audio decoders, and audio processors with dynamic variable warping | |
| JP2016505902A (en) | Apparatus and method for selecting one of a first encoding algorithm and a second encoding algorithm | |
| CA2983813C (en) | Audio encoder and method for encoding an audio signal | |
| WO2009077950A1 (en) | An adaptive time/frequency-based audio encoding method | |
| CN105122358B (en) | Apparatus and method for processing coded signals and encoder and method for generating coded signals | |
| CA2910878C (en) | Apparatus and method for selecting one of a first encoding algorithm and a second encoding algorithm using harmonics reduction | |
| WO2004097795A2 (en) | Adaptive voice enhancement for low bit rate audio coding | |
| KR20080034817A (en) | Encoding / Decoding Apparatus and Method | |
| HK1175293B (en) | Multi-mode audio codec |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 08861201; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 08861201; Country of ref document: EP; Kind code of ref document: A1 |