
WO2009043066A1 - Method and device for low-latency auditory model-based single-channel speech enhancement - Google Patents


Info

Publication number
WO2009043066A1
Authority
WO
WIPO (PCT)
Prior art keywords
noise
signal
filter
speech
band
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/AT2007/000466
Other languages
French (fr)
Inventor
Martin Opitz
Robert Höldrich
Franz Zotter
Markus Noisternig
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AKG Acoustics GmbH
Original Assignee
AKG Acoustics GmbH
Application filed by AKG Acoustics GmbH filed Critical AKG Acoustics GmbH
Priority to GB1004090.5A (patent GB2465910B)
Priority to DE112007003674T (patent DE112007003674T5)
Priority to AT0956707A (patent AT509570B1)
Priority to PCT/AT2007/000466 (WO2009043066A1)
Publication of WO2009043066A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Definitions

  • The a priori SNR represents the information on the unknown spectral magnitude gathered from previous frames and is evaluated using the decision-directed approach (DDA). As the smoothing performed by the DDA may have irregularities, low-level musical noise may occur.
  • Ephraim and Van Trees propose another important method for noise reduction based on signal subspace decomposition (see Y. Ephraim and H. L. Van Trees, "A signal subspace approach for speech enhancement", in IEEE Trans. Speech and Audio Proc., vol. 3, pp. 251-266, July 1995).
  • the noisy signal is decomposed into a signal-plus-noise subspace and a noise subspace, where these two subspaces are orthogonal.
  • The resulting linear estimator is a general Wiener filter with adjustable noise level to control the trade-off between signal distortion and residual noise, as the two cannot be minimized simultaneously.
  • Skoglund and Kleijn point out the importance of the temporal masking property in connection with the excitation of voiced speech (see J. Skoglund and W. B. Kleijn, "On Time-Frequency Masking in Voiced Speech", in IEEE Trans. Speech and Audio Proc., vol. 8, no. 4, pp. 361-369, July 2000). It is shown that noise between the excitation impulses is more perceptible than noise close to the impulses, especially for low-pitch speech, for which the excitation impulses are temporally sparse. Temporal masking is not employed by conventional noise reduction methods using frequency-domain MMSE estimators. Patent WO 2006/114100 discloses a signal subspace approach taking the temporal masking properties into account.
  • The aim of the present invention consists in providing a single-channel auditory-model based noise suppression method with low-latency processing of wide-band speech signals in the presence of background noise. More specifically, the present invention is based on the method of spectral subtraction using a modified decision-directed approach comprising oversubtraction and an adjustable noise level to avoid perceptible musical tones. Further, the present invention uses sub-band processing plus pre- and post-filtering to give consideration to temporal and simultaneous masking inherent to human auditory perception, in particular to minimize perceptible signal distortions during speech periods.
  • A pre-processor, which emulates the transfer behaviour of the human outer and middle ear, is applied to the time-discrete noisy input signal (i.e. the desired speech contaminated by noise and interference).
  • In each sub-band, the level of the noisy signal is detected and smoothed.
  • These narrow-band level detectors applied to each of the plurality of sub-bands utilize the phase of simple low-order filter sections to provide lowest signal processing delay.
  • the noise level is estimated in each sub-band utilizing a heuristic approach based on recursive Minimum-Statistics.
  • the instantaneous signal-to-noise-ratio (SNR) in each sub-band is estimated from the envelope of the noisy signal and the noise level estimate.
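The two SNR quantities used throughout can be written out explicitly. The following are the standard definitions from the Ephraim-and-Malah framework referenced in this document; the variance symbols λ_x,k and λ_d,k are assumed here, since the original notation did not survive the text extraction:

```latex
\gamma_k[m] = \frac{Y_k^2[m]}{\lambda_{d,k}[m]} \quad \text{(a posteriori SNR)}, \qquad
\xi_k[m] = \frac{\lambda_{x,k}[m]}{\lambda_{d,k}[m]} \quad \text{(a priori SNR)},
```

where Y_k[m] is the detected magnitude of the noisy sub-band signal, λ_d,k the noise variance, and λ_x,k the clean-speech variance.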
  • The a priori SNR is estimated from the instantaneous SNR by applying the Ephraim-and-Malah Spectral Subtraction Rule (EMSR).
  • Temporal masking based on human auditory perception is taken into account by appropriate filtering of the sub-band signals.
  • These non-linear auditory post-masking filters apply recursive averaging to falling slopes of the signal level detected in each sub-band, with the following effects: (a) over-estimating variances of impulsive noise, (b) the noise suppression algorithm does not affect signals below the temporal masking threshold, and (c) no additional signal delay is introduced to transient signals, which is important in speech perception.
  • a non-linear gain function for each sub-band is derived from the a priori SNR estimates, comprising over-subtraction of the noise signal estimates.
  • the noisy signal in each sub-band is multiplied by the respective gain in order to suppress the noise signal components.
  • An optimized nearly perfect reconstruction filter-bank employing a decision criterion for signed summation re-synthesizes the enhanced full-band speech signal.
  • a post-processing filter is applied to the enhanced full-band signal to compensate the effect of the pre-processing filter.
  • Single channel subtractive-type speech enhancement systems are efficient in reducing background noise; however, they introduce a perceptually annoying residual noise.
  • Properties of the auditory system are introduced in the enhancement process. This phenomenon is modeled by the calculation of a noise-masking threshold in the frequency domain, below which all components are inaudible (see N. Virag, "Single Channel Speech Enhancement Based on Masking Properties of the Human Auditory System", IEEE Trans. on Speech and Audio Proc., vol. 7, no. 2, pp. 126-137, March 1999).
  • filter bank implementations are especially attractive as they can be adapted to the spectral and temporal resolution of the human ear.
  • The authors propose a noise suppression method based on spectral subtraction combined with Gammatone filter (GTF) banks divided into critical bands.
  • The concept of critical bands, which describes the resolution of the human auditory system, leads to a nonlinearly warped frequency scale called the Bark scale (see J. O. Smith III and J. S. Abel, "Bark and ERB Bilinear Transforms," IEEE Trans. on Speech and Audio Proc., vol. 7, no. 6, pp. 697-708, Nov. 1999).
  • The use of Gammatone filter banks outperforms the DTFT-based approaches in terms of computational complexity and overall system latency.
  • the GTF approach allows implementing a low-latency analysis-synthesis scheme with low computational complexity and nearly perfect reconstruction.
  • the proposed synthesis filter creates the broadband output signal by a simple summation of the sub-band signals, introducing a criterion that indicates the necessity of sign alteration before summation.
  • This approach outperforms channel-vocoder-based approaches as proposed e.g. by McAulay and Malpass (see R. J. McAulay and M. L. Malpass, "Speech Enhancement Using a Soft-Decision Noise Suppression Filter", IEEE Trans. on Acoust., Speech and Sig.
  • the method for low-latency auditory-model based single channel noise suppression and reduction works as an independent module and is intended for installation into a digital signal processing chain, wherein the software-specified algorithm is implemented on a commercially available digital signal processor (DSP), preferably a special DSP for audio applications.
  • FIG 1 is a schematic illustration of the single-channel sub-band speech enhancement unit of the present invention.
  • FIG 2 is a schematic illustration of the non-linear calculation of the gain factor for noise suppression applied to each sub-band.
  • FIG 3 and 4 show the roof-shaped MMSE-SP attenuation surface dependent on the a posteriori (γ_k) and the a priori (ξ_k) SNR.
  • The x-axis corresponds to γ_k and not (γ_k - 1) as in the literature.
  • The dash-dotted line in Fig. 3 marks the transition between the partitions of the attenuation surface, the dashed line shows the power spectral subtraction contour.
  • The contours of the DDA estimation are plotted in Fig. 4 upon the MMSE-SP attenuation surface. Dashed lines in Fig. 4 show the average of the dynamic relationships between γ_k and ξ_k, solid lines show static relationships.
  • FIG 5 and 6 are illustrations of the combined (modified) DDA and MMSE-SP estimation behaviour. Dashed lines in Fig. 5 show the average of the dynamic relationships between γ_k and ξ_k, solid lines show static relationships. Fig. 6 shows two fictitious hysteresis loops matching the observations from informal experiments.
  • FIG 7 shows a block diagram of the overall-system.
  • FIG 8 shows the over-all system comprising auditory frequency analysis and resynthesis as front- and back end, and using special low-latency and low-effort speech enhancement in between.
  • a combination of an elaborate noise suppression law with a human auditory model enables high quality performance.
  • FIG 9 shows an outer- and middle ear filter composed of three second order sections (SOS).
  • FIG 11 shows a familiar way of level-detection. As the signal power is used, the squared amplitude is detected.
  • FIG 12 shows the Low-Latency FIR level detector
  • FIG 13 shows a non-linear recursive auditory post-masking filter, responding to falling slopes.
  • FIG 14 shows a recursive noise level estimator using three time-constants and a counter threshold.
  • The Ephraim and Malah amplitude estimator and the Ephraim and Malah decision-directed a priori SNR estimate are employed (Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-32, no. 6, pp. 1109-1121, Dec. 1984; and Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-33, no. 2, pp. 443-445, Apr. 1985).
  • A signal model is considered in which a noisy signal y[n] consists of speech x[n] and additive noise d[n] at time index n, i.e. y[n] = x[n] + d[n].
  • The signals x[n] and d[n] are assumed to be statistically independent Gaussian random variables. Due to certain properties of the Fourier transform, the same statistical model can be assumed for the corresponding complex short-term spectral amplitudes X_k[m] and D_k[m] in each frequency bin k, at analysis time m. (Underlined variables denote complex quantities in the original; X_k[m] thus represents a complex variable, and the non-underlined X_k[m] shall represent its magnitude |X_k[m]|.)
  • The unknown clean speech variance is implicitly determined in the a priori SNR estimation part of the algorithm, whereas the noise variance has to be determined in advance, e.g. using Minimum Statistics (P. J. Wolfe and S. J. Godsill, "Simple Alternatives to the Ephraim and Malah Suppression Rule for Speech Enhancement", Proc. 11th IEEE Signal Processing Workshop, pp. 496-499, Aug. 2001), MCRA (I. Cohen and B. Berdugo, "Speech Enhancement for Non-Stationary Noise Environments", Signal Processing, no. 11, pp. 2403-2418, Elsevier, Nov. 2001), or Harmonic Tunneling (D. Ealey, H. Kelleher, D. Pearce, "Harmonic Tunnelling: Tracking Non-Stationary Noises During Speech", Proc. Eurospeech, 2001).
  • In section II an overview of the combined estimation is given, and its hysteretic shape is presented. Furthermore, in section III it is shown how a slight modification can reduce unwanted estimation behaviour and enable a smoother estimation hysteresis.
  • The EMSR reconstructs the magnitude of the clean speech signal X_k[m] from the noisy observation Y_k[m].
  • the time index m may be dropped for simplicity of notation.
  • the noisy phase is an optimal estimate of the clean phase.
  • The reconstruction operator is a real-valued spectral weight G[m]:
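The equation for this spectral weight did not survive the text extraction. One closed form consistent with the MMSE-SP attenuation surface referred to in the figure descriptions, in the variant given by Wolfe and Godsill (whose paper is cited in this document), is sketched below; the auxiliary variable ν_k is taken from that literature, not from the patent text:

```latex
G_k[m] = \sqrt{\frac{\xi_k[m]}{1+\xi_k[m]}\;\frac{1+\nu_k[m]}{\gamma_k[m]}},
\qquad
\nu_k[m] = \frac{\xi_k[m]}{1+\xi_k[m]}\,\gamma_k[m].
```

Here γ_k is the a posteriori and ξ_k the a priori SNR of sub-band k.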
  • The DDA combines two basic SNR estimators into a new estimator of the a priori SNR ξ_k.
  • the second estimator describes the reconstructed SNR, which is calculated after noise reduction using
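The expression referred to above is missing in this text. In the standard decision-directed formulation this reconstructed SNR is the suppressed power of the previous frame divided by the noise power; the following is a reconstruction from the literature, not a formula quoted from the patent:

```latex
\mathrm{SNR}_{\mathrm{rec},k}[m] = G_k^2[m-1]\,\gamma_k[m-1].
```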
  • The a posteriori SNR γ_k shows relative variations in time that are smaller than those of (γ_k - 1). (Relative variations, e.g. 10·log(γ_k[m]) - 10·log(γ_k[m-1]), are more significant than linear variations regarding human auditory perception.)
  • G provides a consistently high attenuation under low SNR conditions. Therefore, the reconstructed SNR_rec will take more consistent values than SNR_inst in the low SNR case.
  • The DDA for estimation of the a priori SNR combines both SNR_inst and SNR_rec:
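The combination announced above is missing its equation in this text. In the standard form given by Ephraim and Malah it reads as follows; this is a reconstruction from the literature, and note that, per the figure descriptions in this document, the present method works with γ_k rather than (γ_k - 1):

```latex
\hat{\xi}_k[m] = \alpha\, G_k^2[m-1]\,\gamma_k[m-1] + (1-\alpha)\,\max\bigl\{\gamma_k[m]-1,\;0\bigr\},
```

where α is the recursive averaging parameter and the first term is the reconstructed SNR of the previous frame.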
  • the specific estimation properties can be observed by inserting the suppression gain into the DDA.
  • Awkward estimation behavior, e.g. the "constant-ξ-effect", can be observed.
  • the discontinuities in the hysteresis loop (Fig. 4) give rise to considerations concerning a modification of the DDA and a reconsideration of time-constant and minimum a priori SNR quantities.
  • the parameter p can directly control the suppression hysteresis width and musical noise suppression. Our modification enables a separate control of averaging time-constant and musical noise suppression.
  • The overall system is shown in a block diagram in Fig. 7. It can be implemented as an analog or digital effect processor or as part of a software algorithm. Inside the overall system there are several subsystems (Fig. 8):
  • LD low-latency level detection
  • PM auditory post-masking filter
  • An outer- and middle-ear filter consists of three second-order sections (SOS) representing the following physiological parts of the human ear (E. Zwicker, H. Fastl, "Psychoacoustics, Facts and Models", Springer, Berlin Heidelberg, 1999; E. Terhardt, "Akustische Kommunikation", Springer, Berlin Heidelberg, 1998):
  • the latter two filters are optional, whereas the high-pass component is mandatory and reduces the influence of low-frequency noise on the noise suppressor.
  • a filter structure providing an appropriate magnitude transfer function could look like Fig. 9. All three filter sections have to be second order sections to provide appropriate slopes.
  • the outer filter skirts can be modelled as second order low- and high-pass shelving filters, whereas the resonance can be modelled as parametric peak-filter (P. Dutilleux, U. Zolzer, "DAFX”, Wiley&Sons, 2002).
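As an illustration, the mandatory high-pass skirt of such a filter can be sketched as a single second-order high-pass biquad. The RBJ audio-EQ-cookbook coefficient formulas and the 100 Hz corner frequency used below are illustrative assumptions, not values taken from the patent:

```python
import math

def highpass_sos(fc, fs, q=0.707):
    """Second-order high-pass section (RBJ audio-EQ-cookbook form).

    Returns [b0, b1, b2, a1, a2], normalized so a0 = 1.
    """
    w0 = 2.0 * math.pi * fc / fs
    alpha = math.sin(w0) / (2.0 * q)
    cw = math.cos(w0)
    b0, b1, b2 = (1.0 + cw) / 2.0, -(1.0 + cw), (1.0 + cw) / 2.0
    a0, a1, a2 = 1.0 + alpha, -2.0 * cw, 1.0 - alpha
    return [b0 / a0, b1 / a0, b2 / a0, a1 / a0, a2 / a0]

def biquad(x, c):
    """Direct-form-I biquad filter over a sequence x."""
    b0, b1, b2, a1, a2 = c
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for s in x:
        out = b0 * s + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x1, x2, y1, y2 = s, x1, out, y1
        y.append(out)
    return y
```

A DC input is blocked completely (the numerator coefficients sum to zero), while a tone well above the corner frequency passes almost unchanged, matching the role of the mandatory high-pass component described above.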
  • Frequency grouping is an important effect in human loudness perception.
  • The perceived loudness consists of partial loudnesses associated with individual frequency ranges.
  • An auditory frequency scale can be used to model these frequency grouping effects; its units can be seen as the frequency resolution of human auditory loudness perception (E. Zwicker, H. Fastl, "Psychoacoustics, Facts and Models", Springer, Berlin Heidelberg, 1999).
  • A reasonable frequency scale using a low number of frequency groups is given by the formula of Traunmüller (E. Terhardt, "Akustische Kommunikation", Springer, Berlin Heidelberg, 1998).
  • The bandwidths B_k can be derived from B_k = f(z_k + Δz/2) - f(z_k - Δz/2), i.e. the difference of the inverse auditory-scale frequencies half a critical-band width above and below the band center.
  • Other Bark scales exist, see e.g. E. Zwicker, H. Fastl, "Psychoacoustics, Facts and Models", Springer, Berlin Heidelberg, 1999.
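The Traunmüller formula mentioned above is not written out in this text. Its commonly quoted form, together with the inverse needed for deriving the critical bandwidths, is:

```latex
z(f) = \frac{26.81\, f}{1960 + f} - 0.53,
\qquad
f(z) = 1960\,\frac{z + 0.53}{26.28 - z},
```

with f in Hz and z in Bark, so that a band of width Δz centered at z_k has bandwidth B_k = f(z_k + Δz/2) - f(z_k - Δz/2).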
  • Auditory Gammatone filters (R. F. Lyon, "The All-Pole Gammatone Filter and Auditory Models", Proc. Forum Acusticum, Antwerpen 1996) can be efficiently implemented in the time- domain, allowing the separation of a broadband audio signal into auditory band signals.
  • The magnitude response of the Gammatone filter corresponds to the simultaneous masking properties of the human ear. Plotted along an auditory frequency scale, the magnitude response keeps the same shape whatever center frequency the filter is designed for.
  • The arbitrary form representing the family of Gammatone filters of order m is shown below, wherein k is the filterbank channel index.
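The equation announced above did not survive the text extraction. The standard continuous-time Gammatone impulse response of order m for channel k is given below; this is a reconstruction from the cited Gammatone literature, and the gain a, bandwidth parameter b_k and phase φ parametrization is assumed rather than quoted from the patent:

```latex
g_k(t) = a\, t^{\,m-1}\, e^{-2\pi b_k t}\, \cos(2\pi f_k t + \varphi), \qquad t \ge 0,
```

where f_k is the center frequency of channel k.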
  • a corresponding z-transform wherein *GF denotes an arbitrary Gammatone filter (e.g. GF, APGF, OZGF, TZGF), is:
  • An auditory Gammatone filterbank consists of a set of overlapping Gammatone filters that divide the auditory frequency scale into equally spaced frequency bands.
  • The term g_*GF shall be adjusted so that unity gain is provided at the center frequency ω_k.
  • The system H_num,k(z) has to be adapted suitably as shown in the following sub-sections.
  • The ordinary Gammatone filter (GF; R. F. Lyon, "The All-Pole Gammatone Filter and Auditory Models", Proc. Forum Acusticum, Antwerpen, 1996) has to be derived from the continuous-time impulse response using the Laplace transform and the impulse-invariance transform (A. V. Oppenheim, R. W. Schafer, J. R. Buck, "Discrete-Time Signal Processing", Prentice Hall, 1999), which determines the unknown polynomial H_num,k(z) in equation (21) above. Due to its shape and computational cost its use is not recommended.
  • the One-Zero Gammatone (OZGF) can be efficiently composed of a common "One-Zero" for all channels k before splitting up into k All-Pole Gammatone filters.
  • Fig. 11 provides an example, which also takes the form-factor F into account.
  • Suitable time-constants match the auditory pre-masking time-constant, which is approximately τ_avg ≈ 2 ms (G. Stoll, J. G. Beerends, R. Bitto, K. Brandenburg, C. Colomes, B. Feiten, M. Keyhl, C. Schmidmer, T. Sporer, T. Thiede, W. C. Treurniet, "PEAQ - der künftige ITU-Standard zur objektiven Messung der wahrgenommenen Audioqualität", Rundfunktechnische Mitteilungen, vol. 43, ISSN 0035-9890, Sept. 1999).
  • A consistent 90° phase shift can be applied to a broadband signal. Summing the squares of the original and the shifted signal, the squared amplitude (i.e. the signal power) remains while the sinusoidal components cancel. However, a causal implementation of the Hilbert transform does not exist. Unlike an ideal Hilbert transformer, we only need a 90° phase shift in the considered frequency range, i.e. in the corresponding auditory frequency group.
  • Each of the above-mentioned methods can provide a 90° phase shift at a virtually arbitrary frequency ω_k and is therefore suitable.
  • Fig. 12 provides an example for the FIR level detection method. Appropriate parameters can be found using the phase-equations for the corresponding systems, e.g. A. V. Oppenheim, R. W. Schafer, J. R. Buck, "Discrete-Time Signal Processing", Prentice Hall, 1999.
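A minimal sketch of such an FIR detector follows, assuming one concrete 90°-shift choice: a central difference scaled by 1/(2 sin ω_k), which relative to a one-sample-delay reference path has exactly 90° phase and unity gain at ω_k. This particular quadrature network is an illustrative choice, not necessarily the structure of Fig. 12:

```python
import math

def fir_level_detector(y, w_k):
    """Instantaneous squared-amplitude estimate at band center w_k (rad/sample).

    Reference path: one-sample delay. Quadrature path: (y[n] - y[n-2]) scaled
    by 1/(2 sin w_k), which is 90 degrees out of phase with the reference
    path and has matched gain at w_k. Their sum of squares yields the squared
    amplitude of a sinusoid at w_k with only one sample of delay.
    """
    s = 1.0 / (2.0 * math.sin(w_k))
    p = []
    for n in range(len(y)):
        ref = y[n - 1] if n >= 1 else 0.0
        quad = (y[n] - (y[n - 2] if n >= 2 else 0.0)) * s
        p.append(ref * ref + quad * quad)
    return p
```

For a pure tone of amplitude A at exactly ω_k the detector output equals A² at every sample (after the two-sample startup), independent of the tone's phase, which illustrates the low-latency property claimed for the sub-band level detectors.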
  • The averaging parameter α_k in channel k has to correspond to the human auditory post-masking time-constants at the corresponding frequencies ω_k. Therefore, the following equation is used to derive the averaging parameter α_k:
  • a parameter G can be used to scale the post-masking time-constants if useful.
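The release-only smoother described above can be sketched as follows. The attack/release decision and the exponential mapping α_k = exp(-1/(τ_k·f_s)) from time-constant to coefficient are standard choices assumed here, not formulas quoted from the patent:

```python
import math

def post_masking_filter(levels, tau_k, fs):
    """Non-linear one-pole smoother for a sub-band level signal.

    Rising levels pass through instantly (no delay added to transients);
    falling slopes decay with the post-masking time-constant tau_k (s).
    """
    alpha = math.exp(-1.0 / (tau_k * fs))
    out, state = [], 0.0
    for v in levels:
        if v >= state:
            state = v                                  # attack: follow instantly
        else:
            state = alpha * state + (1.0 - alpha) * v  # release: recursive average
        out.append(state)
    return out
```

This reproduces the behaviour listed earlier: onsets are untouched, while the level after an offset is held up briefly, emulating auditory post-masking.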
  • This method essentially applies three time-constants of averaging to the signal level. Falling slopes are slightly averaged, whereas during a rising input slope the output is held constant (i.e. infinitely large time-constant) for a period of N_w sampling intervals. When N_w sampling intervals are exceeded, the rising signal slope is averaged by a third time-constant.
  • The time-constants can be similarly converted to recursive averaging parameters as in equations (25) and (26).
  • An appropriate counter threshold N_w can be calculated using a continuous time interval T_w; this time interval can be chosen as e.g. T_w ≈ 1.5 s.
  • The falling-slope time-constant can be a scaled version of the post-masking time-constants τ_k, or e.g. a constant 200 ms.
  • The rising-slope time-constant defining the third averaging can be approximately 700 ms, which corresponds to a velocity of approximately 6 dB/s. Unlike the other time-constants, this one is proposed to be equal for all channels k.
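The estimator of Fig. 14 can be sketched as follows. The three time-constants and the counter threshold are taken from the text (200 ms fall, 700 ms rise, T_w ≈ 1.5 s), while the hold-counter update logic is a plausible reading of the description rather than the patented structure:

```python
import math

def noise_level_estimator(levels, fs, tau_fall=0.2, tau_rise=0.7, t_w=1.5):
    """Heuristic recursive minimum-statistics-style noise level tracker (sketch).

    - input below the estimate (falling): averaged quickly with tau_fall
    - input above the estimate (rising): estimate held for N_w samples,
      then followed slowly with tau_rise
    """
    a_fall = math.exp(-1.0 / (tau_fall * fs))
    a_rise = math.exp(-1.0 / (tau_rise * fs))
    n_w = int(t_w * fs)
    est, counter, out = 0.0, 0, []
    for v in levels:
        if v <= est:
            est = a_fall * est + (1.0 - a_fall) * v    # track minima quickly
            counter = 0
        else:
            counter += 1
            if counter > n_w:                          # sustained rise: follow slowly
                est = a_rise * est + (1.0 - a_rise) * v
        out.append(est)
    return out
```

Short speech bursts are shorter than the hold interval and therefore do not lift the estimate, while a genuinely risen noise floor is eventually followed at the slow rising-slope rate.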
  • The noise variance is given by the noise estimation algorithm; m and n are time indices, f_s is the system sample rate, and L is a down-sampling factor.
  • γ_k[m] is the a posteriori SNR
  • ξ_k[m] is the a priori SNR
  • G_w,k[m] is the spectral weight of a Wiener filter
  • α is an averaging parameter, defined by an averaging time-constant τ_snr,k, which is either approximately 2 ms (F. Zotter, M. Noisternig, R.
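The conversion from the averaging time-constant τ_snr,k to the recursive averaging parameter can be sketched as follows; the exponential one-pole relation used here is a standard choice and an assumption, not a formula quoted from the text:

```python
import math

def averaging_param(tau, fs):
    """One-pole recursive averaging coefficient for time-constant tau (s) at rate fs (Hz)."""
    return math.exp(-1.0 / (tau * fs))

def smooth(x, a):
    """First-order recursive average y[n] = a*y[n-1] + (1-a)*x[n]."""
    y, state = [], 0.0
    for v in x:
        state = a * state + (1.0 - a) * v
        y.append(state)
    return y

# e.g. tau_snr ~ 2 ms at a 44.1 kHz sample rate (illustrative values)
a = averaging_param(0.002, 44100.0)
```

With this mapping, the step response of the smoother reaches about 1 - 1/e of its final value after one time-constant worth of samples, which is the usual definition of τ.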
  • Up-sampling needs either a processing delay or a group delay due to the interpolation operation involved. Such a delay is approximately L samples long, using the up-sampling factor L.
  • Frequency-domain solutions using equivalent auditory models require delays in the range of 10 milliseconds, whereas the implementation of our system with 20 frequency bands and the third-order TZGF has a mean latency of 3.5 up to 4 milliseconds.


Abstract

The present invention relates to a method for enhancing wide-band speech audio signals in the presence of background noise and, more particularly to a noise suppression system, a noise suppression method and a noise suppression program. More specifically, the present invention relates to low-latency single-channel noise reduction using sub-band processing based on masking properties of the human auditory system.

Description

METHOD AND DEVICE FOR LOW-LATENCY AUDITORY MODEL-BASED SINGLE-CHANNEL SPEECH ENHANCEMENT
FIELD OF THE INVENTION
The present invention relates to a method for enhancing wide-band speech audio signals in the presence of background noise and, more particularly to a noise suppression system, a noise suppression method and a noise suppression program. More specifically, the present invention relates to low-latency single-channel noise reduction using sub-band processing based on masking properties of the human auditory system.
BACKGROUND OF THE INVENTION
Additive background noise in speech communication systems degrades the subjective quality and intelligibility of the perceived voice. Therefore, speech processing systems require noise reduction methods, i.e. methods aiming at processing a noisy signal with the purpose of eliminating or attenuating the level of noise and improving the signal-to-noise-ratio (SNR) without affecting the speech and its characteristics. In general, noise reduction is also referred to as noise suppression or speech enhancement.
For example, mobile phones are often used in environments with a high level of background noise such as public spaces. The use of mobile phones, voice-controlled devices and communication systems in cars has created a great demand for hands-free in-car installations, with the objective to increase safety and convenience; in many countries and regions the law prohibits e.g. hand-held telephony in cars. Noise reduction becomes important for these applications, as they often need to operate in adverse acoustic environments, in particular at low signal-to-noise ratios (SNR) and with highly time-varying noise signal characteristics (e.g. rolling noise of cars).
In room teleconferencing applications, such as video-conferencing or speech recognition and querying systems, ambient noise usually arises from fans of computers, printers, or facsimile machines, which can be considered as (long-term) stationary. Conversational noise, emerging from (telephone) talks of colleagues sharing the office and often referred to as babble noise, contains harmonic components and is therefore much harder to attenuate by a noise reduction unit.
However, applications within hearing aids and in-car speech communication systems require noise suppression methods that can be performed in real time.
Besides, the fast development of the underlying hardware in terms of computing power and storage capacity supports the progress of software implementations.
One of the most widely used methods for noise reduction in real-world applications is referred to in the art as spectral subtraction (see S. F. Boll, "Suppression of Acoustic Noise in Speech using Spectral Subtraction," IEEE Trans. Acoust. Speech and Sig. Proc., vol. ASSP-27, pp. 113-120, Apr. 1979). Generally, spectral subtraction attempts to estimate the short time spectral amplitude (STSA) of clean speech from that of the noisy speech, i.e. the desired speech contaminated by noise, by subtracting an estimated noise signal. The estimated speech magnitude is combined with the phase of the noisy speech, based on the assumption that the human ear is insensitive to phase distortions (see D. L. Wang et al., "The unimportance of phase in speech enhancement," IEEE Trans. Acoust. Speech and Sig. Proc., vol. ASSP-30, pp. 679-681, Aug. 1982). In practice, spectral subtraction is implemented by multiplying the input signal spectrum with a gain function in order to suppress frequency components with low SNR. This SNR-based gain function is formed from estimates of the noise spectrum and the noisy speech spectrum, assuming wide-sense stationary, zero-mean random signals and the speech and noise signals to be uncorrelated. These conventional spectral subtraction methods provide significant noise reduction, with the main disadvantage of a degradation of the signal quality, acoustically perceived as "musical tones" or "musical noise". The musical tones emerge from spectrum estimation errors. In recent years many enhancements to the basic spectral subtraction approach have been developed.
A method to reduce musical tones which is often applied is to subtract an overestimate of the noise spectrum to reduce the fluctuations in the DFT coefficients and prevent the spectral components from going below a spectral floor (see M. Berouti et al., "Enhancement of speech corrupted by acoustic noise," in Proc. IEEE Int. Conf. on Acoust., Speech and Sig. Proc. (ICASSP'79), vol. 4, pp. 208-211, Washington D.C., Apr. 1979). This approach successfully reduces musical tones during low SNR conditions and noise-only periods. The main disadvantage is the distortion of the speech signal during voice activity. In practice a tradeoff between speech quality level and residual noise floor level has to be found. Further methods cope with this problem by introducing optimal and adaptive oversubtraction factors for low SNR conditions and propose underestimation of the noise spectrum at high SNR conditions (see W. M. Kushner et al., "The effects of subtractive-type speech enhancement / noise reduction algorithms on parameter estimation for improved recognition and coding in high noise environments," in Proc. IEEE Int. Conf. Acoustics, Speech and Sig. Proc. (ICASSP'89), vol. 1, pp. 211-214, 1989).
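The Berouti-style rule, oversubtraction plus a spectral floor, can be sketched as follows; the parameter values alpha = 2 and beta = 0.01 are illustrative choices, not taken from the cited paper:

```python
import numpy as np

def berouti_gain(noisy_power, noise_power, alpha=2.0, beta=0.01):
    """Oversubtraction (alpha > 1) pulls fluctuating bins further
    down; the floor (beta times the noise power) keeps a small,
    perceptually smooth residual noise instead of isolated peaks."""
    residual = noisy_power - alpha * noise_power
    clean_power = np.maximum(residual, beta * noise_power)
    return np.sqrt(clean_power / noisy_power)
```

When a bin falls entirely below the oversubtracted estimate, the gain settles at sqrt(beta * noise / noisy) rather than zero, which is exactly the residual-noise/speech-distortion tradeoff discussed in the text.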
Applying a soft-decision based modification of the spectral gain function (see R. McAulay and M. Malpass, "Speech enhancement using a soft-decision noise suppression filter," in IEEE Trans. Acoust., Speech and Sig. Proc., vol. 28, no. 2, pp. 137-145, 1980) has been shown to improve the noise suppression properties of the enhancement system in terms of musical tone suppression. These soft-decision approaches mainly depend on the a priori probability of speech absence in each spectral component of the noisy speech.
The minimum mean-square error short-time spectral amplitude estimator (MMSE-STSA, see Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time amplitude estimator," IEEE Trans. Acoust. Speech and Sig. Proc., vol. 32, no. 6, pp. 1109-1121, 1984) and the minimum mean-square error log spectral amplitude estimator (MMSE-LSA, Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log spectral amplitude estimator," IEEE Trans. Acoust. Speech and Sig. Proc., vol. 33, no. 2, pp. 443-445, 1985) minimize the mean squared error of the estimated short-time spectral or log spectral amplitude, respectively. It was found that the nonlinear smoothing procedure of the MMSE-STSA/LSA methods (the so-called decision-directed approach) obtains a more consistent estimate of the SNR, resulting in good noise suppression without unpleasant musical tones (see O. Cappé, "Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor," IEEE Trans. Speech and Audio Proc., vol. 2, no. 2, pp. 345-349, 1994). Both Cappé and Malah (see D. Malah et al., "Tracking speech-presence uncertainty to improve speech enhancement in non-stationary noise environments," in Proc. IEEE Int. Conf. Acoust., Speech and Sig. Proc. (ICASSP'99), vol. 2, pp. 789-792, 1999) propose a limitation of the a priori SNR estimate to overcome the problem of perceptible low-level musical noise during speech pauses. The so-called a priori SNR represents the information on the unknown spectrum magnitude gathered from previous frames and is evaluated in the decision-directed approach (DDA). As the smoothing performed by the DDA may have irregularities, low-level musical noise may occur. A simple solution to this problem consists in constraining the a priori SNR by a lower bound.
In single-channel spectral subtraction the noise power spectrum is usually estimated during speech pauses, requiring voice activity detection (VAD) methods (see R. McAulay and M. Malpass, "Speech enhancement using a soft-decision noise suppression filter," in IEEE Trans. Acoust., Speech and Sig. Proc., vol. 28, no. 2, pp. 137-145, 1980; and W. J. Hess, "A pitch-synchronous, digital feature extraction system for phonemic recognition of speech", in IEEE Trans. Acoust., Speech and Sig. Proc., vol. 24, no. 1, pp. 14-25, 1976). This approach implies stationary noise characteristics during periods of speech. Arslan et al. developed a robust noise estimation method that does not require voice activity detection by recursive averaging with level dependent time constants in each subband (see L. Arslan et al., "New methods for adaptive noise suppression", in Proc. Int. Conf. on Acoustics, Speech and Sig. Proc. (ICASSP-95), Detroit, May 1995). Martin proposes a noise estimation method, which is based on minimum statistics and optimal signal power spectral density (PSD) smoothing (see R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," in IEEE Trans. Speech and Audio Proc., vol. 9, no. 5, pp. 512, July 2001). Further, Ealey et al. present a method for estimating non-stationary noise throughout the duration of the speech utterance by making use of the harmonic structure of the voiced speech spectrum, also referred to as harmonic tunnelling (see D. Ealey et al., "Harmonic tunnelling: tracking non-stationary noises during speech," in Proc. Eurospeech Aalborg, 2001). Further, as proposed by Sohn and Sung (see J. Sohn and W. Sung, "A voice activity detector employing soft decision based noise spectrum adaptation," in Proc. IEEE Int. Conf. Acoustics, Speech and Sig. Proc. (ICASSP'98), vol. 1, pp. 365-368, 1998) using soft decision information, the noise spectrum is continuously adapted whether speech is present or not.
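A crude, illustrative stand-in for minimum-statistics noise tracking (far simpler than Martin's optimally smoothed estimator; the window length, smoothing coefficient, and bias factor below are assumptions) shows the core idea: the minimum of the smoothed power over a sliding window tracks the noise floor even while speech is active.

```python
def track_noise_floor(power_frames, win=8, bias=1.5):
    """Per-band noise estimate: smooth the frame power with a
    one-pole filter, keep a short history, and take its minimum.
    The bias factor compensates the downward bias of the minimum."""
    est, history = [], []
    smoothed = power_frames[0]
    for p in power_frames:
        smoothed = 0.8 * smoothed + 0.2 * p  # first-order smoothing
        history.append(smoothed)
        history = history[-win:]             # sliding minimum window
        est.append(bias * min(history))      # minimum tracks the floor
    return est
```

Because speech bursts shorter than the window never dominate the minimum, the estimate stays near the noise level during voice activity, which is why no explicit VAD is needed.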
Ephraim and Van Trees propose another important method for noise reduction based on signal subspace decomposition (see Y. Ephraim and H. L. Van Trees, "A signal subspace approach for speech enhancement", in IEEE Trans. Speech and Audio Proc., vol. 3, pp. 251-266, July 1995). In doing so, the noisy signal is decomposed into a signal-plus-noise subspace and a noise subspace, where these two subspaces are orthogonal. This makes it possible to estimate the clean speech signal from the noisy speech signal. The resulting linear estimator is a general Wiener filter with adjustable noise level to control the trade-off between signal distortion and residual noise, as they cannot be minimized simultaneously.
Skoglund and Kleijn point out the importance of the temporal masking property in connection with the excitation of voiced speech (see J. Skoglund and W. B. Kleijn, "On Time-Frequency Masking in Voiced Speech", in IEEE Trans. Speech and Audio Proc., vol. 8, no. 4, pp. 361-369, July 2000). It is shown that noise between the excitation impulses is more perceptible than noise close to the impulses, and this is especially so for low-pitch speech, for which the excitation impulses are located temporally sparsely. Temporal masking is not employed by conventional noise reduction methods using frequency domain MMSE estimators. Patent WO 2006 114100 discloses a signal subspace approach taking the temporal masking properties into account.
OBJECT AND SUMMARY OF THE INVENTION
The aim of the present invention consists in providing a single-channel auditory-model based noise suppression method with low-latency processing of wide-band speech signals in the presence of background noise. More specifically, the present invention is based on the method of spectral subtraction using a modified decision directed approach comprising oversubtraction and an adjustable noise-level to avoid perceptible musical tones. Further, the present invention uses sub-band processing plus pre- and post-filtering to give consideration to temporal and simultaneous masking inherent to human auditory perception, in particular to minimize perceptible signal distortions during speech periods.
Frequency domain processing is accomplished for the proposed system by using a nonuniform Gammatone filter bank (GTF), which is divided into critical bands, also often referred to as Bark bands. This analysis filter bank separates the noisy signal into a plurality of overlapping narrow-band signals, considering spectral (simultaneous) masking properties of human auditory perception.
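Center frequencies for such a nonuniform auditory filter bank are commonly spaced on the ERB-rate scale (Glasberg and Moore), which closely parallels the Bark critical-band scale. The sketch below uses that standard mapping; the band count and edge frequencies are illustrative, not the patent's values:

```python
import numpy as np

def erb_center_frequencies(n_bands, f_low=100.0, f_high=8000.0):
    """Center frequencies equally spaced on the ERB-rate scale,
    as typically used for Gammatone filter banks."""
    def hz_to_erb_rate(f):
        # Glasberg & Moore ERB-rate mapping
        return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    def erb_rate_to_hz(e):
        return (10.0 ** (e / 21.4) - 1.0) / 4.37 * 1000.0
    e_low, e_high = hz_to_erb_rate(f_low), hz_to_erb_rate(f_high)
    return erb_rate_to_hz(np.linspace(e_low, e_high, n_bands))
```

Spacing the bands this way makes low-frequency bands narrow and high-frequency bands wide, matching the resolution of human hearing.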
A pre-processor, which emulates the transfer behaviour of the human outer- and middle ear, is applied to the time-discrete noisy input signal (i.e. the desired speech contaminated by noise and interference).
In each sub-band, the level of the noisy signal is detected and smoothed. These narrow-band level detectors applied to each of the plurality of sub-bands utilize the phase of simple low-order filter sections to provide lowest signal processing delay.
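A minimal envelope detector of this kind can be sketched with a single recursive section; note that this one-pole follower is only an assumed stand-in, as the actual system uses the phase of low-order filter sections to achieve the lowest delay:

```python
def subband_level(samples, smoothing=0.99):
    """First-order envelope detector for one sub-band: rectify,
    then smooth with a one-pole low-pass. A single low-order
    recursive section keeps the added group delay very small."""
    level, out = 0.0, []
    for x in samples:
        level = smoothing * level + (1.0 - smoothing) * abs(x)
        out.append(level)
    return out
```

The smoothing coefficient trades off ripple against response time; it is an illustrative value here.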
From the smoothed envelope of the sub-band signals the noise level is estimated in each sub-band utilizing a heuristic approach based on recursive Minimum-Statistics.
The instantaneous signal-to-noise-ratio (SNR) in each sub-band is estimated from the envelope of the noisy signal and the noise level estimate.
The a priori SNR is estimated from the instantaneous SNR by applying the Ephraim-and-Malah Spectral Subtraction Rule (EMSR). In order to minimize the influence of estimation errors an improved decision directed approach (DDA) is proposed, introducing an underestimation parameter and a noise floor parameter.
Temporal masking based on human auditory perception is taken into account by appropriate filtering of the sub-band signals. These non-linear auditory post-masking filters apply recursive averaging to falling slopes of the signal level detected in each sub-band, with the following effects: (a) over-estimating variances of impulsive noise, (b) noise suppression algorithms do not affect the signal below the temporal masking threshold, and (c) no additional signal delay is introduced to transient signals, which is important in speech perception.
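The falling-slope behaviour of such a post-masking filter can be illustrated with an instantaneous attack and a recursive release; the release coefficient below is an assumed, illustrative value:

```python
def post_masking_filter(levels, release=0.95):
    """Non-linear post-masking smoother: rising level is passed
    instantly (no added delay on transients), while falling slopes
    decay recursively, mimicking temporal (post-)masking."""
    masked, out = 0.0, []
    for lv in levels:
        if lv >= masked:
            masked = lv  # attack: follow rises immediately
        else:
            masked = release * masked + (1.0 - release) * lv
        out.append(masked)
    return out
```

Because the filter only acts on falling slopes, transients pass through unchanged, which is effect (c) above.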
A non-linear gain function for each sub-band is derived from the a priori SNR estimates, comprising over-subtraction of the noise signal estimates.
The noisy signal in each sub-band is multiplied by the respective gain in order to suppress the noise signal components.
An optimized nearly perfect reconstruction filter-bank employing a decision criterion for signed summation re-synthesizes the enhanced full-band speech signal.
Finally, a post-processing filter is applied to the enhanced full-band signal to compensate the effect of the pre-processing filter.
NOTES: The noise reduction methods as cited above operate in the frequency domain using the Discrete Time Fourier Transform (DTFT), which is based on block processing of the time-discrete input signals. This block processing introduces a signal delay depending on the frame size.
Single channel subtractive-type speech enhancement systems are efficient in reducing background noise; however, they introduce a perceptually annoying residual noise. To deal with this problem, properties of the auditory system are introduced in the enhancement process. Auditory masking is modeled by the calculation of a noise-masking threshold in the frequency domain, below which all components are inaudible (see N. Virag, "Single Channel Speech Enhancement Based on Masking Properties of the Human Auditory System", IEEE Trans. on Speech and Audio Proc., vol. 7, no. 2, pp. 126-137, March 1999).
To model auditory masking in subtractive-type speech enhancement systems, filter bank implementations are especially attractive as they can be adapted to the spectral and temporal resolution of the human ear. The authors propose a noise suppression method based on spectral subtraction combined with Gammatone filter (GTF) banks divided into critical bands. The concept of critical bands, which describes the resolution of the human auditory system, leads to a nonlinearly warped frequency scale, called the Bark Scale (see J. O. Smith III and J. S. Abel, "Bark and ERB Bilinear Transforms," IEEE Trans. on Speech and Audio Proc., vol. 7, no. 6, pp. 697-708, Nov. 1999).
The use of Gammatone filter banks outperforms the DTFT based approaches in terms of computational complexity and overall system latency. Moreover, the GTF approach allows implementing a low-latency analysis-synthesis scheme with low computational complexity and nearly perfect reconstruction. The proposed synthesis filter creates the broadband output signal by a simple summation of the sub-band signals, introducing a criterion that indicates the necessity of sign alteration before summation. This approach outperforms channel vocoder based approaches as proposed e.g. by McAulay and Malpass (see R. J. McAulay and M. L. Malpass, "Speech Enhancement Using a Soft-Decision Noise Suppression Filter", IEEE Trans. on Acoust., Speech and Sig. Proc., vol. ASSP-28, no. 2, pp. 137-145, April 1980). Within that approach full-band reconstruction of the output signal is performed by the summation of alternately out-of-phase sub-band signals without considering the real phase relations between sub-bands. This introduces higher distortions in the output signal.
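The sign-decision summation can be illustrated as follows; the coherence criterion used here (sign of the inner product between each band and the running sum) is an assumed stand-in for the patent's actual decision criterion:

```python
import numpy as np

def resynthesize(subband_frames):
    """Full-band resynthesis by signed summation: each band is added
    with the sign that keeps it in phase with the running sum,
    avoiding the cancellation of coherent sub-band components."""
    total = np.zeros_like(subband_frames[0])
    for band in subband_frames:
        # Flip the band if it is anti-phase with the sum so far.
        sign = 1.0 if np.dot(total, band) >= 0.0 else -1.0
        total = total + sign * band
    return total
```

With two identical but anti-phase bands, plain summation would cancel them entirely, while the signed summation preserves the signal.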
Important note: Sub-band signals without downsampling, as often applied in hearing aid systems, do not require a resynthesis filter bank. Therefore this approach is applicable to low latency speech enhancement systems, but it is computationally highly inefficient. The method proposed by the authors allows calculating the output signal from the sub-band signals by simple summation, taking the phase differences into account!
It is worth mentioning that there are many applications, such as hearing aids or in-car communication systems, where the computational complexity and signal latency are of utmost importance.
The main advantages of the present invention compared to conventional noise reduction approaches are the significant improvements concerning overall signal latency and computational efficiency.
The invention is not restricted to the following embodiment. It is merely intended to explain the inventive principle and to illustrate one possible implementation.
According to the invention, the method for low-latency auditory-model based single channel noise suppression and reduction works as an independent module and is intended for installation into a digital signal processing chain, wherein the software-specified algorithm is implemented on a commercially available digital signal processor (DSP), preferably a special DSP for audio applications.
NOTES: With the Ephraim-and-Malah Spectral Subtraction Rule (EMSR) the clean speech signal amplitude is estimated subject to the given amplitude of the noisy signal and the estimated noise variance. To avoid artifacts like musical noise, a modified decision directed approach (DDA) is applied, introducing over-subtraction (under-estimation) of the noise variance plus a noise floor parameter.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG 1 is a schematic illustration of the single-channel sub-band speech enhancement unit of the present invention.
FIG 2 is a schematic illustration of the non-linear calculation of the gain factor for noise suppression applied to each sub-band.
FIG 3 and 4 show the roof-shaped MMSE-SP attenuation surface dependent on the a posteriori (γ_k) and the a priori (ξ_k) SNR. To include all values 0 ≤ γ_k < ∞, the x-axis corresponds to γ_k and not (γ_k − 1) as in the literature. The dash-dotted line in Fig. 3 marks the transition between the partitions of the gain function, the dashed line shows the power spectral subtraction contour. The contours of the DDA estimation are plotted in Fig. 4 upon the MMSE-SP attenuation surface. Dashed lines in Fig. 4 show the average of the dynamic relationships between γ_k and ξ_k, solid lines show static relationships.
FIG 5 and 6 are illustrations of the combined (modified) DDA and MMSE-SP estimation behaviour. Dashed lines in Fig. 5 show the average of the dynamic relationships between γ_k and ξ_k, solid lines show static relationships. The two fictitious hysteresis loops of Fig. 6 match the observations from informal experiments.
FIG 7 shows a block diagram of the overall-system.
FIG 8 shows the over-all system comprising auditory frequency analysis and resynthesis as front- and back end, and using special low-latency and low-effort speech enhancement in between. A combination of an elaborate noise suppression law with a human auditory model enables high quality performance.
FIG 9 shows an outer- and middle ear filter composed of three second order sections (SOS).
FIG 10 shows an example: Three-Zero Gammatone filter of order 3. The common zero at z = 1 is not included in this figure.
FIG 11 shows a familiar way of level-detection. As the signal power is used, the squared amplitude is detected.
FIG 12 shows the Low-Latency FIR level detector
FIG 13 shows a non-linear recursive auditory post-masking filter, responding to falling slopes.
FIG 14 shows a recursive noise level estimator using three time-constants and a counter threshold.
DETAILED DESCRIPTION
In this description new aspects are brought forward concerning the Ephraim and Malah noise suppression rule (EMSR) and the decision directed approach (DDA) for a priori signal to noise ratio (SNR) estimation. After partitioning the domain of the amplitude estimator, it becomes clear that the combined DDA estimation obeys an unshaped hysteretic cycle. Introducing a hysteresis width parameter improves the hysteresis shape and reduces musical noise. Eventually, we obtain a more flexible noise suppressor with less dependency on the system sample rate.
I. INTRODUCTION
The Ephraim and Malah amplitude estimator and the Ephraim and Malah decision directed a priori SNR estimate (Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, nr. 6, vol. ASSP-32, pp. 1109-1121, Dec. 1984 and Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, nr. 2, vol. ASSP-33, pp. 443-445, Apr. 1985) are a powerful means of noise suppression in speech signal processing. Actually there are quite a lot of recently published works on both issues, as the combined algorithm is a powerful tool on the one hand (O. Cappé, "Elimination of the Musical Noise Phenomenon with the Ephraim and Malah Noise Suppressor", IEEE Transactions on Speech and Audio Processing, nr. 2, vol. 2, pp. 345-349, Apr. 1994), but on the other hand simplifications (P. J. Wolfe and S. J. Godsill, "Simple Alternatives to the Ephraim and Malah Suppression Rule for Speech Enhancement", Proc. 11th IEEE Signal Processing Workshop, pp. 496-499, 6-8 Aug. 2001) as well as enhancements (I. Cohen and B. Berdugo, "Speech Enhancement for non-stationary noise environments", Signal Processing, no. 11, pp. 2403-2418, Elsevier, Nov. 2001; I. Cohen, "Speech Enhancement Using a Noncausal A Priori SNR estimator", IEEE Signal Processing Letters, no. 9, pp. 725-728, Sep. 2004; I. Cohen, "Relaxed Statistical Model for Speech Enhancement and A Priori SNR Estimation", Center for Communication and Information Technologies, Israel Institute of Technology, Oct. 2003, CCIT Report no. 443; M. K. Hasan, S. Salahuddin, M. R. Khan, "A Modified A Priori SNR for Speech Enhancement Using Spectral Subtraction Rules", IEEE Signal Processing Letters, vol. 11, no. 4, pp. 450-453, April 2004) are desirable.
In the amplitude estimation part of the algorithm a signal model is considered in which a noisy signal y[n] consists of speech x[n] and additive noise d[n], at time-index n. The signals x[n] and d[n] are assumed to be statistically independent Gaussian random variables. Due to certain properties of the Fourier transform, the same statistical model can be assumed for the corresponding complex short-term spectral amplitudes X_k[m] and D_k[m] in each frequency bin k, at analysis time m (complex quantities are underlined in the original notation; for simplicity of notation X_k[m] shall represent the magnitude of the complex spectral amplitude). Given the speech and noise variances σ²_{x,k} and σ²_{d,k}, the clean speech amplitude X_k[m] can be estimated from the noisy speech Y_k[m]. An eligible estimator X̂_k[m] for the clean speech amplitude is described in section I-A.
The unknown clean speech variance σ²_{x,k} is implicitly determined in the a priori SNR estimation part of the algorithm, whereas the noise variance σ²_{d,k} has to be determined in advance, e.g. using Minimum Statistics (R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Trans. Speech and Audio Proc., vol. 9, no. 5, July 2001), MCRA (I. Cohen and B. Berdugo, "Speech Enhancement for non-stationary noise environments", Signal Processing, no. 11, pp. 2403-2418, Elsevier, Nov. 2001), or Harmonic Tunneling (D. Ealey, H. Kelleher, D. Pearce, "Harmonic Tunneling: Tracking Non-Stationary Noises During Speech", Proc. Eurospeech, 2001).
The decision directed estimation described in section I-B determines the a priori SNR ξ_k = σ²_{x,k}/σ²_{d,k} for each frequency bin k. Additionally, the noise suppressor utilizes an instantaneous estimate, the so-called a posteriori SNR, which relates the square of the current noisy magnitude to the noise variance: γ_k[m] = Y²_k[m]/σ²_{d,k}.
In section II an overview of the combined estimation is given, and its hysteretic shape is presented. Furthermore, in section III it is shown how a slight modification can reduce unwanted estimation behaviour and enable a smoother estimation hysteresis.
A. The Ephraim and Malah Suppression Rule (EMSR)
As mentioned above, the EMSR reconstructs the magnitude of the clean speech signal X_k[m] from the noisy observation Y_k[m]. As magnitudes at different time-steps m are assumed to be statistically independent, the time index m may be dropped for simplicity of notation.
Ephraim and Malah's MMSE-SA estimator (Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, nr. 6, vol. ASSP-32, pp. 1109-1121, Dec. 1984) solves the Bayesian formula X̂_k = E{X_k | Y_k} to estimate the clean speech magnitude X_k. Applying different distortion measures to the amplitude, other estimators were derived in similar ways, i.e. the MMSE-LSA (Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, nr. 2, vol. ASSP-33, pp. 443-445, Apr. 1985) X̂_k = exp(E{ln X_k | Y_k}), and Wolfe and Godsill's MMSE-SP (P. J. Wolfe and S. J. Godsill, "Simple Alternatives to the Ephraim and Malah Suppression Rule for Speech Enhancement", Proc. 11th IEEE Signal Processing Workshop, pp. 496-499, 6-8 Aug. 2001) X̂_k = √(E{X²_k | Y_k}). For a more detailed description refer to Cohen (I. Cohen, "Relaxed Statistical Model for Speech Enhancement and A Priori SNR Estimation", Center for Communication and Information Technologies, Israel Institute of Technology, Oct. 2003, CCIT Report no. 443).
According to Ephraim and Malah the noisy phase is an optimal estimate of the clean phase. Thus the reconstruction operator is a real-valued spectral weight G_k[m]:
X̂_k[m] = G_k[m] · Y_k[m].
Because of its simplicity, we have chosen the Wolfe and Godsill MMSE-SP, Eq. (3), as the basis of our considerations. The corresponding weighting rule can be expressed as
G_MMSE-SP = √( ξ_k/(1+ξ_k) · (1+ν_k)/γ_k ),
using the definition
ν_k = ξ_k/(1+ξ_k) · γ_k.
In order to simplify its application, we partition the reconstruction operator into a few regions
Figure imgf000013_0007
Additionally, we can approximate the Wiener Filter by
Figure imgf000013_0003
Combining both, we can divide the MMSE-SP surface into logarithmically flat partitions (see also
Figure imgf000013_0005
Note that in the following sections we use the short form G when we refer to G_MMSE-SP.
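The Wolfe and Godsill MMSE-SP weighting rule, G = √(ξ/(1+ξ) · (1+ν)/γ) with ν = ξγ/(1+ξ), can be evaluated directly; this sketch uses assumed function names:

```python
import numpy as np

def mmse_sp_gain(xi, gamma):
    """MMSE-SP spectral gain (Wolfe & Godsill) as a function of the
    a priori SNR xi and the a posteriori SNR gamma."""
    nu = xi / (1.0 + xi) * gamma
    return np.sqrt(xi / (1.0 + xi) * (1.0 + nu) / gamma)
```

Note that, unlike plain spectral subtraction, this gain can slightly exceed 1 in regions where the a priori SNR is high while the a posteriori SNR is low.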
B. The Decision Directed Approach (DDA)
The DDA combines two basic SNR estimators into a new estimator of the a priori SNR ξ_k.
The first estimator is the instantaneous SNR, (γ_k − 1) = Y²_k[m]/σ²_{d,k} − 1 = (Y²_k[m] − σ²_{d,k})/σ²_{d,k}.
Allowing only positive SNR values we get
SNR_inst = max(γ_k − 1, 0),
which can be calculated before noise reduction. This instantaneous SNR will differ from the true SNR in the following cases:
• when the analysis time-window is too short regarding the stationarity of the signals x[n] and d[n],
• when there is non-stationary noise that can't be identified in detail, or
• when noise and speech signals are highly correlated.
The second estimator describes the reconstructed SNR, which is calculated after noise reduction using
SNR_rec[m] = G²[m−1] · γ_k[m−1].
In bad SNR conditions, e.g. 0 ≤ γ_k < 2, the a posteriori SNR γ_k shows relative variations in time that are smaller than those of (γ_k − 1) (relative variations, e.g. 10·log(γ_k[m]) − 10·log(γ_k[m−1]), are more significant than linear variations regarding human auditory perception). Ideally, G provides a consistently high attenuation under low SNR conditions. Therefore, the reconstructed SNR_rec will take more consistent values than SNR_inst in the low SNR case. Eventually, the DDA for estimation of the a priori SNR combines both SNR_inst and SNR_rec:
ξ̂_k[m] = α · G²[m−1] γ_k[m−1] + (1−α) · max(γ_k[m] − 1, 0).
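The classic decision-directed combination of the two SNR estimators can be sketched per bin as follows (function and variable names are assumptions):

```python
def decision_directed_xi(gain_prev, gamma_prev, gamma_now, alpha=0.98):
    """Decision-directed a priori SNR estimate: a weighted sum of
    the reconstructed SNR from the previous frame and the clipped
    instantaneous SNR of the current frame."""
    snr_rec = (gain_prev ** 2) * gamma_prev   # SNR after previous suppression
    snr_inst = max(gamma_now - 1.0, 0.0)      # clipped instantaneous SNR
    return alpha * snr_rec + (1.0 - alpha) * snr_inst
```

With α close to 1 the estimate follows the smoothed, reconstructed SNR, which is what suppresses the frame-to-frame fluctuations that cause musical noise.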
The specific estimation properties can be observed by inserting the suppression gain into the DDA.
II. COMBINING DDA AND EMSR
Using the partitions of the Wolfe and Godsill reconstruction operator G_MMSE-SP from section I-A and inserting them into the Ephraim and Malah DDA (7), the combined a priori SNR estimation exhibits the following spheres of action:
Figure imgf000014_0003
Figure imgf000015_0001
The characteristics of the combined approach can be seen in Fig. 4. Considering the magnitude of a speech signal and a constant noise level, i.e. a time-varying a posteriori SNR γ_k as input sequence, one can imagine a kind of hysteretic loop evolving on the MMSE-SP surface. Besides the obvious discontinuities in this loop, other properties can be shown (O. Cappé, "Elimination of the Musical Noise Phenomenon with the Ephraim and Malah Noise Suppressor", IEEE Transactions on Speech and Audio Processing, nr. 2, vol. 2, pp. 345-349, Apr. 1994).
A. Recursive Averaging
1) Expectation by Recursive Averaging: In the above enumeration we can see that in partition 1 the a priori SNR estimation corresponds to recursive averaging (Eq. (8)) of the instantaneous SNR_inst (5). It is feasible to generalize the averaging process by introducing a time-constant τ_avg specifying the averaging parameter α = exp[−1/(τ_avg · f_s)]. Here, the sample rate f_s = 1/T denotes the amount of time-frequency transformations per second.
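Deriving the averaging parameter from a time-constant, as described, makes the smoothing behaviour independent of the analysis rate:

```python
import math

def averaging_coefficient(tau_avg, f_s):
    """Smoothing coefficient alpha = exp(-1/(tau_avg * f_s)):
    the same time-constant yields the same effective smoothing
    regardless of how many analysis frames run per second."""
    return math.exp(-1.0 / (tau_avg * f_s))
```

For example, τ_avg = 2 ms at an analysis rate of 500 frames per second gives α = e^(−1); at 1000 frames per second the coefficient rises toward 1 so the decay time stays the same.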
2) The Constant-ξ-Effect: If the a priori SNR ξ_k takes a constant value in partition 1, e.g. in case of a large time-constant τ_avg, or at the border of ξ_k's value range, the estimator can operate strangely. At a small and constant ξ_k, the system will hold its output magnitude at a constant level. This happens whenever the input is small enough, (Y²_k[m]/σ²_{d,k} − 1) ≪ 1/ξ_k, i.e. Y²_k[m] ≪ σ²_{d,k}/ξ_k (using (8) and its preconditions):
Figure imgf000016_0001
Under certain circumstances this can lead to annoying additional broad-band noise, which could even be worse when a limitation of ξ_k to a minimum ζ causes a constant output magnitude only for Y²_k[m] < σ²_{d,k}/ζ.
3) Unstable Recursive Averaging: Following Eq. (12), partition 5 can lead to a priori SNR estimation by unstable recursive averaging of SNR_inst when α > 1/2, i.e. ξ_k can increase suddenly in this partition.
B. Partitions Without Recursive Averaging
In the partitions 2, 3, and 4 the recursive averaging interpretation is not useful. Namely, in Eq. (9) the a priori SNR estimate ξ_k takes a constant value, and in Eq. (11) ξ_k is determined by a single tap delay. It seems odd that in Eq. (10) the SNR ξ_k is a down-scaled version of SNR_inst.
C. Summary of Properties
Actually, every partition except for 1 and 4 (Eqs. (8) and (11)) exhibits some unexpected behaviour. Defining α by a time-constant, we obtain generalized averaging properties in Eq. (8), whereas a sample rate dependent behaviour is introduced to the estimation defined by (9)-(12). This form of sample rate dependency rules out a general parameter set suitable for different analysis time-steps and transformation sizes.
Awkward estimation behavior, e.g. the "constant-ξ-effect", and the discontinuities in the hysteresis loop (Fig. 4) give rise to considerations concerning a modification of the DDA and a reconsideration of time-constant and minimum a priori SNR quantities.
III. A MODIFIED, FAST RESPONDING DDA
In order to minimize the influence of unexpected estimation performance, we modify the decision directed approach to
ξ̂_k[m] = α · G²[m−1] γ_k[m−1] + (1−α) · max(ρ · (γ_k[m] − 1), ζ),
with ζ being a noise-floor parameter (O. Cappé, "Elimination of the Musical Noise Phenomenon with the Ephraim and Malah Noise Suppressor", IEEE Transactions on Speech and Audio Processing, nr. 2, vol. 2, pp. 345-349, Apr. 1994) and ρ being an under-estimation parameter of the instantaneous SNR. Similar to the partitions in section II, we can now find:
Figure imgf000017_0001
Regarding the partitions of the new estimator, an over-all estimation scheme can be shown in Fig. 5. Instead of time-constants in the range of speech quasi-stationarity, we now use τ_avg = 2 ms. ρ = 10^(−15/10) ensures that the scale factor in (17) is approximately ρ(1−α) ≈ ρ, which fixes the discontinuities of the estimation hysteresis. We can choose the noise-floor ζ = 10^(−25/10) so small that the maximum attenuation ζ lies at the bottom of the dynamical range of a frequency bin. These measures largely reduce the sample rate dependency described in section II-C and the "constant-ξ-effect" in section II-A.2.
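One plausible reading of the modified rule, with ρ under-estimating the instantaneous SNR and ζ acting as a floor, is sketched below; the exact placement of ρ and ζ inside the max is an assumption based on the description, and the defaults mirror the values quoted above:

```python
def modified_dd_xi(gain_prev, gamma_prev, gamma_now, alpha,
                   rho=10 ** (-15 / 10), zeta=10 ** (-25 / 10)):
    """Modified decision-directed estimate: the instantaneous SNR is
    scaled down by rho (widening the suppression hysteresis against
    musical noise) and bounded below by the noise floor zeta."""
    snr_rec = (gain_prev ** 2) * gamma_prev
    snr_inst = max(rho * (gamma_now - 1.0), zeta)
    return alpha * snr_rec + (1.0 - alpha) * snr_inst
```

Because ρ and the averaging constant are now separate parameters, musical noise suppression and smoothing speed can be tuned independently, which is the point made in the following paragraph.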
It becomes clear that rising instantaneous SNRs are now better attenuated according to Fig. 5 than in Fig. 4. Thus, a stronger attenuation of musical noise, i.e. inconsistently high instantaneous SNR, can be provided, while a signal with consistently high SNR is able to pass through the noise suppressor. The two curly loops in Fig. 6 give approximate examples of hysteresis loops occurring during system operation. In the recursive averaging partition the hysteresis path depends on the slope of the rising or falling signal amplitude.
The parameter ρ can directly control the suppression hysteresis width and the musical noise suppression. Our modification enables a separate control of the averaging time-constant and the musical noise suppression.
IV. CONCLUSION
We found a comprehensible way to graphically describe the properties of a combined Wolfe and Godsill spectral amplitude estimation and Ephraim and Malah decision directed a priori SNR estimation. This description can similarly be used for other amplitude estimation rules, and provides a new insight into the Ephraim and Malah noise suppressor.
So far the suppression of musical noise has been a trade-off between musical noise suppression and transient distortion. Small modifications in the decision directed estimation rule allow a more flexible handling of musical noise suppression, while reducing dependencies on the analysis time-step and the "constant-ξ-effect". An informal listening test using the modified algorithm with adjustable analysis time/frequency-resolution (filterbank approach) showed useful enhancements in the over-all algorithm.
Our further work will introduce our descriptive methods into the more elaborate estimation approaches of Cohen (I. Cohen, "Speech Enhancement Using a Noncausal A Priori SNR estimator", IEEE Signal Processing Letters, no. 9, pp. 725-728, Sep. 2004) or Hasan (M. K. Hasan, S. Salahuddin, M. R. Khan, "A Modified A Priori SNR for Speech Enhancement Using Spectral Subtraction Rules", IEEE Signal Processing Letters, vol. 11, no. 4, pp. 450-453, April 2004).
APPARATUS FOR LOW-LATENCY SINGLE CHANNEL SPEECH ENHANCEMENT
In the following a preferred embodiment will be described; however, the invention is not limited to this embodiment.
The reduction of musical noise in noise suppression algorithms is still an issue in noise reduction. Although the Ephraim and Malah suppression rule (EMSR) and the decision directed approach (DDA) show good performance, additional means have to be applied. Moreover, processing delays arising from signal analysis (fast Fourier transform, FFT) pose a problem in real-time applications. Essential improvements in both issues can be achieved by implementing signal analysis and filtering approaches capable of modelling human auditory perception and of reducing latency.
V. INTRODUCTION
The major part of this description is dedicated to auditory signal preparation and analysis, using efficient algorithms with low latency. Our system combines an auditory Gammatone filterbank (R. F. Lyon, "The All-Pole Gammatone Filter and Auditory Models", Proc. Forum Acusticum, Antwerpen 1996; L. Lin, E. Ambikairajah, W. H. Holmes, "Auditory Filterbank Design Using Masking Curves", Proc. EUROSPEECH Scandinavia, 7th European Conference on Speech Communication and Technology, 2001; L. Lin, E. Ambikairajah, W. H. Holmes, "Perceptual Domain Based Speech and Audio Coder", Proc. of the third International Symposium DSPCS 2002, Sydney, Jan. 28-31, 2002) with the Ephraim and Malah noise suppression rule (Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, nr. 6, vol. ASSP-32, pp. 1109-1121, Dec. 1984; Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, nr. 2, vol. ASSP-33, pp. 443-445, Apr. 1985; P. J. Wolfe and S. J. Godsill, "Simple Alternatives to the Ephraim and Malah Suppression Rule for Speech Enhancement", Proc. 11th IEEE Signal Processing Workshop, pp. 496-499, 6-8 Aug. 2001). This combination is newly introduced by the authors, whereas the combination of an auditory Gammatone filterbank with a Wiener noise suppressor is known from (L. Lin, E. Ambikairajah, "Speech Denoising Based on an Auditory Filterbank", 6th ICSP, International Conference on Signal Processing, pp. 552-555, 26-30 Aug. 2002), and a frequency domain solution is known from WO 00/30264 (International application No. PCT/SG99/00119). Furthermore, the integration of a time-domain outer and middle ear filter, as well as the integration of a non-linear temporal post-masking filter (G. Stoll, J. G. Beerends, R. Bitto, K. Brandenburg, C. Colomes, B. Feiten, M. Keyhl, C. Schmidmer, T. Sporer, T. Thiede, W. C. Treurniet, "PEAQ - der neue ITU-Standard zur objektiven Messung der wahrgenommenen Audioqualität", RTM - Rundfunktechnische Mitteilungen, die Fachzeitschrift für Hörfunk und Fernsehtechnik, 43. Jahrgang, ISSN 0035-9890 (81-120), Firma Mensing GmbH + Co. KG, Abteilung Verlag, Sept. 1999; L. Lin, E. Ambikairajah, W. H. Holmes, "Perceptual Domain Based Speech and Audio Coder", Proc. of the third International Symposium DSPCS 2002, Sydney, Jan. 28-31, 2002) into the noise suppression system is new. Additionally, a narrow-band low-latency level detection exploiting the phase of simple first order filters is newly introduced. Finally, we present a simple scheme for signal reconstruction (resynthesis) avoiding band-edge signal cancellation. The novel elements are:
• Combination of an auditory Gammatone Filterbank and the EMSR noise suppressor in a time-domain approach
• Integration of outer and middle ear filters into the suppression system in a time-domain approach
• Integration of an auditory post-masking filter
• Low-latency narrow band level-detector
• Low-effort Wolfe and Godsill signal restoration
• Low-latency up-sampling
• Low-latency resynthesis restraining destructive interference
VI. SYSTEM OVERVIEW
The over-all system is shown as a block diagram in Fig. 7. It can be implemented as an analog or digital effect processor or as part of a software algorithm. The over-all system contains several subsystems (Fig. 8):
• an outer and middle ear filter (HOME),
• a Gammatone filterbank analysis section (GFB),
• the low-latency level detection (LD),
• the auditory post-masking filter (PM),
• a recursive noise spectrum estimation (NE),
• the spectral subtraction weight (EMSR),
• the low-latency upsampling (↑L),
• the vocoder stage, and
• the inverse outer and middle ear filter (H⁻¹OME).
VII. OUTER AND MIDDLE EAR FILTER
An outer and middle ear filter consists of three second order sections (SOS) representing the following physiological parts of the human ear (E. Zwicker, H. Fastl, "Psychoacoustics, facts and models", Springer, Berlin Heidelberg, 1999; E. Terhardt, "Akustische Kommunikation", Springer, Berlin Heidelberg, 1998):
1) the high-pass attenuation curve below 1 kHz modelling the 100-Phon curve, which represents the acoustic impedance of the outer ear and the mechanic impedance of the middle ear ossicles,
2) the resonance of the ear channel, and
3) the low-pass attenuation curve above 1 kHz modelling the threshold of hearing.
The latter two filters are optional, whereas the high-pass component is mandatory and reduces the influence of low-frequency noise on the noise suppressor.
In the end, a filter structure providing an appropriate magnitude transfer function could look like Fig. 9. All three filter sections have to be second order sections to provide appropriate slopes. The outer filter skirts can be modelled as second order low- and high-pass shelving filters, whereas the resonance can be modelled as a parametric peak-filter (P. Dutilleux, U. Zölzer, "DAFX", Wiley & Sons, 2002).
The filter inversion is straight-forward. However, if the filter has zeros on the unit circle, e.g. at z = 1 in the z-domain, an exact inverse filter cannot undo them without becoming unstable; a pole at e.g. z = 0.99 can be a proper choice for approximately inverting a zero at z = 1.
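As a sketch of this regularization idea (the filter and the pole value below are illustrative, not taken from the patent): a first-order filter H(z) = 1 - z⁻¹ has a zero at z = 1; its exact inverse would place a pole on the unit circle, so the inverse pole is pulled slightly inside, e.g. to z = 0.99.

```python
def fir_high_pass(x):
    """H(z) = 1 - z^-1: a single zero at z = 1 (illustrative example)."""
    y, prev = [], 0.0
    for s in x:
        y.append(s - prev)
        prev = s
    return y

def regularized_inverse(x, pole=0.99):
    """Approximate inverse 1 / (1 - pole*z^-1): the z = 1 zero is
    'undone' by a stable pole just inside the unit circle."""
    y, state = [], 0.0
    for s in x:
        state = s + pole * state
        y.append(state)
    return y

# Cascading the filter and its regularized inverse nearly restores an impulse:
impulse = [1.0] + [0.0] * 63
restored = regularized_inverse(fir_high_pass(impulse))
print(restored[0])                        # 1.0
print(max(abs(v) for v in restored[1:]))  # small residual (~0.01)
```

The residual decays geometrically with the pole radius; moving the pole closer to the unit circle reduces the residual at the price of a longer-ringing inverse.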
VIII. FREQUENCY GROUPING / AUDITORY BANDWIDTHS
Frequency grouping is an important effect in human loudness perception. The perceived loudness consists of particular loudnesses associated with individual frequency ranges. An auditory frequency scale can be used to model these frequency grouping effects, the units of which can be seen as the frequency resolution of human auditory loudness perception (E. Zwicker, H. Fastl, "Psychoacoustics, facts and models", Springer, Berlin Heidelberg, 1999). We denote an arbitrary auditory frequency transform with the operator 𝔅{ } and the corresponding inverse frequency transform with 𝔅⁻¹{ }. A reasonable frequency scale using a low number of frequency groups can be given by the formula of Traunmüller (E. Terhardt, "Akustische Kommunikation", Springer, Berlin Heidelberg, 1998):
v = 𝔅{f} = 26.81 · f / (1960 + f) - 0.53 [Bark]
Accordingly, the inverse transform 𝔅⁻¹{ } is
f = 𝔅⁻¹{v} = 1960 · (v + 0.53) / (26.28 - v) [Hz]
The center frequencies fk of the auditory filterbank can be calculated by applying the inverse transform fk = 𝔅⁻¹{vk} to an equally spaced scale vk (with spacing dv, e.g. dv = 1 [Bark]) in the Bark-domain. Similarly, the bandwidths Bk can be derived from Bk = 𝔅⁻¹{vk + dv/2} - 𝔅⁻¹{vk - dv/2}. Other Bark-scales (e.g. E. Zwicker, H. Fastl, "Psychoacoustics, facts and models", Springer, Berlin Heidelberg, 1999) use smaller bandwidths resulting in auditory filters with more group-delay, thus the above spacing is preferred.
Note that we use v for the Bark-frequency instead of z in order to avoid confusion with the z-domain variable z.
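Assuming the Traunmüller formula v = 26.81·f/(1960 + f) - 0.53 and its inverse as stated above, the center frequencies and bandwidths of a dv = 1 Bark filterbank can be tabulated as follows (a sketch; the function names are ours):

```python
def bark(f):
    """Traunmueller auditory frequency transform B{f} (Hz -> Bark)."""
    return 26.81 * f / (1960.0 + f) - 0.53

def inv_bark(v):
    """Inverse transform B^-1{v} (Bark -> Hz)."""
    return 1960.0 * (v + 0.53) / (26.28 - v)

dv = 1.0  # Bark spacing of the filterbank channels
scale = [k * dv for k in range(1, 21)]          # vk = 1..20 Bark
centers = [inv_bark(v) for v in scale]          # fk = B^-1{vk}
bandwidths = [inv_bark(v + dv / 2) - inv_bark(v - dv / 2) for v in scale]

print(round(bark(1000.0), 2))   # ~8.53 Bark at 1 kHz
print(round(centers[0], 1))     # center frequency of channel k = 1
```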
IX. AUDITORY GAMMATONE FILTERS
Auditory Gammatone filters (R. F. Lyon, "The All-Pole Gammatone Filter and Auditory Models", Proc. Forum Acusticum, Antwerpen 1996) can be efficiently implemented in the time-domain, allowing the separation of a broadband audio signal into auditory band signals. The magnitude response of the Gammatone filter corresponds to the simultaneous masking properties of the human ear. Plotted along an auditory frequency scale, the magnitude response keeps the same shape regardless of the center frequency the filter is designed for. The general form representing the family of Gammatone filters of order m, wherein k is the filterbank channel index and *GF denotes an arbitrary Gammatone filter variant (e.g. GF, APGF, OZGF, TZGF), is given by the z-transform:
H*GF,k(z) = g*GF,k · Hnum,k(z) / (1 - 2 rk cos(θk) z⁻¹ + rk² z⁻²)^m    (21)
Digital center frequencies θk and pole radii rk are derived from the continuous-time quantities center frequency fk, bandwidth Bk, the band-edge rejection CdB (e.g. CdB = -5 [dB]), and the sample rate fs:
[Equations (19), (20) and (22), mapping fk, Bk, CdB and fs to the digital center frequency θk and the pole radius rk, appear as formula images in the original document and are not reproduced here.]
An auditory Gammatone filterbank consists of a set of overlapping Gammatone filters that divide the auditory frequency scale into equally spaced frequency bands. An order m = 4 is frequently used in the literature, whereas the order m = 3 is proposed here to minimize computational cost. The gain term g*GF shall be adjusted to provide unity gain at the center frequency fk. For each special form of Gammatone filter, the numerator Hnum,k(z) has to be adapted suitably, as shown in the following sub-sections.
A. Ordinary Gammatone filter
The ordinary Gammatone filter (GF; R. F. Lyon, "The All-Pole Gammatone Filter and Auditory Models", Proc. Forum Acusticum, Antwerpen 1996) has to be derived from the continuous-time impulse response using the Laplace transform and the impulse-invariance transform (A. V. Oppenheim, R. W. Schafer, J. R. Buck, "Discrete-Time Signal Processing", Prentice Hall, 1999):
[Equation: impulse-invariant design of the GF, given as a formula image in the original document and not reproduced here.]
which determines the unknown polynomial Hnum,k(z) in the above equation (21). Due to its shape and computational cost its use is not recommended.
B. All-Pole Gammatone filter
The All-Pole Gammatone filter (APGF) is obtained by simply setting the polynomial Hnum,k(z) = 1 in equation (21). It is the most efficient Gammatone filter (R. F. Lyon, "The All-Pole Gammatone Filter and Auditory Models", Proc. Forum Acusticum, Antwerpen 1996).
C. One-Zero Gammatone filter
Setting Hnum,k(z) = (1 - z⁻¹) in equation (21) leads to the so-called One-Zero Gammatone filter (OZGF; R. F. Lyon, "The All-Pole Gammatone Filter and Auditory Models", Proc. Forum Acusticum, Antwerpen 1996). The OZGF can be efficiently composed of a single "one-zero" section common to all channels k, placed before the split into k All-Pole Gammatone filters.
D. Three-Zero Gammatone filter
When adding a pair of complex conjugate zeros z = rz · e^(±jθz,k), with the digital frequency θz,k placed 1 Bark above the center frequency θk and with radius rz ≈ 0.98, and one additional zero at z = 1, we obtain Hnum,k(z) = (1 - 2rz cos(θz,k) z⁻¹ + rz² z⁻²) · (1 - z⁻¹) for the Three-Zero Gammatone (TZGF) filter with its improved shape (L. Lin, E. Ambikairajah, W. H. Holmes, "Auditory Filterbank Design Using Masking Curves", Proc. EUROSPEECH Scandinavia, 7th European Conference on Speech Communication and Technology, 2001). In comparison, the computational cost of the One-Zero Gammatone filter of order m + 1 equals that of the Three-Zero Gammatone filter of order m, when, again, a single "one-zero" is shared by all channels k. The appropriate transforms and the digital frequency θz,k follow from equations (19), (20) and (22).
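As a minimal sketch of one All-Pole Gammatone channel (Hnum,k(z) = 1) realized as a cascade of m identical two-pole sections with poles at rk·e^(±jθk), normalized to unity gain at the center frequency: the pole radius r = 0.98 below is illustrative and not derived from equations (19), (20) and (22), and the function names are ours.

```python
import math, cmath

def apgf(x, theta, r, m=3):
    """All-Pole Gammatone channel: m cascaded two-pole sections with
    poles at r*exp(+/-j*theta), scaled to unity gain at theta."""
    a1, a2 = -2.0 * r * math.cos(theta), r * r
    z = cmath.exp(1j * theta)
    g = abs(1.0 + a1 / z + a2 / z ** 2) ** m  # |denominator|^m at theta
    y = list(x)
    for _ in range(m):                        # cascade of m sections
        w1 = w2 = 0.0
        for n, s in enumerate(y):
            out = s - a1 * w1 - a2 * w2       # y[n] = x[n]-a1*y[n-1]-a2*y[n-2]
            w2, w1 = w1, out
            y[n] = out
    return [g * v for v in y]                 # normalize: |H(e^{j*theta})| = 1

# Steady-state amplitude of a sinusoid at the center frequency:
fs, f0 = 16000.0, 1000.0
theta = 2.0 * math.pi * f0 / fs
x = [math.sin(theta * n) for n in range(4000)]
y = apgf(x, theta, r=0.98)
seg = y[3000:3800]                            # 50 full periods, settled
amp = math.sqrt(2.0 * sum(v * v for v in seg) / len(seg))
print(round(amp, 2))  # close to 1.0: unity gain at the center frequency
```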
X. RESYNTHESIS
A resynthesis of the broadband signal from the auditory band signals can be implemented as a summation of all band signals. Unfortunately, this can cause destructive signal cancellation at the overlap between neighbouring channels. Therefore, we derived a simple criterion that indicates the necessity of a sign alternation for every second channel before signal summation:
[Equation: sign-alternation criterion, given as a formula image in the original document and not reproduced here.]
Using this criterion, the frequency response of the superposition of all band signals lies between approximately CdB + 3 [dB] and 0 [dB]. Omitting a necessary sign alternation can result in destructive signal cancellation at the band-edges of adjacent filters.
XI. (LOW-LATENCY) LEVEL DETECTION
Masking effects modelled by the auditory filterbank cannot be exploited unless the amplitude of the filterbank channels is determined. Suitable ways of level detection are proposed in the following sub-sections.
We propose to use the first, simple approach for the high-frequency channels, and the low-latency approach for the low-frequency bands.
A. Ordinary Level-Detection with pre-masking
Usually, non-linearities like e.g. the absolute value, the square, or half-wave rectification are used to transform the signal amplitude into the base band around 0 Hz. A subsequent smoothing filter removes components at higher frequencies, yielding the desired amplitude signal. Fig. 11 provides an example, which also takes the form-factor F into account.
This commonly used approach to amplitude detection is computationally efficient, but the smoothing filters introduce group delays in the signal path that have to be compensated for. We recommend describing the recursive smoothing parameter α by a time-constant τavg in [s]:
[Equation: α expressed through the time-constant τavg and the sample rate fs, given as a formula image in the original document and not reproduced here.]
Suitable time-constants match the auditory pre-masking time-constant, which is approximately τavg ≈ 2 [ms] (G. Stoll, J. G. Beerends, R. Bitto, K. Brandenburg, C. Colomes, B. Feiten, M. Keyhl, C. Schmidmer, T. Sporer, T. Thiede, W. C. Treurniet, "PEAQ - der neue ITU-Standard zur objektiven Messung der wahrgenommenen Audioqualität", RTM - Rundfunktechnische Mitteilungen, die Fachzeitschrift für Hörfunk und Fernsehtechnik, 43. Jahrgang, ISSN 0035-9890 (81-120), Firma Mensing GmbH + Co. KG, Abteilung Verlag, Sept. 1999).
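A sketch of this ordinary level detector, assuming the conventional mapping α = e^(-1/(τavg·fs)) between smoothing parameter and time-constant (our assumption; the patent's formula image is not reproduced above):

```python
import math

def level_detect(x, fs, tau_avg=0.002):
    """Ordinary level detection: rectification (absolute value) followed
    by a recursive one-pole smoother with time-constant tau_avg.
    alpha = exp(-1/(tau_avg*fs)) is the conventional mapping (assumed)."""
    alpha = math.exp(-1.0 / (tau_avg * fs))
    y, level = [], 0.0
    for s in x:
        level = alpha * level + (1.0 - alpha) * abs(s)  # rectify + smooth
        y.append(level)
    return y

fs = 16000.0
x = [0.0] * 100 + [1.0] * 200     # a step in amplitude at sample 100
env = level_detect(x, fs)
# After tau_avg seconds (32 samples at 16 kHz) the envelope reaches ~63 %:
print(round(env[100 + 31], 2))
```

The group delay of this smoother is what the text says must be compensated for; it grows with τavg.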
B. Low-Latency Level detection
Our new method exploits the phase of simple filter sections. This method for level detection is also applicable to other technical fields and is not restricted to noise suppression alone.
Using a Hilbert transform, a consistent 90° phase shift can be applied to a broadband signal. Summing up the squares of the original and the shifted signal, the squared amplitudes (i.e. the signal power) remain while the sinusoidal components cancel. However, a causal implementation of the ideal Hilbert transform does not exist. Unlike an ideal Hilbert transformer, we only need a 90° phase shift in the considered frequency range, i.e. in the corresponding auditory frequency group.
We propose to use the following kinds of filters to provide a 90° phase shift at a frequency θk:
• a simple FIR first order section,
• a simple IIR first order all-pass (AP), and
• a simple delay line providing a λ/4 delay at θk.
Each of the above mentioned methods can provide a 90° phase shift to a virtually arbitrary frequency θk and is therefore suitable. One can choose between the following properties:
• FIR: numerically unstable around θk = [0, π/2, π]; provides the broadest band featuring a 90° phase shift.
• AP: numerically unstable around θk = [0, π/2, π]; the 90° phase frequency band is smaller and the computational effort higher.
• λ/4-delay: numerically stable; the smallest frequency band of 90° phase, low computational effort, but more memory needed.
Fig. 12 provides an example of the FIR level detection method. Appropriate parameters can be found using the phase equations of the corresponding systems, see e.g. A. V. Oppenheim, R. W. Schafer, J. R. Buck, "Discrete-Time Signal Processing", Prentice Hall, 1999.
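A sketch of the λ/4-delay variant, the simplest of the three: the sub-band signal is paired with a copy delayed by a quarter period at the band's frequency, and the squared amplitudes are summed (function and variable names are ours):

```python
import math

def quarter_wave_level(x, fs, f0):
    """Low-latency level detection: sum of squares of the in-phase
    signal and a lambda/4-delayed (i.e. 90-degree shifted) copy."""
    d = round(fs / (4.0 * f0))            # quarter-period delay at f0
    power = []
    for n, s in enumerate(x):
        q = x[n - d] if n >= d else 0.0   # quadrature branch
        power.append(s * s + q * q)       # sinusoid at f0 -> constant A^2
    return power

fs, f0, amp = 16000.0, 1000.0, 0.5
x = [amp * math.sin(2.0 * math.pi * f0 * n / fs) for n in range(64)]
p = quarter_wave_level(x, fs, f0)
print(round(p[10], 4), round(p[40], 4))   # both equal amp^2 = 0.25
```

Away from f0 the delay no longer gives exactly 90°, which is the "smallest frequency band of 90° phase" limitation noted above.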
XII. AUDITORY POST-MASKING
Using a non-linear post-masking filter (i.e. recursive averaging responding only to a falling slope) exhibits several benefits:
• the variance of impulsive noise is slightly over-estimated (over-subtraction) because of the post-masking,
• noise suppression algorithms cannot attenuate signals until the auditory post-masking time has elapsed,
• aliasing effects after downsampling and ripples in the amplitude signals are reduced due to the post-masking smoothing operation,
• although smoothing is applied, no group delay is introduced into the amplitude of important transient signals.
We propose a structure that works on the signal power detected in each channel (cf. Fig. 13; L. Lin, E. Ambikairajah, W. H. Holmes, "Perceptual Domain Based Speech and Audio Coder", Proc. of the third International Symposium DSPCS 2002, Sydney, Jan. 28-31, 2002).
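A sketch of such a falling-slope-only smoother operating on a power signal; the exact recursion of Fig. 13 is not reproduced above, so this is our assumption of its form:

```python
def post_masking(power, alpha=0.99):
    """Non-linear post-masking: rising inputs pass instantly (no group
    delay on transients); falling slopes decay by recursive averaging."""
    y, state = [], 0.0
    for p in power:
        if p >= state:
            state = p                                  # instant attack
        else:
            state = alpha * state + (1.0 - alpha) * p  # smoothed release
        y.append(state)
    return y

burst = [0.0] * 4 + [1.0] * 4 + [0.0] * 8
out = post_masking(burst)
print(out[4])        # 1.0: the onset is not delayed
print(out[9] > 0.9)  # True: the level is held up briefly after the offset
```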
The averaging parameter αk in channel k has to correspond to the human auditory post-masking time-constants at the corresponding frequencies fk. Therefore, we use the following equation to derive the averaging parameter αk:
[Equation (25): αk derived from the post-masking time-constant τk and the scaling parameter G, given as a formula image in the original document and not reproduced here.]
A parameter G can be used to scale the post-masking time-constants if useful.
The time-constant for 1 [Bark] is approximately τv1 ≈ 40 [ms], and for 20 [Bark] approximately τv20 ≈ 4 [ms] (G. Stoll, J. G. Beerends, R. Bitto, K. Brandenburg, C. Colomes, B. Feiten, M. Keyhl, C. Schmidmer, T. Sporer, T. Thiede, W. C. Treurniet, "PEAQ - der neue ITU-Standard zur objektiven Messung der wahrgenommenen Audioqualität", RTM - Rundfunktechnische Mitteilungen, die Fachzeitschrift für Hörfunk und Fernsehtechnik, 43. Jahrgang, ISSN 0035-9890 (81-120), Firma Mensing GmbH + Co. KG, Abteilung Verlag, Sept. 1999). The following equation can be used to derive τk:
[Equation (26): interpolation of the time-constant τk between τv1 and τv20 along the Bark scale, given as a formula image in the original document and not reproduced here.]
Alternatively, the equation in the above-cited reference can be used, but our formula provides a suitable interpolation with longer time-constants.
XIII. RECURSIVE MINIMUM STATISTICS
We can use the structure in Fig. 14 to estimate the noise level in each frequency band. Similar approaches can be found in R. Martin, "Noise Power Spectral Estimation Based on Optimal Smoothing and Minimum Statistics", IEEE Transactions on Speech and Audio Processing, nr. 5, vol. 9, pp. 504-512, Jul. 2001 or WO 00/30264 (International application No. PCT/SG99/00119).
This method essentially applies three averaging time-constants to the signal level. Falling slopes are slightly averaged, whereas during a rising input slope the output is held constant (i.e. an infinitely large time-constant) for a period of Nw sampling intervals. When Nw sampling intervals are exceeded, the rising signal slope is averaged with a third time-constant. The time-constants can be converted to recursive averaging parameters similarly to equations (25) and (26). An appropriate counter threshold Nw can be calculated from a continuous time interval Tw:
[Equation: counter threshold Nw derived from the time interval Tw, given as a formula image in the original document and not reproduced here.]
Suited to utterances or words of human speech, this time interval can be chosen as e.g. Tw ≈ 1.5 s. The falling slope time-constant can be a scaled version of the post-masking time-constants τk, or e.g. a constant 200 [ms].
The rising slope time-constant defining β can be approximately 700 [ms], which corresponds to a velocity of approximately 6 [dB]/[s]. Unlike the other time-constants, this one is proposed to be equal for all channels k.
The saturation operation in Fig. 14 can be expressed as:
[Equation: saturation operation of Fig. 14, given as a formula image in the original document and not reproduced here.]
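The hold-and-average logic described above can be sketched as follows. The parameter values and the exact update order are our assumptions, and the saturation operation of Fig. 14 is omitted:

```python
def noise_estimate(levels, a_fall=0.9, a_rise=0.999, n_w=20):
    """Recursive minimum statistics sketch: falling slopes are slightly
    averaged, rising inputs are held for n_w samples (speech bursts),
    then slowly tracked upward (~6 dB/s in the text)."""
    est, counter, out = levels[0], 0, []
    for x in levels:
        if x > est:                       # rising slope
            counter += 1
            if counter > n_w:             # sustained rise: track slowly
                est = a_rise * est + (1.0 - a_rise) * x
            # else: hold est constant during the (presumed speech) burst
        else:                             # falling slope: follow down
            est = a_fall * est + (1.0 - a_fall) * x
            counter = 0
        out.append(est)
    return out

# A short speech burst on top of stationary noise does not move the estimate:
signal = [0.1] * 30 + [1.0] * 10 + [0.1] * 30
est = noise_estimate(signal)
print(round(est[-1], 2))   # 0.1: the estimate stays at the noise floor
```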
XIV. EPHRAIM AND MALAH NOISE SUPPRESSION RULE (EMSR)
With the EMSR (Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, nr. 6, vol. ASSP-32, pp. 1109-1121, Dec. 1984; Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, nr. 2, vol. ASSP-33, pp. 443-445, Apr. 1985) we can estimate the clean speech amplitude subject to the given noisy speech amplitude and the noise variance. We can e.g. use the Wolfe and Godsill definition of the spectral weight (P. J. Wolfe and S. J. Godsill, "Simple Alternatives to the Ephraim and Malah Suppression Rule for Speech Enhancement", Proc. 11th IEEE Signal Processing Workshop, pp. 496-499, 6-8 Aug. 2001) and a modified decision directed approach (F. Zotter, M. Noisternig, R. Höldrich, "Speech Enhancement Using the Ephraim and Malah Suppression Rule and Decision Directed Approach: A Hysteretic Process", to appear in IEEE Signal Processing Letters, 2005. First manuscript submitted Jan. 24, 2005):
[Equation: spectral weight according to the Wolfe and Godsill approximation, given as a formula image in the original document and not reproduced here.]
The following relations are involved in the above equation:
[Equations: a posteriori SNR γk[m], decision-directed a priori SNR ξk[m], and Wiener weight GW,k[m], given as formula images in the original document and not reproduced here.]
The noise variance is given by the noise estimation algorithm; m and n are time indices, fs is the system sample rate, and L is a down-sampling factor.
According to Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, nr. 6, vol. ASSP-32, pp. 1109-1121, Dec. 1984, γk[m] is the a posteriori SNR, and ξk[m] is the a priori SNR. GW,k[m] is the spectral weight of a Wiener filter, and α is an averaging parameter defined by an averaging time-constant τsnr,k, which is either approximately 2 [ms] (F. Zotter, M. Noisternig, R. Höldrich, "Speech Enhancement Using the Ephraim and Malah Suppression Rule and Decision Directed Approach: A Hysteretic Process", to appear in IEEE Signal Processing Letters, 2005. First manuscript submitted Jan. 24, 2005) or derived from the auditory post-masking time-constants.
The "over-subtraction factor" p (cf. Zotter et al.) can be chosen as p = 10^(-15/10), and the noise-floor parameter ζ can be ζ = 10^(-40/10).
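The interplay of these quantities can be sketched with the standard decision-directed recursion. Note that we substitute the plain Wiener weight for the Wolfe and Godsill weight (whose formula is not reproduced above), so this is an illustrative stand-in rather than the rule used in the system:

```python
def dd_gain_track(gammas, a=0.98, xi_min=1e-4):
    """Decision-directed a priori SNR estimation (Ephraim/Malah style):
    xi[m] = a * G[m-1]^2 * gamma[m-1] + (1 - a) * max(gamma[m] - 1, 0),
    combined here with a Wiener weight G = xi / (1 + xi) as a stand-in."""
    gains, g_prev, gamma_prev = [], 1.0, 1.0
    for gamma in gammas:
        xi = a * (g_prev ** 2) * gamma_prev + (1.0 - a) * max(gamma - 1.0, 0.0)
        xi = max(xi, xi_min)          # noise floor on the a priori SNR
        g = xi / (1.0 + xi)           # Wiener weight (stand-in)
        gains.append(g)
        g_prev, gamma_prev = g, gamma
    return gains

# Noise-only frames (gamma ~ 1) drive the gain down to the floor; a speech
# onset (gamma = 100, i.e. 20 dB) pulls it back up within a few frames:
g = dd_gain_track([1.0] * 50 + [100.0] * 50)
print(round(g[49], 3), round(g[-1], 3))
```

The smoothing constant a plays the role of α above and is what trades musical-noise suppression against transient distortion.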
XV. LOW-LATENCY UP-SAMPLING
Usually, up-sampling introduces either a processing delay or a group delay due to the interpolation operation involved. Such a delay is approximately L samples long, L being the up-sampling factor.
We propose a special method for up-sampling that introduces no additional delay. This is possible if the signal is divided into buffers (preferably with the buffer size of the ADC and DAC). When in every signal block the last sample of the preceding block is given, it is possible to linearly interpolate towards the following given sample instantaneously. Therefore, the last sample in every block must correspond to a sampling instant at the lower sampling rate.
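A sketch of this block-wise, zero-additional-delay interpolation (function and variable names are ours):

```python
def upsample_block(block, last_prev, L):
    """Linear interpolation by factor L without look-ahead: each output
    segment ends exactly on the newest low-rate sample, interpolating
    from the last sample of the preceding block (last_prev)."""
    out, prev = [], last_prev
    for s in block:
        for i in range(1, L + 1):
            out.append(prev + (s - prev) * i / L)
        prev = s
    return out

# Streaming two consecutive low-rate blocks through the upsampler:
print(upsample_block([2.0, 4.0], last_prev=0.0, L=2))  # [1.0, 2.0, 3.0, 4.0]
print(upsample_block([5.0], last_prev=4.0, L=2))       # [4.5, 5.0]
```

Because each output block ends on the newest available low-rate sample, no future sample is needed and no extra L-sample delay is incurred.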
XVI. CONCLUSIONS
Frequency domain solutions using equivalent auditory models require delays in the range of 10 milliseconds, whereas the implementation of our system with 20 frequency bands and the third order TZGF has a mean latency of 3.5 up to 4 milliseconds. The required computational cost is approximately 8.9 MIPS at fs = 16 [kHz], which is only slightly more than DFT solutions need (7 MIPS). We also apply a slightly modified Ephraim and Malah suppression rule (EMSR) using the simplified Wolfe and Godsill formula and a modified decision directed approach.
The disclosure of all cited publications is included in its entirety in this description.

Claims

1. A method for suppressing noise in an input audio signal (y[n]) which comprises a wanted signal component (x[n]) and a noise signal component, the method comprising the steps of
- dividing the input audio signal (y[n]) into a plurality of frequency subbands (yk[n]) by means of an analysis band splitter,
- suppressing noise in each of the subbands (yk[n]) by a plurality of noise suppressing processors,
- recombining the plurality of subbands (yk[n]) into an output signal (x[n]) by means of a synthesis filter, all steps being performed in the time domain.
2. Method according to claim 1, characterized in that the dividing of the input audio signal into a plurality of subbands by means of the analysis band splitter is performed according to human auditory loudness perception.
3. Method according to claim 2, characterized in that the analysis band splitter comprises a Gammatone filter bank (GFB), preferably a nonuniform Gammatone filter bank.
4. Method according to any of claims 1 to 3, characterized in that a pre-processor (HOME) and a post-processor (H⁻¹OME) perform non-linear filtering of the input audio signal, comprising a. a pre-processing filter, which emulates the transfer behaviour of the human outer and middle ear, applied to the time-discrete noisy input audio signal, and b. a post-processing filter applied to the enhanced full-band signal to compensate the effect of the pre-processing filter.
5. Method according to any of claims 1 to 4, characterized in that each noise processor is comprised of a signal level detector (LD), a noise estimator (NE), an auditory masking filter (PM) and a subtraction processor.
6. Method according to claim 5, wherein said signal level detector (LD) exploits the phase of low-order filter sections to generate a quadrature signal and an in-phase signal out of the sub-band signal (yk[n]) and sums up the squared amplitudes of these signals.
7. Method according to claim 5, wherein said noise estimator generates a sub-band noise value by performing smoothing based on Minimum Statistics, more particularly weighted averaging of the previous noise value and the current input value with three different time constants is applied.
8. Method according to claim 5 or 6, wherein said auditory masking filter uses the signal power detected in each sub-channel to generate a temporal masking behaviour based on human auditory perception, more particularly non-linear weighted averaging of the previous signal value and the current sub-band input value is applied only on the falling slope depending on the level detected in each sub-band.
9. Method according to claims 1 to 8, wherein the update of the noise estimator depends on the current input value compared to time-varying, level dependent thresholds, i.e. if the current input value is greater than a predetermined threshold value, the current input value is not considered to be noise and said noise estimator is not updated.
10. Method according to claims 1 to 9, wherein the noise suppression in each of the subbands is performed using the Ephraim and Malah noise suppression rule (EMSR).
11. Method according to claims 1 to 10, wherein the noise suppression in each of the subbands is performed using the decision directed approach (DDA).
12. Apparatus for suppressing noise in an input audio signal (y[n]) which comprises a wanted signal component (x[n]) and a noise signal component, the apparatus comprising
- an analysis band splitter for dividing the input audio signal (y[n]) into a plurality of frequency subbands (yk[n]),
- a plurality of noise suppressing processors for suppressing noise in each of the subbands (yk[n]),
- a synthesis filter for recombining the plurality of subbands (yk[n]) into an output signal (x[n]), the analysis band splitter, the noise suppressing processors, and the synthesis filter working in the time domain.
13. Apparatus according to claim 12, characterized in that a level detector (LD) is provided in each of the subbands.
14. Apparatus according to claim 13, characterized in that said signal level detector (LD) exploits the phase of low-order filter sections to generate a quadrature signal and an in-phase signal out of the sub-band signal (yk[n]) and sums up the squared amplitudes of these signals.
15. Apparatus according to claim 14, characterized in that said quadrature signal is generated by a FIR first order section provided in the level detector (LD).
16. Apparatus according to claim 14, characterized in that said quadrature signal is generated by an IIR first order all-pass (AP) provided in the level detector (LD).
17. Apparatus according to claim 14, characterized in that said quadrature signal is generated by a delay line providing a λ/4 delay at the digital center frequency (θk).
18. Apparatus according to any of claims 12 to 17, characterized in that each noise processor is comprised of a signal level detector (LD), a noise estimator (NE), an auditory masking filter (PM) and a subtraction processor.
19. Apparatus according to any of claims 12 to 18, characterized in that the analysis band splitter comprises a Gammatone filter bank (GFB), preferably a nonuniform Gammatone filter bank.
20. Apparatus according to any of claims 12 to 19, characterized in that a pre-processor (HOME) and a post-processor (H⁻¹OME) are provided for performing non-linear filtering of the input audio signal, comprising a. a pre-processing filter, which emulates the transfer behaviour of the human outer and middle ear, applied to the time-discrete noisy input audio signal, and b. a post-processing filter applied to the enhanced full-band signal to compensate the effect of the pre-processing filter.
PCT/AT2007/000466 2007-10-02 2007-10-02 Method and device for low-latency auditory model-based single-channel speech enhancement Ceased WO2009043066A1 (en)

Publications (1)

Publication Number Publication Date
WO2009043066A1 2009-04-09

Family

ID=39447761

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AT2007/000466 Ceased WO2009043066A1 (en) 2007-10-02 2007-10-02 Method and device for low-latency auditory model-based single-channel speech enhancement

Country Status (4)

Country Link
AT (1) AT509570B1 (en)
DE (1) DE112007003674T5 (en)
GB (1) GB2465910B (en)
WO (1) WO2009043066A1 (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102157156A (en) * 2011-03-21 2011-08-17 清华大学 Single-channel voice enhancement method and system
EP2495724A1 (en) * 2011-02-17 2012-09-05 Siemens Medical Instruments Pte. Ltd. Device and method for estimating an interference noise
EP2747081A1 (en) * 2012-12-18 2014-06-25 Oticon A/s An audio processing device comprising artifact reduction
US9173025B2 (en) 2012-02-08 2015-10-27 Dolby Laboratories Licensing Corporation Combined suppression of noise, echo, and out-of-location signals
US10141003B2 (en) 2014-06-09 2018-11-27 Dolby Laboratories Licensing Corporation Noise level estimation
CN112151060A (en) * 2020-09-25 2020-12-29 展讯通信(天津)有限公司 Single-channel voice enhancement method and device, storage medium and terminal
US10939161B2 (en) 2019-01-31 2021-03-02 Vircion LLC System and method for low-latency communication over unreliable networks
WO2021128670A1 (en) * 2019-12-26 2021-07-01 紫光展锐(重庆)科技有限公司 Noise reduction method, device, electronic apparatus and readable storage medium
US12160709B2 (en) 2022-08-23 2024-12-03 Sonova Ag Systems and methods for selecting a sound processing delay scheme for a hearing device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110580910B (en) * 2018-06-08 2024-04-26 北京搜狗科技发展有限公司 Audio processing method, device, equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002011125A1 (en) * 2000-07-31 2002-02-07 Herterkom Gmbh Attenuation of background noise and echoes in audio signal
EP1600947A2 (en) * 2004-05-26 2005-11-30 Honda Research Institute Europe GmbH Subtractive cancellation of harmonic noise
WO2006114100A1 (en) * 2005-04-26 2006-11-02 Aalborg Universitet Estimation of signal from noisy observations
EP1729287A1 (en) * 1999-01-07 2006-12-06 Tellabs Operations, Inc. Method and apparatus for adaptively suppressing noise

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6052771A (en) 1998-01-20 2000-04-18 International Business Machines Corporation Microprocessor with pipeline synchronization
DE69932626T2 (en) 1998-11-13 2007-10-25 Bitwave Pte Ltd. SIGNAL PROCESSING DEVICE AND METHOD
US6377637B1 (en) * 2000-07-12 2002-04-23 Andrea Electronics Corporation Sub-band exponential smoothing noise canceling system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
AMIR HUSSAIN ET AL: "Nonlinear Adaptive Speech Enhancement Inspired by Early Auditory Processing", NONLINEAR SPEECH MODELING AND APPLICATIONS, LECTURE NOTES IN COMPUTER SCIENCE; LECTURE NOTES IN ARTIFICIAL INTELLIGENCE; LNCS, SPRINGER-VERLAG, BE, vol. 3445, 1 January 2005 (2005-01-01), pages 291 - 316, XP019012533, ISBN: 978-3-540-27441-4 *
JAN SKOGLUND ET AL: "On Time-Frequency Masking in Voiced Speech", IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, US, vol. 8, no. 4, 1 July 2000 (2000-07-01), XP011054031, ISSN: 1063-6676 *
JOHNSON ET AL: "Speech signal enhancement through adaptive wavelet thresholding", SPEECH COMMUNICATION, ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, vol. 49, no. 2, 15 February 2007 (2007-02-15), pages 123 - 133, XP005890520, ISSN: 0167-6393 *
KALLIRIS M G ET AL: "Broad-Band Acoustic Noise Reduction Using a Novel Frequency Depended Parametric Wiener Filter. Implementations using Filterbank, STFT and Wavelet Analysis/Synthesis Techniques.", AUDIO ENGINEERING SOCIETY (AES) CONVENTION, 12 May 2001 (2001-05-12) - 15 May 2001 (2001-05-15), Amsterdam, The Netherlands, pages 1 - 9, XP002499667 *
LIN L ET AL: "Speech denoising based on an auditory filterbank", SIGNAL PROCESSING, 2002 6TH INTERNATIONAL CONFERENCE ON AUG. 26-30, 2002, PISCATAWAY, NJ, USA,IEEE, vol. 1, 26 August 2002 (2002-08-26), pages 552 - 555, XP010628047, ISBN: 978-0-7803-7488-1 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2495724A1 (en) * 2011-02-17 2012-09-05 Siemens Medical Instruments Pte. Ltd. Device and method for estimating an interference noise
US8634581B2 (en) 2011-02-17 2014-01-21 Siemens Medical Instruments Pte. Ltd. Method and device for estimating interference noise, hearing device and hearing aid
CN102157156A (en) * 2011-03-21 2011-08-17 清华大学 Single-channel voice enhancement method and system
US9173025B2 (en) 2012-02-08 2015-10-27 Dolby Laboratories Licensing Corporation Combined suppression of noise, echo, and out-of-location signals
EP2747081A1 (en) * 2012-12-18 2014-06-25 Oticon A/s An audio processing device comprising artifact reduction
US10141003B2 (en) 2014-06-09 2018-11-27 Dolby Laboratories Licensing Corporation Noise level estimation
US10939161B2 (en) 2019-01-31 2021-03-02 Vircion LLC System and method for low-latency communication over unreliable networks
WO2021128670A1 (en) * 2019-12-26 2021-07-01 紫光展锐(重庆)科技有限公司 Noise reduction method, device, electronic apparatus and readable storage medium
US12260873B2 (en) 2019-12-26 2025-03-25 Unisoc (Chongqing) Technologies Co., Ltd. Method and apparatus of noise reduction, electronic device and readable storage medium
CN112151060A (en) * 2020-09-25 2020-12-29 展讯通信(天津)有限公司 Single-channel voice enhancement method and device, storage medium and terminal
CN112151060B (en) * 2020-09-25 2022-11-25 展讯通信(天津)有限公司 Single-channel voice enhancement method and device, storage medium and terminal
US12160709B2 (en) 2022-08-23 2024-12-03 Sonova Ag Systems and methods for selecting a sound processing delay scheme for a hearing device

Also Published As

Publication number Publication date
DE112007003674T5 (en) 2010-08-12
GB2465910B (en) 2012-02-15
GB201004090D0 (en) 2010-04-28
AT509570A5 (en) 2011-09-15
AT509570B1 (en) 2011-12-15
GB2465910A (en) 2010-06-09

Similar Documents

Publication Publication Date Title
US7313518B2 (en) Noise reduction method and device using two pass filtering
Martin Speech enhancement based on minimum mean-square error estimation and supergaussian priors
US8560320B2 (en) Speech enhancement employing a perceptual model
WO2009043066A1 (en) Method and device for low-latency auditory model-based single-channel speech enhancement
Soon et al. Speech enhancement using 2-D Fourier transform
JP5068653B2 (en) Method for processing a noisy speech signal and apparatus for performing the method
Abramson et al. Simultaneous detection and estimation approach for speech enhancement
Wu et al. Subband Kalman filtering for speech enhancement
Chen et al. Fundamentals of noise reduction
Mosayyebpour et al. Single-microphone early and late reverberation suppression in noisy speech
Soon et al. Wavelet for speech denoising
EP1995722B1 (en) Method for processing an acoustic input signal to provide an output signal with reduced noise
Diethorn Subband noise reduction methods for speech enhancement
Taşmaz et al. Speech enhancement based on undecimated wavelet packet-perceptual filterbanks and MMSE–STSA estimation in various noise environments
Sunnydayal et al. A survey on statistical based single channel speech enhancement techniques
Saleem et al. Machine learning approach for improving the intelligibility of noisy speech
Li et al. A block-based linear MMSE noise reduction with a high temporal resolution modeling of the speech excitation
WO2006114100A1 (en) Estimation of signal from noisy observations
Esch et al. Model-based speech enhancement exploiting temporal and spectral dependencies
Mwema et al. A spectral subtraction method for noise reduction in speech signals
Yong et al. Real time noise suppression in social settings comprising a mixture of non-stationary and transient noise
Dionelis On single-channel speech enhancement and on non-linear modulation-domain Kalman filtering
Tilp Single-channel noise reduction with pitch-adaptive post-filtering
Roy Single channel speech enhancement using Kalman filter

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 07815133; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase
    Ref document number: 1004090; Country of ref document: GB; Kind code of ref document: A
    Free format text: PCT FILING DATE = 20071002
WWE Wipo information: entry into national phase
    Ref document number: 1004090.5; Country of ref document: GB
WWE Wipo information: entry into national phase
    Ref document number: 1120070036745; Country of ref document: DE
ENP Entry into the national phase
    Ref document number: 95672007; Country of ref document: AT; Kind code of ref document: A
RET De translation (de og part 6b)
    Ref document number: 112007003674; Country of ref document: DE; Date of ref document: 20100812; Kind code of ref document: P
122 Ep: pct application non-entry in european phase
    Ref document number: 07815133; Country of ref document: EP; Kind code of ref document: A1