
WO2025014685A1 - Environmental noise compensation in teleconferencing - Google Patents


Info

Publication number
WO2025014685A1
WO2025014685A1 (PCT/US2024/036469)
Authority
WO
WIPO (PCT)
Prior art keywords
sii
current
target
noise
control system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/036469
Other languages
French (fr)
Inventor
Ning Wang
Shanush Prema Thasarathan
Byung Hoon Cho
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dolby Laboratories Licensing Corp
Original Assignee
Dolby Laboratories Licensing Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corp filed Critical Dolby Laboratories Licensing Corp
Publication of WO2025014685A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03G CONTROL OF AMPLIFICATION
    • H03G3/00 Gain control in amplifiers or frequency changers
    • H03G3/20 Automatic control
    • H03G3/30 Automatic control in amplifiers having semiconductor devices
    • H03G3/32 Automatic control in amplifiers having semiconductor devices the control being dependent upon ambient noise level or sound level
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/002 Applications of echo suppressors or cancellers in telephonic connections
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/22 Arrangements for supervision, monitoring or testing
    • H04M3/2236 Quality of speech transmission monitoring
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M9/00 Arrangements for interconnection not involving centralised switching
    • H04M9/08 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • H04M9/082 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic using echo cancellers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0364 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Definitions

  • At least some aspects of the present disclosure may be implemented via methods. Some such methods involve compensating for environmental noise during a teleconference. For example, some methods may involve estimating, by a control system, a current speech spectrum corresponding to speech of remote teleconference participants and estimating, by the control system, a current noise spectrum corresponding to environmental noise in a local environment in which a local teleconference participant is located. Some methods may involve calculating, by the control system, a current speech intelligibility index (SII) based, at least in part, on the current speech spectrum and the current noise spectrum. Some methods may involve determining, by the control system and based at least in part on the current SII, whether to make an adjustment of a local audio system used by the local teleconference participant. Some methods may involve updating, by the control system, at least one of the one or more target SII parameters responsive to user input corresponding to a playback volume change.
  • SII speech intelligibility index
  • the determining may involve evaluating the current SII according to one or more target SII parameters. In some examples, the determining may involve determining whether the current SII is within a target SII range. Some methods may involve adjusting, by the control system, at least a portion of the local audio system responsive to determining that the adjustment should be made.
  • Some methods may involve determining a confidence value corresponding to the current noise spectrum.
  • the confidence value may, for example, indicate a likelihood of a current input audio frame corresponding mainly to ambient noise.
  • Some methods may involve determining a band-based confidence value for each frequency band of a plurality of frequency bands.
  • determining whether to make the adjustment of the local audio system may involve determining whether to update one or more noise statistics for each frequency band based, at least in part, on the band-based confidence value.
  • Some methods may involve determining whether to update one or more noise statistics based, at least in part, on the confidence value.
  • estimating the current noise spectrum may involve estimating an echo coupling gain corresponding to local loudspeaker playback captured by a local microphone system.
  • estimating the echo coupling gain may involve determining a maximum band loudspeaker reference power for each frequency band of a current audio frame and previous N-1 audio frames, where N is an integer corresponding to a number of audio frames in a delay line.
  • estimating the echo coupling gain may involve tracking a minimum power for each frequency band of an input microphone signal and estimating an echo coupling gain for each frequency band of the input microphone signal based, at least in part, on the maximum band loudspeaker reference power and the minimum power.
  • Some methods may involve determining whether to update an estimated echo coupling gain based, at least in part, on whether a current estimated echo coupling gain value has changed more than a threshold amount from a most recent estimated echo coupling gain value, on whether an estimated echo coupling gain has been updated within a threshold time interval, or both.
  • Some methods may involve, responsive to determining that the adjustment should be made, adjusting at least a portion of the local audio system to maintain the current SII within a target SII range. According to some examples, maintaining the current SII within a target SII range may involve increasing a loudspeaker playback volume of one or more audio frames until the current SII is greater than a first target SII and less than a high target SII. The first target SII may, in some examples, be greater than a median target SII and less than the high target SII.
  • maintaining the current SII within a target SII range may involve decreasing a loudspeaker playback volume of one or more audio frames until the current SII is less than a second target SII and greater than a low target SII.
  • the second target SII may be less than a median target SII and greater than the low target SII.
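The target-range logic above (raise the volume until the SII passes a first target below the high target; lower it until the SII passes a second target above the low target) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the target values, step size, and state handling are assumptions.

```python
def step_volume(state, current_sii, volume, *,
                low=0.35, second=0.45, median=0.5, first=0.6, high=0.75,
                step=1.0):
    """One control step of hypothetical SII-based volume adjustment.

    state is 'idle', 'raising', or 'lowering'. Raising begins when the SII
    falls below the low target and stops once it exceeds the first target
    (chosen between the median and high targets); lowering is symmetric,
    stopping once the SII drops below the second target (between the low
    and median targets).
    """
    if state == 'idle':
        if current_sii < low:
            state = 'raising'      # speech is being masked by noise
        elif current_sii > high:
            state = 'lowering'     # playback louder than necessary
    if state == 'raising':
        if current_sii > first:
            state = 'idle'         # back inside the target range
        else:
            volume += step
    elif state == 'lowering':
        if current_sii < second:
            state = 'idle'
        else:
            volume -= step
    return state, volume
```

Because the first target lies above the median and the second below it, the loop settles near the middle of the range rather than oscillating at its edges.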
  • non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. Accordingly, some innovative aspects of the subject matter described in this disclosure can be implemented in a non-transitory medium having software stored thereon.
  • RAM random access memory
  • ROM read-only memory
  • an apparatus may include an interface system and a control system.
  • the control system may include one or more general purpose single- or multi-chip processors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gates or transistor logic, discrete hardware components, or combinations thereof.
  • DSPs digital signal processors
  • ASICs application specific integrated circuits
  • FPGAs field programmable gate arrays
  • the apparatus may be one of the above-referenced audio devices.
  • the apparatus may be another type of device, such as a mobile device, a laptop, a server, etc.
  • Figure 1A shows examples of sound sources that may be captured by a local microphone during a teleconference.
  • Figure 1B is a block diagram that shows examples of components of an apparatus capable of implementing various aspects of this disclosure.
  • Figure 2A shows example blocks of a novel noise estimator according to some examples.
  • Figure 2B shows blocks of the main noise estimator of Figure 2A according to some disclosed implementations.
  • Figure 3A shows blocks of the echo coupling gain estimator of Figure 2B according to some disclosed implementations.
  • Figure 3B shows a plot of estimated SII over time according to one example.
  • Figure 4 shows a plot of estimated SII over time according to another example.
  • Figure 5 shows another plot of estimated SII over time.
  • Figure 6 is a flow diagram that outlines another example of a method that may be performed by an apparatus or system such as those disclosed herein.
  • Like reference numbers and designations in the various drawings indicate like elements.
  • Figure 1A shows examples of sound sources that may be captured by a local microphone during a teleconference.
  • the local speech 102 of a local teleconference participant, the environmental noise 104 — also referred to herein as “background noise” or “ambient noise” — and the loudspeaker playback sounds 106 — also referred to herein as “echo” — from the loudspeaker 108 are captured by microphone 112.
  • the speech of remote teleconference participants — also referred to herein as “far-end speech” — is played back by the loudspeaker 108.
  • the term “remote teleconference participants” refers to teleconference participants who are in locations other than that of the local teleconference participant.
  • Figure 1A also shows an example of a noise estimator 114, examples of which are disclosed herein.
  • the noise estimator 114 may, for example, be implemented via an instance of the control system 110 that is described with reference to Figure 1B.
  • the speech of remote teleconference participants and the local speech 102 of a local teleconference participant are usually time divided. Instances of “double talk,” during which the far-end speech and local speech 102 overlap in time, are atypical; double talk generally happens only when teleconference participants want to interrupt. To prevent recaptured far-end speech from being transmitted back to the far end, echo management is normally employed in the signal chain, and the echo signal 106 is often mostly suppressed.
  • control system 110 may be configured for estimating a current speech spectrum corresponding to speech of remote teleconference participants.
  • the control system 110 may be configured for estimating a current noise spectrum corresponding to environmental noise in a local environment in which a local teleconference participant is located.
  • the control system 110 may be configured for calculating a current speech intelligibility index (SII) based, at least in part, on the current speech spectrum and the current noise spectrum.
  • SII current speech intelligibility index
  • the control system 110 may be configured for determining, based at least in part on the current SII, whether to make an adjustment of a local audio system used by the local teleconference participant.
  • the apparatus 101 may include the optional display system 135 shown in Figure 1B.
  • the optional display system 135 may include one or more displays, such as one or more light-emitting diode (LED) displays.
  • the optional display system 135 may include one or more organic light-emitting diode (OLED) displays.
  • the sensor system 130 may include a touch sensor system and/or a gesture sensor system proximate one or more displays of the display system 135.
  • the control system 110 may be configured for controlling the display system 135 to present a graphical user interface (GUI), such as a GUI related to implementing one of the methods disclosed herein.
  • GUI graphical user interface
  • Figure 1C is a flow diagram that outlines an example of a method that may be performed by an apparatus or system such as those disclosed herein.
  • the blocks of method 140, like those of other methods described herein, are not necessarily performed in the order indicated. In some implementations, one or more of the blocks of method 140 may be performed concurrently. Moreover, some implementations of method 140 may include more or fewer blocks than shown and/or described.
  • method 140 involves calculating the current speech intelligibility index (SII) and making an incremental change in the system — such as a playback volume change — if necessary, such that the system stays within a speech intelligibility index target.
  • the system may include a device that is configured to provide a teleconference.
  • the system may include a laptop, having a loudspeaker and a microphone, that is configured to provide a teleconference and is being used by a local participant during the teleconference.
  • the output of the loudspeaker during the teleconference is one example of what is referred to as the “speech of remote teleconference participants” in this document.
  • the audio captured by the local microphone will include the environmental noise 104 and the speech 102 of the local teleconference participant.
  • the speech 102 may be regarded as a type of interference.
  • block 160 involves calculating the current SII, which is the SII corresponding to the current frame of audio data from the microphone.
  • block 165 involves determining, based at least in part on the current SII, whether to make an adjustment of a local audio system used by the local teleconference participant.
  • block 150 may involve estimating the speech spectrum as described in Methods for Calculation of the Speech Intelligibility Index, which was originally published in 1969 and was revised in 1997 by the American National Standards Institute and the Acoustical Society of America (hereinafter ASA/ANSI S3.5-1997), which is hereby incorporated by reference.
  • block 150 may involve estimating the speech spectrum as described in the “Methods for determining input variables for SII calculation” sections on pages 11-13 of ASA/ANSI S3.5-1997.
  • block 150 may involve estimating the speech spectrum according to one or more other methods.
  • block 150 may involve estimating the speech spectrum according to a modified version of what is described in ASA/ANSI S3.5-1997. Some such examples involve using the “insertion gain” somewhat differently from what is described in ASA/ANSI S3.5-1997.
  • according to ASA/ANSI S3.5-1997, for an amplification or attenuation device worn by a listener, the insertion gain at a specific frequency is the difference in decibels between the pure-tone sound pressure level (SPL) at the eardrum with the amplification/attenuation device in place and the pure-tone SPL at the eardrum with the amplification/attenuation device removed.
  • SPL pure-tone sound pressure level
  • the insertion gain is used to calculate the equivalent speech spectrum level (see section 5.1.3, starting on page 12 of ASA/ANSI S3.5-1997) and is also used to calculate the equivalent noise spectrum level (see section 5.1.4, on page 13 of ASA/ANSI S3.5-1997). For example, according to ASA/ANSI S3.5-1997, if the speech spectrum is measured at the center of a listener’s head, the equivalent speech spectrum for a particular frequency band is calculated as the measured speech spectrum level for the particular frequency band plus the insertion gain for the particular frequency band.
  • Some disclosed examples involve mapping the equivalent SPL of the playback to this insertion gain.
  • the current hardware/software gain that is applied to the loudspeaker is converted to an equivalent SPL at the listener position. According to some examples, this conversion is aided by a tuning parameter during the tuning process for a particular device, such as a particular laptop model.
  • the nominal SPL associated with the “normal” speech spectrum is subtracted from this equivalent SPL. The resulting difference is used as the insertion gain, with the same value used for all frequency bands.
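As a sketch of the mapping just described: the current loudspeaker gain is converted to an equivalent SPL via a per-device tuning offset, and the nominal SPL of the “normal” speech spectrum is subtracted to give a single broadband insertion gain. The nominal level and the tuning offset below are illustrative assumptions, not values from the disclosure.

```python
# Assumed overall level of "normal" vocal-effort speech; verify against
# ASA/ANSI S3.5-1997 before relying on this value.
NOMINAL_SPEECH_SPL_DB = 62.35

def insertion_gain_db(playback_gain_db, device_tuning_offset_db):
    """Map the current hardware/software loudspeaker gain to the insertion
    gain used in the SII calculation; the same value is applied to every
    frequency band. device_tuning_offset_db is a hypothetical per-device
    tuning parameter converting electrical gain to an equivalent SPL at the
    listener position, measured during device tuning."""
    equivalent_spl_db = playback_gain_db + device_tuning_offset_db
    return equivalent_spl_db - NOMINAL_SPEECH_SPL_DB
```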
  • block 155 may involve calculating the current noise spectrum corresponding to the environmental noise 104 in the current frame of audio data from the microphone according to the methods described in ASA/ANSI S3.5-1997, for example on page 13. However, in some disclosed examples, block 155 may involve calculating the current noise spectrum according to alternative methods.
  • block 160 may involve determining, based on the current speech spectrum estimated by block 150 and the current noise spectrum estimated by block 155, the current SII as a value between 0 and 1, where 1 means very intelligible and 0 means not intelligible at all. In some examples, block 160 may involve determining the current SII based on the current speech spectrum, the current noise spectrum and a measured or assumed hearing threshold level.
  • the SII that is calculated in block 160 may be the SII metric described in ASA/ANSI S3.5-1997.
  • the SII metric may be calculated as described in the “Methods for calculating Speech Intelligibility Index” section on pages 9-11 of ASA/ANSI S3.5-1997, which is specifically incorporated herein by reference.
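For orientation, a heavily simplified version of the SII computation can be written as a band-importance-weighted sum of clipped signal-to-noise ratios. This sketch omits the standard's masking-spread, level-distortion, and hearing-threshold terms, so it is only the skeleton of the ASA/ANSI S3.5-1997 procedure, not the procedure itself.

```python
def simplified_sii(speech_db, noise_db, band_importance):
    """Toy SII: per-band SNR clipped to [-15, +15] dB, mapped linearly to a
    0..1 band audibility, then weighted by the band-importance function.
    Band importances should sum to 1 so the result lies in [0, 1], where 1
    means very intelligible and 0 means not intelligible at all."""
    sii = 0.0
    for s, n, w in zip(speech_db, noise_db, band_importance):
        snr = max(-15.0, min(15.0, s - n))   # clip to the audibility range
        sii += w * ((snr + 15.0) / 30.0)     # map [-15, 15] dB -> [0, 1]
    return sii
```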
  • the current SII may be determined via first-order smoothing, for example as sii[current] = Sii_alpha * sii[current-1] + (1 - Sii_alpha) * Instantaneous_sii, where:
  • sii[current] represents the current smoothed SII value
  • sii[current-1] represents the smoothed SII value calculated in the previous block
  • Sii_alpha represents a tuning parameter
  • Instantaneous_sii represents the immediate output of the metric described in ASA/ANSI S3.5-1997.
  • the tuning parameter Sii_alpha may be 2^(-0.02/0.15), where 0.02 represents the block length in seconds. Other examples may use larger or smaller tuning parameters.
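The smoothing step can be sketched directly from the parameters above; a 20 ms block and a 0.15 s time constant are used here as in the example.

```python
def smooth_sii(prev_sii, instantaneous_sii, block_s=0.02, time_constant_s=0.15):
    """First-order smoothing of the instantaneous SII.

    Sii_alpha = 2**(-block_length / time_constant), so the influence of an
    old SII value halves every time_constant seconds."""
    sii_alpha = 2.0 ** (-block_s / time_constant_s)
    return sii_alpha * prev_sii + (1.0 - sii_alpha) * instantaneous_sii
```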
  • Figure 2A shows example blocks of a novel noise estimator according to some examples.
  • the noise estimator 200 includes a first noise estimator 210 — also referred to herein as a main noise estimator 210 — a second noise estimator 220 — also referred to herein as an auxiliary noise estimator 220 — and a fast tracking flag module 230.
  • the main noise estimator 210, the auxiliary noise estimator 220 and the fast tracking flag module 230 are implemented by an instance of the control system 110 that is described with reference to Figure 1B.
  • the types and numbers of elements shown in Figure 2A are merely provided by way of example. Other implementations may include more elements, fewer elements, different types of elements, different arrangements of elements, or combinations thereof.
  • the main noise estimator 210 and the auxiliary noise estimator 220 are both configured to receive loudspeaker audio data 222 and microphone audio data 224, for example as described with reference to block 145 of Figure 1C.
  • the loudspeaker audio data 222 may be, or may include, a reference signal corresponding to what is being played back by a local loudspeaker.
  • the main noise estimator 210 is configured to determine and output a noise estimate 208 and a confidence metric 207 — also referred to herein as a “confidence value” — based at least in part on the loudspeaker audio data 222 and the microphone audio data 224.
  • the noise estimate 208 and the confidence metric 207 correspond to a current noise spectrum, which in turn corresponds to a current input audio frame of the microphone audio data 224.
  • the confidence metric 207 indicates a likelihood of the current input audio frame corresponding mainly to ambient noise, such as the environmental noise 104 that is described with reference to Figure 1A.
  • the main noise estimator 210 may be configured to determine the noise estimate 208, the confidence metric 207, or both, based at least in part on fast tracking flags 228 from the fast tracking flag module 230.
  • Example blocks and functionalities of the main noise estimator 210 are described in more detail below with reference to Figures 2B and 3A.
  • the auxiliary noise estimator 220 is configured to determine and output a noise estimate 226 based at least in part on the loudspeaker audio data 222 and the microphone audio data 224.
  • noise estimate 226 corresponds to the current input audio frame of the microphone audio data 224.
  • the auxiliary noise estimator 220 is configured to respond to noise level changes — and to produce a responsive noise estimate 226 — relatively faster than the main noise estimator 210, thereby increasing the response speed (decreasing the response time) of the noise estimator 200.
  • the fast tracking flag module 230 is configured to determine and output fast tracking flags 228, based at least in part on the noise estimates 226 and 208.
  • the main noise estimator 210 may be configured to converge to a new noise level relatively faster if the main noise estimator 210 has received a fast tracking flag 228.
  • upon determining that the difference between (1) the noise power estimated by the auxiliary noise estimator 220 and (2) the noise power estimated by the main noise estimator 210 is greater than a difference threshold for at least a first time threshold, the fast tracking flag module 230 is configured to set a value of the fast tracking flag 228 to a maximum value, for example to 1.
  • the fast tracking flag module 230 may be configured to set a value of the fast tracking flag 228 to a minimum value, for example to 0.
  • the first time threshold may or may not be equal to the second time threshold.
  • the power threshold may be in the range from 1dB to 5dB, e.g., 1dB, 2dB, 3dB, 4dB or 5dB.
  • the time constant may be in the range from 0.5 seconds to 2 seconds, e.g., 0.5 seconds, 0.6 seconds, 0.7 seconds, 0.8 seconds, 0.9 seconds, 1.0 seconds, 1.1 seconds, 1.2 seconds, 1.3 seconds, 1.4 seconds, 1.5 seconds, 1.6 seconds, 1.7 seconds, 1.8 seconds, 1.9 seconds or 2 seconds.
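The fast-tracking decision might be sketched as follows, with a 3dB threshold and a 1-second hold chosen from the ranges above; the counter-based hold logic is an assumption, not the disclosed implementation.

```python
class FastTrackingFlag:
    """Sets the flag to 1 when the auxiliary (fast) noise estimate differs
    from the main (slow) estimate by more than diff_db for roughly hold_s
    seconds, and clears it once the estimates agree again for as long."""

    def __init__(self, diff_db=3.0, hold_s=1.0, frame_s=0.02):
        self.diff_db = diff_db
        self.hold_frames = int(round(hold_s / frame_s))
        self.count = 0
        self.flag = 0

    def update(self, aux_noise_db, main_noise_db):
        if abs(aux_noise_db - main_noise_db) > self.diff_db:
            self.count = min(self.count + 1, self.hold_frames)
            if self.count >= self.hold_frames:
                self.flag = 1          # tell the main estimator to converge fast
        else:
            self.count = max(self.count - 1, 0)
            if self.count == 0:
                self.flag = 0          # estimates agree again
        return self.flag
```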
  • the auxiliary noise estimator 220 may be configured with different tuning parameters than those of the main noise estimator 210.
  • the auxiliary noise estimator 220 may be based on a deep neural network. Such noise estimators may have a very fast response.
  • Figure 2B shows blocks of the main noise estimator of Figure 2A according to some disclosed implementations.
  • the main noise estimator 210 includes a voice activity detector (VAD) 201, an echo coupling gain estimator 202, a confidence calculation block 203, and a noise and confidence statistics calculation block 204.
  • VAD voice activity detector
  • the VAD 201, the echo coupling gain estimator 202, the confidence calculation block 203, and the noise and confidence statistics calculation block 204 are implemented by an instance of the control system 110 that is described with reference to Figure 1B.
  • the types and numbers of elements shown in Figure 2B are merely provided by way of example. Other implementations may include more elements, fewer elements, different types of elements, different arrangements of elements, or combinations thereof.
  • the VAD 201 is configured to detect local speech activity based on the microphone audio data 224 and to send speech activity signals 211 to the echo coupling gain estimator 202, the confidence calculation block 203 and the noise and confidence statistics calculation block 204. According to some examples, only when the speech activity signals 211 indicate that there is no local speech activity are the echo coupling gain estimator 202, the confidence calculation block 203 and the noise and confidence statistics calculation block 204 active.
  • the echo coupling gain estimator 202 is configured to estimate, based on the microphone audio data 224 and the loudspeaker audio data 222, coupling gains from loudspeaker playback to microphone capturing for each of a plurality of frequency bands and to provide corresponding coupling gain estimations 205 to the confidence calculation block 203.
  • the echo coupling gain estimator 202 is configured to estimate the contribution of the echo 106 to the sounds detected by the microphone 112 and that are present in the microphone audio data 224 and to provide corresponding coupling gain estimations 205.
  • the loudspeaker audio data 222 is, or includes, reference signals corresponding to what is being played back by a local loudspeaker.
  • Figure 3A shows more details of the echo coupling gain estimator 202 according to some examples.
  • the confidence calculation block 203 is configured to calculate, for each frequency band, the likelihood that ambient noise is dominant in the current input audio frame of the microphone audio data 224.
  • the confidence calculation block 203 is configured to output confidence data 206 to the noise and confidence statistics calculation block 204.
  • the confidence data 206 includes at least a confidence value corresponding to the current noise spectrum.
  • the confidence value indicates a likelihood of a current input audio frame corresponding mainly to ambient noise.
  • the confidence data 206 also includes a binary value (e.g., either 0 or 1) that indicates whether the current input audio frame is likely to be primarily ambient noise and should be processed accordingly.
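The disclosure does not give a formula for the per-band confidence; one hypothetical form, combining distance above the tracked noise floor with margin above the estimated echo, is shown below. Both the functional form and the 6dB scale are assumptions for illustration only.

```python
import math

def band_confidence(mic_power_db, noise_floor_db, echo_power_db, scale_db=6.0):
    """Hypothetical per-band confidence that ambient noise dominates the
    current frame: near 1 when the microphone power sits at the tracked
    noise floor, reduced when the power rises above the floor (likely
    speech) or sits close to the estimated echo level."""
    excess = max(mic_power_db - noise_floor_db, 0.0)   # dB above noise floor
    conf = math.exp(-excess / scale_db)                # near floor -> ~1
    echo_margin = mic_power_db - echo_power_db         # dB above echo estimate
    if echo_margin < scale_db:                         # echo may dominate
        conf *= max(echo_margin, 0.0) / scale_db
    return conf
```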
  • the noise and confidence statistics calculation block 204 is configured to determine and output a noise estimate 208 and a confidence metric 207, based at least in part on the microphone audio data 224 and the confidence data 206.
  • the confidence metric 207 indicates a likelihood of the current input audio frame corresponding mainly to ambient noise, such as the environmental noise 104 that is described with reference to Figure 1A.
  • the noise and confidence statistics calculation block 204 may be configured to determine the noise estimate 208, the confidence metric 207, or both, based at least in part on fast tracking flags 228 (not shown in Figure 2B) from the fast tracking flag module 230 of Figure 2A.
  • fast tracking flags 228 not shown in Figure 2B
  • Figure 3A shows blocks of the echo coupling gain estimator of Figure 2B according to some disclosed implementations.
  • the echo coupling gain estimator 202 includes a delay line module 301, a minimum follower 302, a threshold detector 303, a maximum follower 304, an update control block 305 and a subtraction node 312.
  • the delay line module 301, the minimum follower 302, the threshold detector 303, the maximum follower 304, the update control block 305 and the subtraction node 312 are implemented by an instance of the control system 110 that is described with reference to Figure 1B.
  • the echo coupling gain estimator 202 only operates when there is no local speech activity, as determined by the VAD 201 (see Figure 2B).
  • the types and numbers of elements shown in Figure 3A are merely provided by way of example. Other implementations may include more elements, fewer elements, different types of elements, different arrangements of elements, or combinations thereof.
  • the delay line module 301 is a length N delay line and is configured to receive the loudspeaker audio data 222 — which is, or includes, loudspeaker reference signals corresponding to what is being played back by a local loudspeaker.
  • N may be in the range from 8 to 16 frames, e.g., 8 frames, 9 frames, 10 frames, 11 frames, 12 frames, 13 frames, 14 frames, 15 frames, 16 frames, etc.
  • each frame may be in the range from 10 milliseconds (ms) to 30 ms, e.g., 10 ms, 12 ms, 14 ms, 16 ms, 18 ms, 20 ms, 22 ms, 24 ms, 26 ms, 28 ms, 30 ms, etc.
  • the input loudspeaker reference signals are in the form of frequency banded loudspeaker reference power.
  • the delay line module 301 is configured to implement a maximum operation (MAX) across the entire delay line.
  • the purpose of letting the loudspeaker reference signals go through a delay line is to compensate for the natural delay between the loudspeaker reference signals and the corresponding microphone capture of the sounds played back by one or more local loudspeakers, in order to take into account the sound wave propagation delay, the electrical circuitry delay, the software buffering delay, etc., as well as the reverberant build-up that may be present in a local room.
  • the delay line module 301 outputs maximum power data 310, which indicates the maximum of the loudspeaker reference signal power for each frequency band for the current audio frame and for the previous N-1 audio frames.
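The delay-line maximum operation described above can be sketched as follows. This is an illustrative reconstruction in Python: the function and variable names are hypothetical, and it assumes the banded loudspeaker reference powers arrive one frame at a time.

```python
from collections import deque

def delay_line_max(band_powers, delay_line, n_frames=12):
    """Push the current frame's banded loudspeaker reference powers onto the
    delay line and return, per band, the maximum over the most recent
    n_frames frames (n_frames corresponds to N, e.g. 8 to 16 frames)."""
    delay_line.append(list(band_powers))
    while len(delay_line) > n_frames:
        delay_line.popleft()  # age out frames beyond the delay-line length
    # MAX across the entire delay line, independently for each frequency band
    return [max(frame[b] for frame in delay_line)
            for b in range(len(band_powers))]
```

Taking the maximum over the last N frames makes the reference power estimate robust to the unknown exact delay between loudspeaker playback and the corresponding microphone capture.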
  • the minimum follower 302 is configured to track the minimum power for each input frequency band for each frame of the microphone audio data 224 and to output minimum power data 311, which indicates the minima of input power for each frequency band of each frame of the microphone audio data 224.
  • the “B” indicates the number of frequency bands.
  • the time window size of the minimum follower 302 may be dynamically adjusted. For example, (referring again to Figure 2A) in some implementations, when a fast-tracking flag 228 is input to the main noise estimator 210, the minimum follower 302 will shorten its window size in order to help the main noise estimator 210 to converge to a solution faster. In one such example, the minimum follower 302 may shorten its window size from 1.0 seconds to 250 ms. Other examples may involve different starting window sizes, different shortened window sizes, or both.
  • the threshold detector 303 is a simple threshold detector that is configured to determine when the power of the current frame of the input microphone audio data 224 is above a tracked minimum power level by at least a threshold amount and to provide corresponding threshold detector output 313.
  • the tracked minimum power level is highly dependent on the sensitivity of the particular microphone. Therefore, the tracked minimum power level range may vary widely from microphone to microphone.
  • the threshold amount may be in the range from 3 dB to 10 dB, e.g., 3 dB, 4 dB, 5 dB, 6 dB, 7 dB, 8 dB, 9 dB or 10 dB.
  • the threshold detector output 313 of the threshold detector 303 has a value of 1 whenever the power of the current frame of the input microphone audio data 224 is above the tracked minimum by at least the threshold amount. In some such examples, the threshold detector output 313 has a value of 0 whenever the power of the current frame of the input microphone audio data 224 is not above the tracked minimum by at least the threshold amount.
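A minimal sketch of the minimum follower and threshold detector behavior described above, assuming powers expressed in dB; the function names are illustrative, and the 6 dB default is simply one value drawn from the 3 dB to 10 dB range mentioned above.

```python
def minimum_follower(tracked_min_db, current_power_db):
    """Track the per-band minimum microphone power (simplified: the dynamic
    time window of minimum follower 302 is omitted in this sketch)."""
    return min(tracked_min_db, current_power_db)

def threshold_detector(current_power_db, tracked_min_db, threshold_db=6.0):
    """Output 1 when the current frame's power is above the tracked minimum
    by at least threshold_db; otherwise output 0 (threshold detector 303)."""
    return 1 if current_power_db >= tracked_min_db + threshold_db else 0
```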
  • the subtraction node 312 is configured for subtracting (a) the input power of the current frame of input microphone audio data 224 from (b) the maximum power of the loudspeaker reference signals, as determined by the maximum power data 310 output by the delay line module 301, to produce the subtraction node output 315.
  • the subtraction node output 315 represents potential coupling gain estimations.
  • the maximum follower 304 is configured to determine and output coupling gain estimations 205 based on the subtraction node output 315 and update control signals 317 from the update control module 305.
  • the update control module 305 is configured to determine whether to disallow or allow the subtraction node output 315 to be output by the maximum follower 304 as the current coupling gain estimation 205. According to some such examples, the update control signals 317 control whether the subtraction node output 315 will be received by the maximum follower 304 at all. Even though — according to this example — the echo coupling gain estimator 202 only operates when there is no local speech activity, it is nonetheless possible that local transient noise may still be included in the microphone signals 224. Whenever there is some transient noise in a frequency band, the gain calculated by the subtraction node 312 will not be accurate and should be eliminated.
  • according to some examples, the update control module 305 may allow an update only when the new gain maximum is within a certain range — such as within 1 dB, 2 dB, 3 dB, 4 dB, etc. — of a previously-tracked maximum gain. In some implementations, this condition may be ignored for the first update if there is no measured initial value (see below);
  • the coupling gain is primarily a characteristic of the device being used to participate in the teleconference, such as the phone, the laptop, the conferencing endpoint, etc.
  • although the coupling gain can be affected by the acoustic properties of a room in which the device resides, it is mainly determined by the industrial design of the device itself. Once the product is manufactured, this coupling gain may be obtained and, in some examples, stored for future use as the initial value.
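The interaction of the subtraction node 312, the maximum follower 304 and the update control module 305 described above might be sketched as follows. This is a hedged reconstruction: the function name is hypothetical, the sign convention follows the subtraction order stated above (loudspeaker reference maximum minus microphone input power), and the 3 dB tolerance is one value from the 1 dB to 4 dB range mentioned earlier.

```python
def update_coupling_gain(mic_power_db, ref_max_db, prev_gain_db,
                         tol_db=3.0, first_update=False):
    # Subtraction node 312: loudspeaker reference maximum minus microphone
    # input power (both in dB).
    candidate_db = ref_max_db - mic_power_db
    if first_update:
        # No measured initial value yet: accept the first candidate.
        return candidate_db
    if abs(candidate_db - prev_gain_db) <= tol_db:
        # Maximum follower 304: track the largest plausible coupling gain.
        return max(prev_gain_db, candidate_db)
    # Update control 305: reject candidates likely corrupted by transient noise.
    return prev_gain_db
```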
  • the input to the noise and confidence statistics calculation block 204 includes the minimum power data 311 from the minimum follower 302.
  • the minimum power data 311 indicates the minima of input power for each frequency band of each frame of the microphone audio data 224.
  • the input to the noise and confidence statistics calculation block 204 also includes the microphone signal power Yb and the indication flag Ib, both of which correspond to frequency band b.
  • the noise and confidence statistics calculation block 204 only takes Yb into the accumulation if Ib = 1 and if the following condition is met: Yb < Ŷb + β
  • Ŷb represents the tracked minima of Yb and β represents a threshold, such as 8 dB, 9 dB, 10 dB, 11 dB, 12 dB, etc.
  • the confidence statistics are updated, taking Lb as input.
  • the output of the noise and confidence statistics calculation block 204 includes nb, the estimated ambient noise power for each band, and pb, the confidence value corresponding to the current ambient noise power estimation.
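The accumulation gating just described can be sketched as follows, assuming powers in dB; the names are illustrative, and the 10 dB default for β is one value from the 8 dB to 12 dB range mentioned above.

```python
def accumulate_noise(Y_b, Y_min_b, I_b, noise_sum, count, beta_db=10.0):
    """Accumulate the banded microphone power Y_b into the noise statistics
    only when the indication flag I_b equals 1 and Y_b is within beta_db of
    its tracked minimum Y_min_b; otherwise leave the statistics unchanged."""
    if I_b == 1 and Y_b < Y_min_b + beta_db:
        return noise_sum + Y_b, count + 1
    return noise_sum, count
```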
  • the confidence value p b may be used to generate a wideband confidence flag P to control the behaviour of noise compensation control logic, for example as follows:
  • wb represents a weighting factor, wherein Σwb = 1, and H represents a hysteresis function that outputs 0 or 1 according to two hysteresis curve thresholds.
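The wideband confidence flag computation P = H(Σ wb·pb) might be sketched as follows; the two hysteresis thresholds shown are assumed values for illustration only.

```python
def wideband_confidence(p_bands, w_bands, prev_flag, low_th=0.4, high_th=0.6):
    """Wideband confidence flag P = H(sum_b w_b * p_b), where H is a
    hysteresis function with two thresholds. w_bands is assumed to sum to 1."""
    s = sum(w * p for w, p in zip(w_bands, p_bands))
    if s >= high_th:
        return 1
    if s <= low_th:
        return 0
    return prev_flag  # inside the hysteresis band: keep the previous state
```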
  • After the control system has estimated the current SII, in some implementations the control system will implement a multi-step process in order to determine what the target SII is, whether to change the system and, if so, how to change the system to achieve this target SII. In some examples, this process corresponds to block 165 of Figure 1C. Accordingly, this section describes various actions that may be performed according to various implementations of block 165.
  • block 165 may calculate and apply the one or more changes to the system such that the SII calculated in block 160 of Figure 1C is within a target SII range.
  • Such changes may include, but are not limited to, the following:
  • Figure 3B shows a plot of estimated SII over time according to one example.
  • the estimated SII may, for example correspond to the output of block 160 of Figure 1C.
  • Figure 3B illustrates how SII can fluctuate over a period of time.
  • block 165 does not suggest any system change(s), because the SII remains within the high and low targets, which are shown in Figure 3B as “high_target_sii” and “low_target_sii.”
  • Figure 4 shows a plot of estimated SII over time according to another example.
  • Figure 4 shows what happens when the SII increases beyond the high_target_sii.
  • the SII value has breached the range, going beyond the high_target_sii.
  • the dec_aim_target_sii value may be the same as the target_sii or may differ from the target_sii based on knowledge of how the system will react and how the SII will tend to fluctuate. For example, in some systems it may be known that the SII may be underestimated compared to the true SII due to a slow noise estimate adaptation. In these cases, a dec_aim_target_sii that is slightly lower than the target_sii (but higher than the low_target_sii) may be preferable. Once the SII is between dec_aim_target_sii and low_target_sii (at t2), in this example the system state will go back to the behavior described with reference to Figure 3B. In one implementation, the dec_aim_target_sii is the same as the target_sii.
  • One example of a change to the system that may be implemented is to decrease the volume.
  • the decrease in volume could be specified such that it is proportional to the difference between the current SII and the SII being aimed for.
  • Gain_dec_delta represents a first tuning parameter, which in one example is 0.6, and Dec_speed represents a second tuning parameter, which in one example is 1.0.
  • Other examples may involve different tuning parameters.
  • the first tuning parameter may be 0.5, 0.55, 0.65, 0.7, etc.
  • the second tuning parameter may be 0.9, 0.95, etc.
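Because the exact decrease equation is not reproduced here, the following is only an assumed reconstruction consistent with the description above: a volume decrease proportional to how far the current SII sits above dec_aim_target_sii, scaled by the Gain_dec_delta and Dec_speed tuning parameters.

```python
def suggest_gain_decrease(current_sii, dec_aim_target_sii,
                          gain_dec_delta=0.6, dec_speed=1.0):
    """Suggested gain change (in dB) when the SII exceeds the aim target.
    The functional form is an assumption; the default tuning values follow
    the examples in the text (Gain_dec_delta = 0.6, Dec_speed = 1.0)."""
    excess = max(0.0, current_sii - dec_aim_target_sii)
    return -gain_dec_delta * excess * dec_speed  # negative dB => volume down
```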
  • the control system may be configured to optionally wait for a time interval, which may be measured in input audio frames, to allow the system to stabilize before checking the current SII and suggesting a new change.
  • the suggested change may or may not be implemented. For example, if a feature corresponding to one or more disclosed methods has been switched off or disabled — for example, according to user input — in some examples, the suggested change will not be implemented.
  • the amount of time that the control system is configured to wait between checking the current SII and suggesting a new change may be proportional to how close the SII is to the dec_aim_target_sii.
  • the control system may first calculate:
  • the control system may calculate a wait factor, for example as follows:
  • the number of input audio frames corresponding to the waiting time interval is equal to:
  • Wait_frames = gain_adj_holdon_frames * wait_factor
  • gain_adj_holdon_frames represents a tuning parameter.
  • in some implementations, this tuning parameter is 96 for a 20 ms block length. In other implementations, it may be 90, 92, 94, 98 or 100 for a 20 ms block length.
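The waiting interval Wait_frames = gain_adj_holdon_frames * wait_factor might be computed as sketched below. The wait_factor formula is not spelled out in the text, so the form shown here (longer waits as the SII approaches the aim target, reflecting reduced urgency) is an assumption.

```python
def wait_frames(current_sii, aim_sii, high_sii, gain_adj_holdon_frames=96):
    """Number of input audio frames to wait before re-checking the SII.
    Assumed form: the factor grows toward 1 as the SII nears aim_sii."""
    span = max(high_sii - aim_sii, 1e-9)  # guard against a zero-width range
    distance = min(abs(current_sii - aim_sii) / span, 1.0)  # 0 = at target
    wait_factor = 1.0 - distance  # closer to the aim target => longer wait
    return int(round(gain_adj_holdon_frames * wait_factor))
```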
  • the SII is within the high_target_sii and low_target_sii.
  • the SII is between high_target_sii and dec_aim_target_sii.
  • the control system again suggests a new change.
  • this suggested change occurs, and the control system waits.
  • block 165 will involve suggesting a system change.
  • Figure 5 shows another plot of estimated SII over time.
  • Figure 5 shows how the control system may implement block 165 when, and after, the SII falls below the low target SII according to some examples.
  • the inc_aim_target_sii value shown in Figure 5 may be the same as the target_sii or may differ based on knowledge of how the system will react and how the SII will tend to fluctuate. For example, in some systems it may be known that the SII may be overestimated compared to the true SII due to a slow noise estimate adaptation.
  • an inc_aim_target_sii that is slightly higher than the target_sii (but lower than the high_target_sii) may be preferable.
  • once the SII is between inc_aim_target_sii and high_target_sii (at t2), in some examples the system state will go back to the behavior described with reference to Figure 3B.
  • the inc_aim_target_sii may be the same as the target_sii.
  • One example of a change to the system that the control system may suggest is to increase the volume.
  • the increase in volume may be specified such that it is proportional to the difference between the current SII and a target SII.
  • Gain_inc_delta represents a tuning parameter. In one implementation this tuning parameter is 0.8, but in alternative implementations this tuning parameter may be 0.7, 0.75, 0.85, 0.9, etc. In the foregoing equation, Inc_speed represents another tuning parameter. In one implementation this tuning parameter is 1.0, but in alternative implementations this tuning parameter may be 0.9, 0.95, 1.05, 1.1, etc.
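The increase rule mirrors the decrease case. As before, the exact equation is not reproduced here, so this is an assumed reconstruction: a volume boost proportional to how far the current SII has fallen below inc_aim_target_sii, scaled by the Gain_inc_delta and Inc_speed tuning parameters.

```python
def suggest_gain_increase(current_sii, inc_aim_target_sii,
                          gain_inc_delta=0.8, inc_speed=1.0):
    """Suggested gain change (in dB) when the SII falls below the aim target.
    The functional form is an assumption; the default tuning values follow
    the examples in the text (Gain_inc_delta = 0.8, Inc_speed = 1.0)."""
    deficit = max(0.0, inc_aim_target_sii - current_sii)
    return gain_inc_delta * deficit * inc_speed  # positive dB => volume up
```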
  • the control system may optionally wait a few frames to allow the system to stabilize before checking the current SII and suggesting a new change.
  • the amount of time the control system waits between suggestions may be proportional to how close the SII is to the inc_aim_target_sii. In one implementation, the wait time may be determined as follows:
  • the control system may determine that the following occurs:
  • the SII is within the high_target_sii and low_target_sii.
  • the SII is between low_target_sii and inc_aim_target_sii.
  • the control system again suggests a new change.
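The overall range check that drives the behaviors of Figures 3B through 5 can be summarized by a simple classification step; the function and label names below are illustrative only.

```python
def classify_sii(current_sii, low_target_sii, high_target_sii):
    """Decide whether the current SII is within the target range, too high
    (suggest a volume decrease), or too low (suggest a volume increase)."""
    if current_sii > high_target_sii:
        return "decrease_volume"
    if current_sii < low_target_sii:
        return "increase_volume"
    return "in_range"
```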
  • the control system may start the initialization process by determining the average SII over a time interval, which may correspond to a number of audio frames or blocks. (As used herein, the terms “audio frame” and “audio block” have the same meaning.) This period of blocks may be considered as a tuning parameter, referred to herein as “sii_max_init_counter.” In one example, this tuning parameter may be 50 blocks, whereas in other examples this tuning parameter may be 40 blocks, 45 blocks, 55 blocks, 60 blocks, etc.
  • the control system will set this average SII to be the target_sii.
  • target_sii_range represents a tuning parameter.
  • this tuning parameter may be 0.15, whereas in other examples this tuning parameter may be 0.1, 0.2, etc.
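The initialization just described (averaging the estimated SII over sii_max_init_counter blocks, ignoring the first unstable blocks, and deriving the high and low targets from target_sii_range) can be sketched as follows; the dictionary layout and names are illustrative.

```python
def init_targets(sii_history, sii_max_init_counter=50, target_sii_range=0.15,
                 skip_blocks=20):
    """Average the estimated SII over sii_max_init_counter blocks, skipping
    the first skip_blocks (where the estimate may still be ramping up), set
    the average as target_sii, and derive the high/low targets from it."""
    usable = sii_history[skip_blocks:skip_blocks + sii_max_init_counter]
    target_sii = sum(usable) / len(usable)
    return {
        "target_sii": target_sii,
        "high_target_sii": target_sii + target_sii_range,
        "low_target_sii": target_sii - target_sii_range,
    }
```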
  • the control system may determine whether the energy is significant by checking whether the following condition is met: mono_ref_level > ref_th_alpha * ref_th
  • mono_ref_level represents the energy of the loudspeaker signal for the current block, in dB
  • ref_th_alpha represents a tuning parameter.
  • this tuning parameter may be 4.0, whereas in other examples this tuning parameter may be 3.0, 3.5, 4.5, 5.0, etc.
  • ref_th represents a tuning parameter.
  • this tuning parameter may be -30, whereas in other examples this tuning parameter may be -20, -25, -35, -40, etc.
  • the control system may determine the current volume of the system and then set the current volume as the minimum volume. This current volume may, for example, be stored as “usr_ref_gain.”
  • the control system may determine whether to implement this suggestion by ensuring that implementing this suggestion would not cause the volume to go below the minimum volume.
  • the estimated SII may ramp up and may be invalid for the first few blocks of the audio session. This phenomenon may be caused by the instability of the speech level or the noise estimate during the first few blocks of the audio session.
  • the control system may ensure that the SII averaging process does not take into account the first few blocks of the audio session. In some such implementations, the control system may ignore the first 20 blocks (assuming 20ms block length), the first 22 blocks, the first 24 blocks, the first 26 blocks, the first 28 blocks, the first 30 blocks, etc.
  • the user may attempt to change the volume manually. If this occurs, the control system may optionally take the user’s attempted volume change into account. In some such examples, the control system may interpret the user’s selected volume as a new minimum volume (a new usr_ref_gain) and a new target_sii. According to some implementations, when the user interacts with the volume gain of the system, the control system may restart the initialization phase.
  • Figure 6 is a flow diagram that outlines another example of a method that may be performed by an apparatus or system such as those disclosed herein.
  • the blocks of method 600, like other methods described herein, are not necessarily performed in the order indicated. In some implementations, one or more of the blocks of method 600 may be performed concurrently. Moreover, some implementations of method 600 may include more or fewer blocks than shown and/or described.
  • the blocks of method 600 may be performed by one or more devices, which may be (or may include) a control system such as the control system 110 that is shown in Figure 1B and described above.
  • method 600 is a method of compensating for environmental noise during a teleconference.
  • block 605 involves estimating, by a control system, a current speech spectrum corresponding to speech of remote teleconference participants.
  • block 605 may correspond to block 150 of Figure 1C and may be performed according to the descriptions herein of block 150.
  • block 610 involves estimating, by the control system, a current noise spectrum corresponding to environmental noise in a local environment in which a local teleconference participant is located.
  • block 610 may correspond to block 155 of Figure 1C and may be performed according to the descriptions herein of block 155.
  • block 615 involves calculating, by the control system, a current speech intelligibility index (SII) based, at least in part, on the current speech spectrum and the current noise spectrum.
  • block 615 may correspond to block 160 of Figure 1C and may be performed according to the descriptions herein of block 160.
  • block 620 involves determining, by the control system and based at least in part on the current SII, whether to make an adjustment of a local audio system used by the local teleconference participant.
  • the determining may involve evaluating the current SII according to one or more target SII parameters.
  • the determining may involve determining whether the current SII is within a target SII range.
  • method 600 may involve, responsive to determining that the adjustment should be made, adjusting, by the control system, at least a portion of the local audio system.
  • block 620 may correspond to block 165 of Figure 1C and may be performed according to the descriptions herein of block 165. In some such examples, block 620 may involve one or more of the example implementations of block 165 that are described with reference to Figures 3B-5.
  • block 625 involves updating, by the control system, at least one of the one or more target SII parameters responsive to user input corresponding to a playback volume change.
  • method 600 may involve determining a confidence value corresponding to the current noise spectrum.
  • the confidence value may indicate the likelihood of a current input audio frame corresponding mainly to ambient noise.
  • the confidence value may correspond with the confidence metric 207 that is described herein with reference to Figures 2 A and 2B.
  • method 600 may involve determining whether to update one or more noise statistics based, at least in part, on the confidence value.
  • estimating the current noise spectrum may involve estimating an echo coupling gain corresponding to local loudspeaker playback captured by a local microphone system.
  • the echo coupling gain may be estimated by the echo coupling gain estimator 202 of Figures 2B and 3A.
  • estimating the echo coupling gain may involve determining a maximum band loudspeaker reference power for each frequency band of a current audio frame and previous N-1 audio frames, where N is an integer corresponding to a number of audio frames in a delay line.
  • estimating the echo coupling gain may involve tracking a minimum power for each frequency band of an input microphone signal.
  • estimating the echo coupling gain may involve estimating an echo coupling gain for each frequency band of the input microphone signal based, at least in part, on the maximum band loudspeaker reference power and the minimum power. Some disclosed examples involve determining whether to update an estimated echo coupling gain based, at least in part, on whether a current estimated echo coupling gain value has changed more than a threshold amount from a most recent estimated echo coupling gain value, on whether an estimated echo coupling gain has been updated within a threshold time interval, or on combinations thereof. In some such examples the threshold time interval may be measured in audio frames or audio blocks.
  • method 600 may involve adjusting, by the control system and responsive to determining that the adjustment should be made, at least a portion of the local audio system to maintain the current SII within a target SII range.
  • maintaining the current SII within a target SII range may involve increasing a loudspeaker playback volume of one or more audio frames until the current SII is greater than a first target SII and less than a high target SII.
  • the first target SII may be greater than a median target SII and less than the high target SII.
  • maintaining the current SII within a target SII range may involve decreasing a loudspeaker playback volume of one or more audio frames until the current SII is less than a second target SII and greater than a low target SII.
  • the second target SII may be less than a median target SII and greater than the low target SII.
  • EEE1 A method of compensating for environmental noise during a teleconference, the method comprising: estimating, by a control system, a current speech spectrum corresponding to speech of remote teleconference participants; estimating, by the control system, a current noise spectrum corresponding to environmental noise in a local environment in which a local teleconference participant is located; calculating, by the control system, a current speech intelligibility index (SII) based, at least in part, on the current speech spectrum and the current noise spectrum; determining a confidence value corresponding to the current noise spectrum, the confidence value indicating a likelihood of a current input audio frame corresponding mainly to ambient noise; and determining, by the control system and based at least in part on the current SII and the confidence value, whether to make an adjustment of a local audio system used by the local teleconference participant, wherein the determining involves evaluating the current SII according to one or more target SII parameters.
  • EEE2 The method of EEE1, wherein the determining involves determining whether the current SII is within a target SII range.
  • EEE3 The method of EEE1 or EEE2, further comprising, responsive to determining that the adjustment should be made, adjusting, by the control system, at least a portion of the local audio system.
  • EEE4 The method of any one of EEEs 1-3, further comprising updating at least one of the one or more target SII parameters responsive to user input corresponding to a playback volume change.
  • EEE5. The method of EEE1, wherein the confidence value is a broadband confidence value and wherein determining whether to make the adjustment of the local audio system is based, at least in part, on the broadband confidence value.
  • EEE6 The method of EEE4 or EEE5, further comprising determining whether to update one or more noise statistics based, at least in part, on the confidence value.
  • EEE7 The method of EEE1, further comprising determining a bandbased confidence value for each frequency band of a plurality of frequency bands and wherein determining whether to make the adjustment of the local audio system involves determining whether to update one or more noise statistics for each frequency band based, at least in part, on the band-based confidence value.
  • EEE8 The method of any one of EEEs 1-7, wherein estimating the current noise spectrum involves estimating an echo coupling gain corresponding to local loudspeaker playback captured by a local microphone system.
  • EEE9 The method of EEE8, wherein estimating the echo coupling gain involves: determining a maximum band loudspeaker reference power for each frequency band of a current audio frame and previous N-1 audio frames, where N is an integer corresponding to a number of audio frames in a delay line; tracking a minimum power for each frequency band of an input microphone signal; and estimating an echo coupling gain for each frequency band of the input microphone signal based, at least in part, on the maximum band loudspeaker reference power and the minimum power.
  • EEE10 The method of EEE9, further comprising determining whether to update an estimated echo coupling gain based, at least in part, on whether a current estimated echo coupling gain value has changed more than a threshold amount from a most recent estimated echo coupling gain value, on whether an estimated echo coupling gain has been updated within a threshold time interval, or on combinations thereof.
  • EEE11 The method of any one of EEEs 1-10, further comprising, responsive to determining that the adjustment should be made, adjusting, by the control system, at least a portion of the local audio system to maintain the current SII within a target SII range.
  • EEE12 The method of EEE11, wherein maintaining the current SII within a target SII range involves increasing a loudspeaker playback volume of one or more audio frames until the current SII is greater than a first target SII and less than a high target SII.
  • EEE13 The method of EEE12, wherein the first target SII is greater than a median target SII and less than the high target SII.
  • EEE14 The method of EEE12 or EEE13, wherein maintaining the current SII within a target SII range involves decreasing a loudspeaker playback volume of one or more audio frames until the current SII is less than a second target SII and greater than a low target SII.
  • EEE15 The method of EEE14, wherein the second target SII is less than a median target SII and greater than the low target SII.
  • EEE16 An apparatus configured to perform the method of any one of EEEs 1-15.
  • EEE17 A system configured to perform the method of any one of EEEs 1-15.
  • EEE18 One or more non-transitory media having instructions stored thereon for controlling one or more devices to perform the method of any one of EEEs 1-15.
  • Some aspects of the present disclosure include a system or device configured (e.g., programmed) to perform one or more examples of the disclosed methods, and a tangible computer readable medium (e.g., a disc) which stores code for implementing one or more examples of the disclosed methods or steps thereof.
  • some disclosed systems can be or include a programmable general purpose processor, digital signal processor, or microprocessor, programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including an embodiment of disclosed methods or steps thereof.
  • Such a general purpose processor may be or include a computer system including an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform one or more examples of the disclosed methods (or steps thereof) in response to data asserted thereto.
  • Some embodiments may be implemented as a configurable (e.g., programmable) digital signal processor (DSP) that is configured (e.g., programmed and otherwise configured) to perform required processing on audio signal(s), including performance of one or more examples of the disclosed methods.
  • embodiments of the disclosed systems may be implemented as a general purpose processor (e.g., a personal computer (PC) or other computer system or microprocessor, which may include an input device and a memory) which is programmed with software or firmware and/or otherwise configured to perform any of a variety of operations including one or more examples of the disclosed methods.
  • elements of some embodiments of the inventive system are implemented as a general purpose processor or DSP configured (e.g., programmed) to perform one or more examples of the disclosed methods, and the system also includes other elements (e.g., one or more loudspeakers and/or one or more microphones).
  • a general purpose processor configured to perform one or more examples of the disclosed methods may be coupled to an input device (e.g., a mouse and/or a keyboard), a memory, and a display device.
  • Another aspect of the present disclosure is a computer readable medium (for example, a disc or other tangible storage medium) which stores code for performing (e.g., coder executable to perform) one or more examples of the disclosed methods or steps thereof.


Abstract

A method of compensating for environmental noise during a teleconference may involve: estimating, by a control system, a current speech spectrum corresponding to speech of remote teleconference participants; estimating, by the control system, a current noise spectrum corresponding to environmental noise in a local environment in which a local teleconference participant is located; calculating, by the control system, a current speech intelligibility index (SII) based, at least in part, on the current speech spectrum and the current noise spectrum; determining, by the control system and based at least in part on the current SII, whether to make an adjustment of a local audio system used by the local teleconference participant, wherein the determining involves evaluating the current SII according to one or more target SII parameters; and updating at least one of the one or more target SII parameters responsive to user input corresponding to a playback volume change.

Description

ENVIRONMENTAL NOISE COMPENSATION IN TELECONFERENCING
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] The present application claims priority to U.S. Provisional Patent Application No. 63/512,424, filed on July 7, 2023, and U.S. Provisional Patent Application No. 63/635,570, filed on April 17, 2024, both of which are incorporated by reference in their entirety.
TECHNICAL FIELD
[0002] This disclosure pertains to devices, systems and methods for environmental noise compensation (ENC), particularly ENC in the teleconferencing context. As used herein, the term “teleconferencing” encompasses both audio/ video teleconferencing and audio teleconferencing.
BACKGROUND
[0003] Teleconferencing has become an important part of modern life. The ability to communicate clearly while teleconferencing is based mainly on speech intelligibility, which in turn is based in part on the presence or absence of noise in the audio signal(s). Although existing devices, systems and methods for speech intelligibility estimation, noise estimation and ENC provide benefits, improved systems and methods would be desirable.
SUMMARY
[0004] At least some aspects of the present disclosure may be implemented via methods. Some such methods involve compensating for environmental noise during a teleconference. For example, some methods may involve estimating, by a control system, a current speech spectrum corresponding to speech of remote teleconference participants and estimating, by the control system, a current noise spectrum corresponding to environmental noise in a local environment in which a local teleconference participant is located. Some methods may involve calculating, by the control system, a current speech intelligibility index (SII) based, at least in part, on the current speech spectrum and the current noise spectrum. Some methods may involve determining, by the control system and based at least in part on the current SII, whether to make an adjustment of a local audio system used by the local teleconference participant. Some methods may involve updating, by the control system, at least one of the one or more target SII parameters responsive to user input corresponding to a playback volume change.
[0005] The determining may involve evaluating the current SII according to one or more target SII parameters. In some examples, the determining may involve determining whether the current SII is within a target SII range. Some methods may involve adjusting, by the control system, at least a portion of the local audio system responsive to determining that the adjustment should be made.
[0006] Some methods may involve determining a confidence value corresponding to the current noise spectrum. The confidence value may, for example, indicate a likelihood of a current input audio frame corresponding mainly to ambient noise. According to some examples, the confidence value may be a broadband confidence value. Determining whether to make the adjustment of the local audio system may be based, at least in part, on the broadband confidence value. Some methods may involve determining a band-based confidence value for each frequency band of a plurality of frequency bands. In some examples, determining whether to make the adjustment of the local audio system may involve determining whether to update one or more noise statistics for each frequency band based, at least in part, on the band-based confidence value. Some methods may involve determining whether to update one or more noise statistics based, at least in part, on the confidence value.
[0007] In some examples, estimating the current noise spectrum may involve estimating an echo coupling gain corresponding to local loudspeaker playback captured by a local microphone system. In some such examples, estimating the echo coupling gain may involve determining a maximum band loudspeaker reference power for each frequency band of a current audio frame and previous N-1 audio frames, where N is an integer corresponding to a number of audio frames in a delay line. In some such examples, estimating the echo coupling gain may involve tracking a minimum power for each frequency band of an input microphone signal and estimating an echo coupling gain for each frequency band of the input microphone signal based, at least in part, on the maximum band loudspeaker reference power and the minimum power. Some methods may involve determining whether to update an estimated echo coupling gain based, at least in part, on whether a current estimated echo coupling gain value has changed more than a threshold amount from a most recent estimated echo coupling gain value, on whether an estimated echo coupling gain has been updated within a threshold time interval, or both.
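The echo coupling gain estimation steps above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the class name, the delay-line length, and the power floor are assumptions, and the update-gating logic (change threshold and update interval) is omitted.

```python
# Hypothetical sketch of the per-band echo coupling gain estimate.
# All names and constants are illustrative assumptions.
from collections import deque

class EchoCouplingGainEstimator:
    def __init__(self, num_bands, delay_frames=10, floor=1e-12):
        self.num_bands = num_bands
        self.ref_history = deque(maxlen=delay_frames)  # N-frame delay line
        self.min_mic_power = [float("inf")] * num_bands
        self.gains = [0.0] * num_bands
        self.floor = floor

    def update(self, ref_band_power, mic_band_power):
        self.ref_history.append(list(ref_band_power))
        for b in range(self.num_bands):
            # Maximum loudspeaker reference power over the current and
            # previous N-1 frames, per band.
            max_ref = max(frame[b] for frame in self.ref_history)
            # Track the minimum microphone power per band.
            self.min_mic_power[b] = min(self.min_mic_power[b],
                                        mic_band_power[b])
            # Coupling gain estimate: minimum mic power over maximum
            # reference power, with a floor to avoid division by zero.
            self.gains[b] = self.min_mic_power[b] / max(max_ref, self.floor)
        return self.gains
```

In practice the returned gains would additionally be gated, as described above, by how much they changed and how recently they were last updated.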
[0008] Some methods may involve, responsive to determining that the adjustment should be made, adjusting at least a portion of the local audio system to maintain the current SII within a target SII range. According to some examples, maintaining the current SII within a target SII range may involve increasing a loudspeaker playback volume of one or more audio frames until the current SII is greater than a first target SII and less than a high target SII. The first target SII may, in some examples, be greater than a median target SII and less than the high target SII. In some examples, maintaining the current SII within a target SII range may involve decreasing a loudspeaker playback volume of one or more audio frames until the current SII is less than a second target SII and greater than a low target SII. According to some examples, the second target SII may be less than a median target SII and greater than the low target SII.
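One way to realize the target SII range described above is a simple hysteresis scheme: adjustment begins when the SII leaves the [low, high] range and stops once the SII passes the inner (first or second) target. The specific target values and the 1 dB per-frame step below are illustrative assumptions.

```python
# Hysteresis-style volume adjustment sketch; targets and step are assumed.
LOW_TARGET, SECOND_TARGET, MEDIAN_TARGET, FIRST_TARGET, HIGH_TARGET = (
    0.35, 0.45, 0.5, 0.55, 0.65)
STEP_DB = 1.0  # incremental gain change per audio frame

def volume_adjustment(current_sii, raising, lowering):
    """Return (gain_change_db, raising, lowering) for one audio frame."""
    if current_sii < LOW_TARGET:
        raising, lowering = True, False
    elif current_sii > HIGH_TARGET:
        raising, lowering = False, True
    if raising:
        # Keep increasing playback volume until the SII exceeds the first
        # target (which lies between the median and high targets).
        if current_sii > FIRST_TARGET:
            raising = False
        else:
            return STEP_DB, raising, lowering
    if lowering:
        # Keep decreasing playback volume until the SII drops below the
        # second target (between the low and median targets).
        if current_sii < SECOND_TARGET:
            lowering = False
        else:
            return -STEP_DB, raising, lowering
    return 0.0, raising, lowering
```

Stopping at an inner target rather than at the range boundary keeps the system from toggling between raising and lowering on small SII fluctuations.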
[0009] Some or all of the operations, functions and/or methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. Accordingly, some innovative aspects of the subject matter described in this disclosure can be implemented in a non-transitory medium having software stored thereon.
[0010] At least some aspects of the present disclosure may be implemented via apparatus. For example, one or more devices may be capable of performing, at least in part, the methods disclosed herein. In some implementations, an apparatus may include an interface system and a control system. The control system may include one or more general purpose single- or multi-chip processors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gates or transistor logic, discrete hardware components, or combinations thereof. In some examples, the apparatus may be one of the above-referenced audio devices. However, in some implementations the apparatus may be another type of device, such as a mobile device, a laptop, a server, etc.
[0011] Details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] Figure 1A shows examples of sound sources that may be captured by a local microphone during a teleconference.
[0013] Figure 1B is a block diagram that shows examples of components of an apparatus capable of implementing various aspects of this disclosure.
[0014] Figure 1C is a flow diagram that outlines an example of a method that may be performed by an apparatus or system such as those disclosed herein.
[0015] Figure 2A shows example blocks of a novel noise estimator according to some examples.
[0016] Figure 2B shows blocks of the main noise estimator of Figure 2A according to some disclosed implementations.
[0017] Figure 3A shows blocks of the echo coupling gain estimator of Figure 2B according to some disclosed implementations.
[0018] Figure 3B shows a plot of estimated SII over time according to one example.
[0019] Figure 4 shows a plot of estimated SII over time according to another example.
[0020] Figure 5 shows another plot of estimated SII over time.
[0021] Figure 6 is a flow diagram that outlines another example of a method that may be performed by an apparatus or system such as those disclosed herein.

[0022] Like reference numbers and designations in the various drawings indicate like elements.
DETAILED DESCRIPTION
[0023] Figure 1A shows examples of sound sources that may be captured by a local microphone during a teleconference. In these examples, the local speech 102 of a local teleconference participant, the environmental noise 104 — also referred to herein as “background noise” or “ambient noise” — and the loudspeaker playback sounds 106 — also referred to herein as “echo” — from the loudspeaker 108 are captured by microphone 112. During a teleconference, the speech of remote teleconference participants — also referred to herein as “far-end speech” — is played back by the loudspeaker 108. The term “remote teleconference participants” refers to teleconference participants who are in locations other than that of the local teleconference participant.
[0024] Figure 1A also shows an example of a noise estimator 114, examples of which are disclosed herein. The noise estimator 114 may, for example, be implemented via an instance of the control system 110 that is described with reference to Figure 1B.
[0025] The speech of remote teleconference participants and the local speech 102 of a local teleconference participant are usually time divided. Instances of “double talk,” during which the far-end speech and local speech 102 overlap in time, are atypical. Most of the time, double talk only happens when teleconference participants want to interrupt. To prevent recaptured far-end speech from being transmitted back to the far end, echo management is normally employed in the signal chain, and the echo signal 106 is often mostly suppressed.
[0026] Reliable noise estimation and noise reduction techniques can help meeting participants to better understand what is being said by other teleconference participants. Various improved noise estimation and noise reduction techniques are disclosed herein.
[0027] Figure 1B is a block diagram that shows examples of components of an apparatus capable of implementing various aspects of this disclosure. As with other figures provided herein, the types and numbers of elements shown in Figure 1B are merely provided by way of example. Other implementations may include more elements, fewer elements, different types of elements, or combinations thereof. According to some examples, the apparatus 101 may be, or may include, a device that is configured for performing at least some of the methods disclosed herein, such as a smart audio device, a laptop computer, a cellular telephone, a tablet device, a smart home hub, etc. In some such implementations the apparatus 101 may be, or may include, a server that is configured for performing at least some of the methods disclosed herein.
[0028] In this example, the apparatus 101 includes an interface system 105 and a control system 110. In some implementations, the control system 110 may be configured for performing, at least in part, the methods disclosed herein. The control system 110 may, in some implementations, be configured for compensating for environmental noise during a teleconference.
[0029] In some examples, the control system 110 may be configured for estimating a current speech spectrum corresponding to speech of remote teleconference participants. According to some examples, the control system 110 may be configured for estimating a current noise spectrum corresponding to environmental noise in a local environment in which a local teleconference participant is located. In some examples, the control system 110 may be configured for calculating a current speech intelligibility index (SII) based, at least in part, on the current speech spectrum and the current noise spectrum. In some examples, the control system 110 may be configured for determining, based at least in part on the current SII, whether to make an adjustment of a local audio system used by the local teleconference participant. In some examples, the determining may involve evaluating the current SII according to one or more target SII parameters. According to some examples, the control system 110 may be configured for updating at least one of the one or more target SII parameters responsive to user input corresponding to a playback volume change.
[0030] The interface system 105 may include one or more network interfaces and/or one or more external device interfaces (such as one or more universal serial bus (USB) interfaces). According to some implementations, the interface system 105 may include one or more wireless interfaces. The interface system 105 may include one or more devices for implementing a user interface, such as one or more microphones, one or more speakers, a display system, a touch sensor system and/or a gesture sensor system. In some examples, the interface system 105 may include one or more interfaces between the control system 110 and a memory system, such as the optional memory system 115 shown in Figure 1B. However, the control system 110 may include a memory system in some instances.
[0031] The control system 110 may, for example, include a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components.
[0032] In some implementations, the control system 110 may reside in more than one device. For example, a portion of the control system 110 may reside in a device within an environment (such as a laptop computer, a tablet computer, a smart audio device, etc.) and another portion of the control system 110 may reside in a device that is outside the environment, such as a server. In other examples, a portion of the control system 110 may reside in a device within an environment and another portion of the control system 110 may reside in one or more other devices of the environment.
[0033] Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. The one or more non-transitory media may, for example, reside in the optional memory system 115 shown in Figure 1B and/or in the control system 110. Accordingly, various innovative aspects of the subject matter described in this disclosure can be implemented in one or more non-transitory media having software stored thereon. The software may, for example, include instructions for controlling at least one device to process audio data. The software may, for example, be executable by one or more components of a control system such as the control system 110 of Figure 1B.
[0034] In some examples, the apparatus 101 may include the optional microphone system 120 shown in Figure 1B. The optional microphone system 120 may include one or more microphones. In some implementations, one or more of the microphones may be part of, or associated with, another device, such as a loudspeaker, a smart audio device, etc.
[0035] According to some implementations, the apparatus 101 may include the optional loudspeaker system 125 shown in Figure 1B. The optional loudspeaker system 125 may include one or more loudspeakers. Loudspeakers may sometimes be referred to herein as “speakers.” In some examples, at least some loudspeakers of the optional loudspeaker system 125 may be arbitrarily located. For example, at least some speakers of the optional loudspeaker system 125 may be placed in locations that do not correspond to any standard prescribed speaker layout, such as Dolby 5.1, Dolby 5.1.2, Dolby 7.1, Dolby 7.1.4, Dolby 9.1, Hamasaki 22.2, etc. In some such examples, at least some loudspeakers of the optional loudspeaker system 125 may be placed in locations that are convenient to the space (e.g., in locations where there is space to accommodate the loudspeakers), but not in any standard prescribed loudspeaker layout.
[0036] In some implementations, the apparatus 101 may include the optional sensor system 130 shown in Figure 1B. The optional sensor system 130 may include a touch sensor system, a gesture sensor system, one or more cameras, etc.
[0037] In some implementations, the apparatus 101 may include the optional display system 135 shown in Figure 1B. The optional display system 135 may include one or more displays, such as one or more light-emitting diode (LED) displays. In some instances, the optional display system 135 may include one or more organic light-emitting diode (OLED) displays. In some examples wherein the apparatus 101 includes the display system 135, the sensor system 130 may include a touch sensor system and/or a gesture sensor system proximate one or more displays of the display system 135. According to some such implementations, the control system 110 may be configured for controlling the display system 135 to present a graphical user interface (GUI), such as a GUI related to implementing one of the methods disclosed herein.
[0038] As noted above, reliable noise estimation and noise reduction techniques can help meeting participants to better understand what is being said by other teleconference participants. Recent developments that involve neural-network-based approaches have helped to resolve the noise reduction problem. Neural-network-based methods are often capable of removing many types of noise, provided that the neural network training process has provided the neural network with sufficient exposure to each particular type of noise that is to be removed. However, it may be difficult or even impossible to train a neural network to remove every possible type of noise.
[0039] Accordingly, noise compensation is still an important aspect of providing audio that is acceptable for teleconferencing. The basic principle of noise compensation is to adjust the playback volume up and down based on the ambient noise sensed by the microphone. In some examples, noise compensation may be applied on a per-band basis, with different gains applied to different frequency bands.
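As a rough illustration of per-band noise compensation, the sketch below raises each band's gain just enough to restore an assumed target signal-to-noise ratio; the 12 dB target and the gain clamp are hypothetical tuning values, not taken from this disclosure.

```python
# Minimal per-band noise compensation sketch: each band's playback gain is
# raised just enough to keep an assumed target signal-to-noise ratio.
# The 12 dB target SNR and the 12 dB gain limit are illustrative assumptions.

def band_compensation_gains_db(speech_power_db, noise_power_db,
                               target_snr_db=12.0, max_gain_db=12.0):
    gains = []
    for s, n in zip(speech_power_db, noise_power_db):
        # Gain needed so that speech exceeds noise by the target SNR.
        needed = (n + target_snr_db) - s
        # Never attenuate, and cap the boost to avoid runaway volume.
        gains.append(min(max(needed, 0.0), max_gain_db))
    return gains
```

Bands in which the speech already exceeds the noise by the target margin receive no boost, which keeps the compensation from reacting to quiet bands.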
[0040] In the context of noise compensation, a robust and stable stationary noise estimator is important. The quality of the stationary noise estimator generally determines the overall system performance. The basic functions of a stationary noise estimator include:
• The stationary noise estimator should only estimate “stationary” noise to prevent the volume from being adjusted up or down rapidly in the case of occasional (dynamic) noise, e.g., a dog’s barking, sounds caused by tapping on a table, sounds caused by keyboard strokes, etc. As used herein, the term “stationary noise” does not refer to noise types having “strict-sense stationarity,” in which the statistical characteristics do not change over time, but instead refers to noise types having “wide-sense stationarity,” in which the first moment and autocovariance do not vary with respect to time and in which the second moment is finite for all times. Stationary noise is one type of “stationary process,” which is a stochastic process whose unconditional joint probability distribution does not change when shifted in time.
• The stationary noise estimator should only estimate “true” ambient noise of the local environment, not the noise in the echo signal that is played back. For example, referring to Figure 1A, the stationary noise estimator should only estimate the environmental noise 104, not the loudspeaker playback or “echo” sounds 106. If the estimated noise is actually caused primarily by echo (echo noise is dominant), then the environmental noise compensation (ENC) system will form a positive feedback loop, which will cause any gain adjustment to result in the maximum or the minimum volume setting.
[0041] Figure 1C is a flow diagram that outlines an example of a method that may be performed by an apparatus or system such as those disclosed herein. The blocks of method 140, like other methods described herein, are not necessarily performed in the order indicated. In some implementations, one or more of the blocks of method 140 may be performed concurrently. Moreover, some implementations of method 140 may include more or fewer blocks than shown and/or described.
[0042] According to this example, method 140 is a method of compensating for environmental noise during a teleconference. The blocks of method 140 may be performed by one or more devices, which may be (or may include) a control system such as the control system 110 that is shown in Figure 1B and described above. According to this example, the blocks of method 140 are repeated for each new block of audio data.
[0043] In this example, method 140 involves calculating the current speech intelligibility index (SII) and making an incremental change in the system — such as a playback volume change — if necessary, such that the system stays within a speech intelligibility index target. The system may include a device that is configured to provide a teleconference. In some examples, the system may include a laptop, having a loudspeaker and a microphone, that is configured to provide a teleconference and is being used by a local participant during the teleconference. The output of the loudspeaker during the teleconference is one example of what is referred to as the “speech of remote teleconference participants” in this document. Referring again to Figure 1A, the audio captured by the local microphone will include the environmental noise 104 and the speech 102 of the local teleconference participant. For a noise detector, the speech 102 may be regarded as a type of interference.
[0044] According to the example shown in Figure 1C, block 145 involves obtaining a frame of audio data from a local microphone. In this example, block 150 involves estimating a speech spectrum corresponding to the speech 102 in the current frame of audio data from the microphone. According to this example, block 155 involves calculating the current noise spectrum corresponding to the environmental noise 104 for the current frame of audio data from the microphone. In some examples, blocks 150 and 155 may be performed concurrently. According to some examples, the noise spectrum is based in part on historical data. In some examples, block 155 will keep updating the noise spectrum as new data frames are received. In some examples, the current noise spectrum determined by block 155 is for the current frame of audio data from the microphone and is used to determine SII and other parameters. In this example, block 160 involves calculating the current SII, which is the SII corresponding to the current frame of audio data from the microphone. According to this example, block 165 involves determining, based at least in part on the current SII, whether to make an adjustment of a local audio system used by the local teleconference participant.
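The per-frame flow of blocks 145 through 165 can be summarized in Python. The helper functions passed in (estimate_speech_spectrum, estimate_noise_spectrum, compute_sii) and the SII limits are placeholders standing in for the estimators and targets described elsewhere in this disclosure, not a real API.

```python
# High-level sketch of the per-frame flow of method 140 (blocks 145-165).
# The injected callables and SII limits are illustrative assumptions.

def process_frame(frame, state,
                  estimate_speech_spectrum, estimate_noise_spectrum,
                  compute_sii, sii_low=0.35, sii_high=0.65):
    speech = estimate_speech_spectrum(frame)        # block 150
    noise = estimate_noise_spectrum(frame, state)   # block 155 (uses history)
    sii = compute_sii(speech, noise)                # block 160
    # Block 165: decide whether the local audio system needs adjusting.
    if sii < sii_low:
        return sii, "increase_volume"
    if sii > sii_high:
        return sii, "decrease_volume"
    return sii, "no_change"
```

Blocks 150 and 155 have no data dependency on each other, which is why the text notes they may run concurrently.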
[0045] In some examples, block 150 may involve estimating the speech spectrum as described in Methods for Calculation of the Speech Intelligibility Index, which was originally published in 1969 and was revised in 1997 by the American National Standards Institute and the Acoustical Society of America (hereinafter ASA/ANSI S3.5-1997), which is hereby incorporated by reference. For example, block 150 may involve estimating the speech spectrum as described in the “Methods for determining input variables for SII calculation” sections on pages 11-13 of ASA/ANSI S3.5-1997. However, in some alternative examples, block 150 may involve estimating the speech spectrum according to one or more other methods.
[0046] In some alternative examples, block 150 may involve estimating the speech spectrum according to a modified version of what is described in ASA/ANSI S3.5-1997. Some such examples involve using the “insertion gain” somewhat differently from what is described in ASA/ANSI S3.5-1997. In ASA/ANSI S3.5-1997, for an amplification or attenuation device worn by a listener, at a specific frequency the insertion gain is the difference in decibels between the pure-tone sound pressure level (SPL) at the eardrum with the amplification/attenuation device in place and the pure-tone SPL at the eardrum with the amplification/attenuation device removed. The insertion gain is used to calculate the equivalent speech spectrum level (see section 5.1.3, starting on page 12 of ASA/ANSI S3.5-1997) and is also used to calculate the equivalent noise spectrum level (see section 5.1.4, on page 13 of ASA/ANSI S3.5-1997). For example, according to ASA/ANSI S3.5-1997, if the speech spectrum is measured at the center of a listener’s head, the equivalent speech spectrum for a particular frequency band is calculated as the measured speech spectrum level for the particular frequency band plus the insertion gain for the particular frequency band.
[0047] Some disclosed examples involve mapping the equivalent SPL of the playback to this insertion gain. In some examples, the current hardware/software gain that is applied to the loudspeaker is converted to an equivalent SPL at the listener position. According to some examples, this conversion is aided by a tuning parameter during the tuning process for a particular device, such as a particular laptop model. In some examples, the nominal SPL associated with the “normal” speech spectrum is subtracted from this equivalent SPL. The resulting value is used as the insertion gain, with the same value used for all frequency bands.
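A minimal sketch of this insertion-gain mapping follows. The per-device tuning offset is an assumed parameter, and the 62.35 dB figure is used here only as an assumed nominal overall level for the “normal” speech spectrum.

```python
# Sketch of the insertion-gain mapping described above. The tuning offset
# and the nominal speech level are illustrative assumptions.
NOMINAL_NORMAL_SPEECH_SPL_DB = 62.35  # assumed nominal "normal" speech SPL

def insertion_gain_db(playback_gain_db, device_tuning_offset_db):
    # Convert the applied hardware/software gain to an equivalent SPL at
    # the listener position using a per-device tuning offset...
    equivalent_spl_db = playback_gain_db + device_tuning_offset_db
    # ...then subtract the nominal SPL of the "normal" speech spectrum.
    # The same resulting value is applied to all frequency bands.
    return equivalent_spl_db - NOMINAL_NORMAL_SPEECH_SPL_DB
```

The device tuning offset would be measured once per device model during the tuning process mentioned in the text.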
[0048] In some examples, block 155 may involve calculating the current noise spectrum corresponding to the environmental noise 104 in the current frame of audio data from the microphone according to the methods described in ASA/ANSI S3.5-1997, for example on page 13. However, in some disclosed examples, block 155 may involve calculating the current noise spectrum according to alternative methods.
[0049] According to some such methods, block 155 involves calculating the current noise spectrum using a novel noise estimator (NE), examples of which are described in more detail with reference to Figures 2A-3A. Some implementations of the new noise estimator include at least a first or main noise estimator and a second, auxiliary noise estimator. The second noise estimator may be configured to respond to noise level changes relatively faster than the first or main noise estimator.
[0050] In some examples block 160 may involve determining, based on the current speech spectrum estimated by block 150 and the current noise spectrum estimated by block 155, the current SII as a value between 0 and 1, where 1 means very intelligible and 0 means not intelligible at all. In some examples, block 160 may involve determining the current SII based on the current speech spectrum, the current noise spectrum and a measured or assumed hearing threshold level.
[0051] According to some such examples, the SII that is calculated in block 160 may be the SII metric described in ASA/ANSI S3.5-1997. In some such examples, the SII metric may be calculated as described in the “Methods for calculating Speech Intelligibility Index” section on pages 9-11 of ASA/ANSI S3.5-1997, which is specifically incorporated herein by reference.
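For orientation, a heavily simplified SII-style calculation is sketched below: per-band audibility is the speech-to-noise ratio clamped to ±15 dB and mapped to [0, 1], then weighted by band-importance weights. The uniform weights here are placeholders; the full ASA/ANSI S3.5-1997 procedure additionally accounts for masking, hearing thresholds, and the standard's band-importance functions.

```python
# Heavily simplified SII-style calculation, for illustration only.
# Uniform weights are placeholders, not the standard's band-importance values.

def simplified_sii(speech_db, noise_db, weights=None):
    n = len(speech_db)
    if weights is None:
        weights = [1.0 / n] * n  # placeholder uniform band importance
    sii = 0.0
    for s, d, w in zip(speech_db, noise_db, weights):
        snr = s - d
        # Band audibility: SNR clamped to [-15, +15] dB, mapped to [0, 1].
        audibility = min(max((snr + 15.0) / 30.0, 0.0), 1.0)
        sii += w * audibility
    return sii  # 1 means very intelligible, 0 means not intelligible at all
```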
[0052] According to some examples, the current SII may be determined as follows:
sii[current] = sii[current-1] * sii_alpha + instantaneous_sii * (1 - sii_alpha)
In the foregoing equation, sii[current] represents the current smoothed SII value, sii[current-1] represents the smoothed SII value calculated in the previous block, sii_alpha represents a tuning parameter, and instantaneous_sii represents the immediate output of the metric described in ASA/ANSI S3.5-1997. In one example, the tuning parameter sii_alpha may be 2^(-0.02/0.15), where 0.02 represents the block length in seconds. Other examples may use larger or smaller tuning parameters.
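The smoothing equation above amounts to a one-pole low-pass filter, which might be coded as follows; the 20 ms block length and the 0.15 value follow the example tuning parameter given in the text.

```python
# One-pole smoothing of the instantaneous SII, matching the equation above.
BLOCK_LEN_S = 0.02   # 20 ms block length, from the example in the text
TIME_CONST_S = 0.15  # the 0.15 value in the example tuning parameter
SII_ALPHA = 2 ** (-BLOCK_LEN_S / TIME_CONST_S)

def smooth_sii(previous_sii, instantaneous_sii):
    # sii[current] = sii[current-1] * alpha + instantaneous_sii * (1 - alpha)
    return previous_sii * SII_ALPHA + instantaneous_sii * (1.0 - SII_ALPHA)
```

A larger sii_alpha smooths more heavily (slower response); a smaller one tracks the instantaneous SII more closely.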
[0053] Figure 2A shows example blocks of a novel noise estimator according to some examples. In this example, the noise estimator 200 includes a first noise estimator 210 — also referred to herein as a main noise estimator 210 — a second noise estimator 220 — also referred to herein as an auxiliary noise estimator 220 — and a fast tracking flag module 230. According to this example, the main noise estimator 210, the auxiliary noise estimator 220 and the fast tracking flag module 230 are implemented by an instance of the control system 110 that is described with reference to Figure 1B. As with other figures provided herein, the types and numbers of elements shown in Figure 2A are merely provided by way of example. Other implementations may include more elements, fewer elements, different types of elements, different arrangements of elements, or combinations thereof.
[0054] In the examples shown in Figure 2A, the main noise estimator 210 and the auxiliary noise estimator 220 are both configured to receive loudspeaker audio data 222 and microphone audio data 224, for example as described with reference to block 145 of Figure 1C. In some examples, the loudspeaker audio data 222 may be, or may include, a reference signal corresponding to what is being played back by a local loudspeaker.
[0055] According to this example, the main noise estimator 210 is configured to determine and output a noise estimate 208 and a confidence metric 207 — also referred to herein as a “confidence value” — based at least in part on the loudspeaker audio data 222 and the microphone audio data 224. In this example, the noise estimate 208 and the confidence metric 207 correspond to a current noise spectrum, which in turn corresponds to a current input audio frame of the microphone audio data 224. The confidence metric 207 indicates a likelihood of the current input audio frame corresponding mainly to ambient noise, such as the environmental noise 104 that is described with reference to Figure 1A. In some examples, the main noise estimator 210 may be configured to determine the noise estimate 208, the confidence metric 207, or both, based at least in part on fast tracking flags 228 from the fast tracking flag module 230. Example blocks and functionalities of the main noise estimator 210 are described in more detail below with reference to Figures 2B and 3A.
[0056] In this example, the auxiliary noise estimator 220 is configured to determine and output a noise estimate 226 based at least in part on the loudspeaker audio data 222 and the microphone audio data 224. Here, noise estimate 226 corresponds to the current input audio frame of the microphone audio data 224. According to some disclosed examples, the auxiliary noise estimator 220 is configured to respond to noise level changes — and to produce a responsive noise estimate 226 — relatively faster than the main noise estimator 210, thereby increasing the response speed (decreasing the response time) of the noise estimator 200.
[0057] According to this example, the fast tracking flag module 230 is configured to determine and output fast tracking flags 228, based at least in part on the noise estimates 226 and 208. In some examples, the main noise estimator 210 may be configured to converge to a new noise level relatively faster if the main noise estimator 210 has received a fast tracking flag 228.
[0058] In some examples, if the noise estimate 226 from the auxiliary noise estimator 220 indicates that the power of the wideband noise is greater than the power of the wideband noise indicated by the noise estimate 208 from the main noise estimator 210 by a threshold and for a determined time interval, the fast tracking flag module 230 is configured to set a value of the fast tracking flag 228 to a maximum value, for example to 1. In other words, the fast tracking flag module 230 may be configured to set a value of the fast tracking flag 228 to a maximum value, for example to 1, upon determining that the difference between (1) the noise power estimated by the auxiliary noise estimator 220 and (2) the noise power estimated by the main noise estimator 210 is greater than a difference threshold for at least a first time threshold. According to some examples, upon determining that the difference between (1) and (2) is less than the difference threshold for at least a second time threshold, the fast tracking flag module 230 may be configured to set a value of the fast tracking flag 228 to a minimum value, for example to 0. The first time threshold may or may not be equal to the second time threshold. In some examples, the power threshold may be in the range from 1 dB to 5 dB, e.g., 1 dB, 2 dB, 3 dB, 4 dB or 5 dB. According to some examples, the time constant may be in the range from 0.5 seconds to 2 seconds, e.g., 0.5 seconds, 0.6 seconds, 0.7 seconds, 0.8 seconds, 0.9 seconds, 1.0 seconds, 1.1 seconds, 1.2 seconds, 1.3 seconds, 1.4 seconds, 1.5 seconds, 1.6 seconds, 1.7 seconds, 1.8 seconds, 1.9 seconds or 2 seconds.
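The fast-tracking decision described above can be sketched as a counter-based hysteresis. The 3 dB threshold and the frame-count hold times below are assumptions chosen within the ranges given in the text.

```python
# Sketch of the fast-tracking flag: set when the auxiliary estimate exceeds
# the main estimate by a threshold for long enough, cleared when the
# difference stays below the threshold for long enough. Constants assumed.

class FastTrackingFlag:
    def __init__(self, diff_threshold_db=3.0, hold_frames=50):
        self.diff_threshold_db = diff_threshold_db
        self.hold_frames = hold_frames  # e.g. ~1 s of 20 ms frames
        self.above = 0
        self.below = 0
        self.flag = 0

    def update(self, aux_noise_db, main_noise_db):
        if aux_noise_db - main_noise_db > self.diff_threshold_db:
            self.above += 1
            self.below = 0
        else:
            self.below += 1
            self.above = 0
        if self.above >= self.hold_frames:
            self.flag = 1   # main estimator should converge quickly
        elif self.below >= self.hold_frames:
            self.flag = 0
        return self.flag
```

Requiring the condition to persist for a hold time prevents short noise bursts from triggering fast tracking.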
[0059] According to some examples, the auxiliary noise estimator 220 may be configured with different tuning parameters than those of the main noise estimator 210. In some alternative examples, the auxiliary noise estimator 220 may be based on a deep neural network. Such noise estimators may have a very fast response.
[0060] Figure 2B shows blocks of the main noise estimator of Figure 2A according to some disclosed implementations. In this example, the main noise estimator 210 includes a voice activity detector (VAD) 201, an echo coupling gain estimator 202, a confidence calculation block 203, and a noise and confidence statistics calculation block 204. According to this example, the VAD 201, the echo coupling gain estimator 202, the confidence calculation block 203, and the noise and confidence statistics calculation block 204 are implemented by an instance of the control system 110 that is described with reference to Figure 1B. As with other figures provided herein, the types and numbers of elements shown in Figure 2B are merely provided by way of example. Other implementations may include more elements, fewer elements, different types of elements, different arrangements of elements, or combinations thereof.
[0061] In this example, the VAD 201 is configured to detect local speech activity based on the microphone audio data 224 and to send speech activity signals 211 to the echo coupling gain estimator 202, the confidence calculation block 203 and the noise and confidence statistics calculation block 204. According to some examples, only when the speech activity signals 211 indicate that there is no local speech activity are the echo coupling gain estimator 202, the confidence calculation block 203 and the noise and confidence statistics calculation block 204 active.
[0062] According to this example, the echo coupling gain estimator 202 is configured to estimate, based on the microphone audio data 224 and the loudspeaker audio data 222, coupling gains from loudspeaker playback to microphone capturing for each of a plurality of frequency bands and to provide corresponding coupling gain estimations 205 to the confidence calculation block 203. In other words, with reference to the scenario depicted in Figure 1A, the echo coupling gain estimator 202 is configured to estimate the contribution of the echo 106 to the sounds detected by the microphone 112 and that are present in the microphone audio data 224 and to provide corresponding coupling gain estimations 205. In the example shown in Figure 2B, the loudspeaker audio data 222 is, or includes, reference signals corresponding to what is being played back by a local loudspeaker. Figure 3A shows more details of the echo coupling gain estimator 202 according to some examples.
[0063] In this example, the confidence calculation block 203 is configured to calculate, for each frequency band, the likelihood that ambient noise is dominant in the current input audio frame of the microphone audio data 224. According to this example, the confidence calculation block 203 is configured to output confidence data 206 to the noise and confidence statistics calculation block 204. In this example, the confidence data 206 includes at least a confidence value corresponding to the current noise spectrum. Here, the confidence value indicates a likelihood of a current input audio frame corresponding mainly to ambient noise. According to this example, the confidence data 206 also includes a binary value (e.g., either 0 or 1) that indicates whether the current input audio frame is likely to be primarily ambient noise and should be processed accordingly.
[0064] According to this example, the noise and confidence statistics calculation block 204 is configured to determine and output a noise estimate 208 and a confidence metric 207, based at least in part on the microphone audio data 224 and the confidence data 206. The confidence metric 207 indicates a likelihood of the current input audio frame corresponding mainly to ambient noise, such as the environmental noise 104 that is described with reference to Figure 1A. In some examples, the noise and confidence statistics calculation block 204 may be configured to determine the noise estimate 208, the confidence metric 207, or both, based at least in part on fast tracking flags 228 (not shown in Figure 2B) from the fast tracking flag module 230 of Figure 2A. Detailed examples of how the noise and confidence statistics calculation block 204 may function are provided below.
[0065] Figure 3A shows blocks of the echo coupling gain estimator of Figure 2B according to some disclosed implementations. In this example, the echo coupling gain estimator 202 includes a delay line module 301, a minimum follower 302, a threshold detector 303, a maximum follower 304, an update control block 305 and a subtraction node 312. According to this example, the delay line module 301, the minimum follower 302, the threshold detector 303, the maximum follower 304, the update control block 305 and the subtraction node 312 are implemented by an instance of the control system 110 that is described with reference to Figure 1B. According to this example, the echo coupling gain estimator 202 only operates when there is no local speech activity, as determined by the VAD 201 (see Figure 3A). As with other figures provided herein, the types and numbers of elements shown in Figure 3A are merely provided by way of example. Other implementations may include more elements, fewer elements, different types of elements, different arrangements of elements, or combinations thereof.
[0066] In the example shown in Figure 3A, the delay line module 301 is a length N delay line and is configured to receive the loudspeaker audio data 222 — which is, or includes, loudspeaker reference signals corresponding to what is being played back by a local loudspeaker. In some examples, N may be in the range from 8 to 16 frames, e.g., 8 frames, 9 frames, 10 frames, 11 frames, 12 frames, 13 frames, 14 frames, 15 frames, 16 frames, etc. According to some examples, each frame may be in the range from 10 milliseconds (ms) to 30 ms, e.g., 10 ms, 12 ms, 14 ms, 16 ms, 18 ms, 20 ms, 22 ms, 24 ms, 26 ms, 28 ms, 30 ms, etc. According to this example, the input loudspeaker reference signals are in the form of frequency banded loudspeaker reference power. In this example, the delay line module 301 is configured to implement a maximum operation (MAX) across the entire delay line. The purpose of letting the loudspeaker reference signals go through a delay line is to compensate for the natural delay between the loudspeaker reference signals and the corresponding microphone capture of the sounds played back by one or more local loudspeakers, in order to take into account the sound wave propagation delay, the electrical circuitry delay, the software buffering delay, etc., as well as the reverberant build-up that may be present in a local room. In this example, the delay line module 301 outputs maximum power data 310, which indicates the maximum of the loudspeaker reference signal power for each frequency band for the current audio frame and for the previous N-1 audio frames.
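The per-band MAX operation across the delay line can be sketched as follows. This is an illustrative Python sketch; the default values of N and the band count are assumptions, and the class and parameter names do not come from the disclosure.

```python
from collections import deque


class DelayLineMax:
    """Sketch of the delay line module 301: stores the last N frames of
    banded loudspeaker reference power and returns the per-band maximum."""

    def __init__(self, n_frames=12, n_bands=48):
        self.buf = deque(maxlen=n_frames)  # holds at most N frames
        self.n_bands = n_bands

    def push(self, banded_ref_power_db):
        """Store the current frame and return, for each frequency band, the
        maximum reference power over the current and previous N-1 frames."""
        self.buf.append(list(banded_ref_power_db))
        return [max(frame[b] for frame in self.buf)
                for b in range(self.n_bands)]
```

The `deque(maxlen=...)` automatically discards the oldest frame once N frames are buffered, which models the sliding window over the current and previous N-1 frames.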
[0067] According to this example, the minimum follower 302 is configured to track the minimum power for each input frequency band for each frame of the microphone audio data 224 and to output minimum power data 311, which indicates the minima of input power for each frequency band of each frame of the microphone audio data 224. In Figure 3A, the “B” indicates the number of frequency bands. In some examples, the time window size of the minimum follower 302 may be dynamically adjusted. For example (referring again to Figure 2A), in some implementations, when a fast-tracking flag 228 is input to the main noise estimator 210, the minimum follower 302 will shorten its window size in order to help the main noise estimator 210 to converge to a solution faster. In one such example, the minimum follower 302 may shorten its window size from 1.0 seconds to 250 ms. Other examples may involve different starting window sizes, different shortened window sizes, or both.
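A minimum follower with an adjustable window can be sketched for one frequency band as follows. This is a hedged sketch under assumed window lengths (e.g., 50 frames for 1.0 second at 20 ms frames); a real implementation may use a more memory-efficient structure than a full history buffer.

```python
from collections import deque


class MinimumFollower:
    """Sketch of the minimum follower 302 for a single frequency band."""

    def __init__(self, window_frames=50):     # e.g., 1.0 s at 20 ms frames
        self.window_frames = window_frames
        self.history = deque()

    def set_window(self, window_frames):
        """Dynamically shrink or grow the window, e.g., to about 250 ms worth
        of frames when a fast-tracking flag is asserted."""
        self.window_frames = window_frames

    def update(self, band_power_db):
        """Add the current frame's band power and return the tracked minimum
        over the most recent window_frames frames."""
        self.history.append(band_power_db)
        while len(self.history) > self.window_frames:
            self.history.popleft()
        return min(self.history)
```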
[0068] In the example shown in Figure 3A, the threshold detector 303 is a simple threshold detector that is configured to determine when the power of the current frame of the input microphone audio data 224 is above a tracked minimum power level by at least a threshold amount and to provide corresponding threshold detector output 313. The tracked minimum power level is highly dependent on the sensitivity of the particular microphone. Therefore, the tracked minimum power level range may vary widely from microphone to microphone. According to some examples, the threshold amount may be in the range from 3dB to 10dB, e.g., 3dB, 4dB, 5dB, 6dB, 7dB, 8dB, 9dB or 10dB. In some examples, the threshold detector output 313 of the threshold detector 303 has a value of 1 whenever the power of the current frame of the input microphone audio data 224 is above the tracked minimum by at least the threshold amount. In some such examples, the threshold detector output 313 has a value of 0 whenever the power of the current frame of the input microphone audio data 224 is not above the tracked minimum by at least the threshold amount.
[0069] In this example, the subtraction node 312 is configured for subtracting (a) the input power of the current frame of input microphone audio data 224 from (b) the maximum power of the loudspeaker reference signals, as determined by the maximum power data 310 output by the delay line module 301, to produce the subtraction node output 315. The subtraction node output 315 represents potential coupling gain estimations.
[0070] According to this example, the maximum follower 304 is configured to determine and output coupling gain estimations 205 based on the subtraction node output 315 and update control signals 317 from the update control module 305.
[0071] In this example, the update control module 305 is configured to determine whether to disallow or allow the subtraction node output 315 to be output by the maximum follower 304 as the current coupling gain estimation 205. According to some such examples, the update control signals 317 control whether the subtraction node output 315 will be received by the maximum follower 304 at all. Even though — according to this example — the echo coupling gain estimator 202 only operates when there is no local speech activity, it is nonetheless possible that local transient noise may still be included in the microphone signals 224. Whenever there is some transient noise in a frequency band, the gain calculated by the subtraction node 312 will not be accurate and should be eliminated.
[0072] However, it can be challenging to determine whether a current input audio frame contains local transient noise or simply resulted from a powerful echo — in other words, from loud loudspeaker playback — without prior knowledge of the coupling gain, which is precisely what the echo coupling gain estimator 202 needs to estimate here.
[0073] This problem can be overcome by using the fact that the coupling gain will generally not change abruptly and continuously for at least N frames, where N is the delay line length of the delay line module 301. Therefore, according to some examples, whenever there is a new maximum of gain, the control system 110 only allows the maximum follower 304 to track the new gain maximum if two conditions are satisfied:
1. The new gain maximum is within a certain range — such as within 1 dB, 2 dB, 3 dB, 4 dB, etc. — of a previously-tracked maximum gain. In some implementations, this condition may be ignored for the first update if there is no measured initial value (see below);
2. There is no new gain maximum within the previous N frames.
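One reading of the two update-control conditions above can be sketched as follows. This is a hypothetical Python sketch: the class name, the 2 dB range, and the interpretation of condition 2 as "at least N frames since the last accepted maximum" are assumptions, not taken from the disclosure.

```python
class GainUpdateControl:
    """Sketch of the update control block 305's gating of the maximum
    follower 304, under the two conditions described above."""

    def __init__(self, n_frames=12, range_db=2.0, initial_gain=None):
        self.n_frames = n_frames
        self.range_db = range_db
        self.tracked_gain = initial_gain   # may be a factory-measured value
        self.frames_since_new_max = n_frames

    def step(self, candidate_gain_db):
        """Return True if the maximum follower may track this new maximum."""
        self.frames_since_new_max += 1
        is_new_max = (self.tracked_gain is None
                      or candidate_gain_db > self.tracked_gain)
        if not is_new_max:
            return False
        # Condition 1: within range of the previously tracked maximum gain
        # (skipped for the first update when no initial value exists).
        cond1 = (self.tracked_gain is None
                 or abs(candidate_gain_db - self.tracked_gain) <= self.range_db)
        # Condition 2 (one reading): no new maximum was accepted within the
        # previous N frames.
        cond2 = self.frames_since_new_max > self.n_frames
        if cond1 and cond2:
            self.tracked_gain = candidate_gain_db
            self.frames_since_new_max = 0
            return True
        return False
```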
[0074] The above-mentioned problem also may be overcome by considering the fact that the coupling gain is primarily a characteristic of the device being used to participate in the teleconference, such as the phone, the laptop, the conferencing endpoint, etc. Although the coupling gain can be affected by the acoustic properties of a room in which the device resides, the coupling gain is mainly determined by the industrial design of the device itself. Once the product is manufactured, this coupling gain may be obtained and, in some examples, this coupling gain may be stored for future use as the initial value.
Examples of Confidence Calculation Block Functionality
[0075] This section includes examples of how the confidence calculation block 203 of Figure 2B may be implemented. Given the coupling gain and a current audio frame of the microphone audio data 224, the confidence calculation block 203 can calculate the likelihood of the current audio frame being predominantly ambient noise. Suppose the coupling gain (in dB) for frequency band b is gb and the maximum reference power 310 is Xb (in dB). The estimated echo power Eb (in dB) may be represented as follows:
Eb = Xb + gb
[0076] The likelihood of the current audio frame of the microphone audio data 224, having power Yb, being ambient noise may be represented as follows:
Lb = 1.0 / (1.0 + e^(-6.0(Yb - Eb)/σ))
[0077] In the foregoing equation, σ represents a threshold value, which may be 4 dB, 5 dB, 6 dB, 7 dB, 8 dB, etc.
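The two equations above can be transcribed directly. This Python sketch assumes all quantities are in dB; the function and parameter names are illustrative.

```python
import math


def ambient_noise_likelihood(mic_power_db, ref_max_power_db,
                             coupling_gain_db, sigma=6.0):
    """Logistic likelihood that the current frame in band b is ambient noise.

    Eb = Xb + gb is the estimated echo power; the likelihood rises toward 1.0
    as the microphone power Yb increasingly exceeds the echo estimate."""
    echo_db = ref_max_power_db + coupling_gain_db          # Eb = Xb + gb
    return 1.0 / (1.0 + math.exp(-6.0 * (mic_power_db - echo_db) / sigma))
```

When the microphone power equals the echo estimate, the likelihood is exactly 0.5; it saturates toward 1 when the frame is far louder than the predicted echo, and toward 0 when the predicted echo dominates.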
[0078] According to some examples, the confidence calculation block 203 may determine and send an indication flag Ib — corresponding to frequency band b — to the noise and confidence statistics calculation block 204. In some such examples, the confidence calculation block 203 may determine the indication flag Ib by implementing the following set of conditions:
Ib = 1, if Lb is greater than a likelihood threshold, or if nb > Yb
Ib = 0, otherwise
[0079] In the foregoing conditions, nb represents the estimated noise mean for frequency band b. The second condition (nb > Yb) allows noise statistics to be updated once the noise is gone. Even though the likelihood of the current frame being ambient noise may be dubious under such conditions, the previously estimated level may no longer be sensible.
Noise and Confidence Statistics Calculations
[0080] This section includes examples of how the noise and confidence statistics calculation block 204 of Figure 2B may be implemented. In the example shown in Figure 3A, the input to the noise and confidence statistics calculation block 204 includes the minimum power data 311 from the minimum follower 302. The minimum power data 311 indicates the minima of input power for each frequency band of each frame of the microphone audio data 224. According to this example, the input to the noise and confidence statistics calculation block 204 also includes the microphone signal power Yb and the indication flag Ib, both of which correspond to frequency band b.
[0081] To prevent transient noise from being taken into account and to increase the estimation accuracy, in some examples the noise and confidence statistics calculation block 204 only takes Yb into the accumulation if Ib = 1 and if the following condition is met: Yb < Yb,min + β
[0082] In the foregoing equation, Yb,min represents the tracked minimum of Yb and β represents a threshold, such as 8 dB, 9 dB, 10 dB, 11 dB, 12 dB, etc. In some examples, the confidence statistics, which take Lb as input, are updated only when the noise accumulator is being updated. According to some examples, the output of the noise and confidence statistics calculation block 204 includes nb, the estimated ambient noise power for each band, and pb, the confidence value corresponding to the current ambient noise power estimation.
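The gated accumulation above can be sketched for one frequency band as follows. The gating condition is taken from the text; the running-mean smoothing, the smoothing coefficient, and the names are assumptions, since the disclosure does not specify the accumulator's form.

```python
class NoiseStats:
    """Sketch of the per-band behavior of the noise and confidence
    statistics calculation block 204."""

    def __init__(self, beta_db=10.0, alpha=0.1):
        self.beta_db = beta_db       # the threshold beta from the text
        self.alpha = alpha           # assumed smoothing coefficient
        self.noise_mean_db = None    # nb, estimated ambient noise power
        self.confidence = 0.0        # pb, corresponding confidence value

    def update(self, y_db, y_min_db, indication_flag, likelihood):
        """Accumulate Yb only when Ib == 1 and Yb < tracked minimum + beta;
        the confidence statistic is updated only on those same frames."""
        if indication_flag == 1 and y_db < y_min_db + self.beta_db:
            if self.noise_mean_db is None:
                self.noise_mean_db = y_db
                self.confidence = likelihood
            else:
                self.noise_mean_db += self.alpha * (y_db - self.noise_mean_db)
                self.confidence += self.alpha * (likelihood - self.confidence)
        return self.noise_mean_db, self.confidence
```

Note how a transient frame far above the tracked minimum leaves both statistics untouched, which is the stated purpose of the gate.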
Possible Uses of Output from the Noise and Confidence Statistics Calculation Block
[0083] In some implementations, the confidence value pb may be used to generate a wideband confidence flag P to control the behaviour of noise compensation control logic, for example as follows:
P = H(Σb wb · pb)
In the foregoing equation, wb represents a weighting factor wherein Σb wb = 1, and H represents a hysteresis function outputting 0 or 1 with two hysteresis curve thresholds. According to some implementations, the compensation logic or mechanism(s) shall only be active when P = 1. In some examples, whenever P = 0 and the current volume (and/or other settings) are not at the user’s setting(s), the compensation logic or mechanism(s) may cause the setting(s) to revert to the user’s setting(s).
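The wideband flag with two-threshold hysteresis can be sketched as follows. The weighted sum follows the equation above; the specific on/off thresholds are assumptions for illustration.

```python
class WidebandConfidence:
    """Sketch of P = H(sum_b wb * pb) with two hysteresis thresholds."""

    def __init__(self, weights, on_th=0.7, off_th=0.4):
        assert abs(sum(weights) - 1.0) < 1e-9   # weights must sum to 1
        self.weights = weights
        self.on_th, self.off_th = on_th, off_th
        self.flag = 0

    def update(self, band_confidences):
        """Return the wideband flag P for the current frame."""
        s = sum(w * p for w, p in zip(self.weights, band_confidences))
        if self.flag == 0 and s >= self.on_th:
            self.flag = 1       # compensation may become active
        elif self.flag == 1 and s <= self.off_th:
            self.flag = 0       # compensation reverts toward user settings
        return self.flag
```

The gap between the two thresholds prevents the flag, and hence the compensation logic, from chattering when the weighted confidence hovers near a single threshold.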
Determining and Applying System Adjustments
[0084] After the control system has estimated the current SII, in some implementations the control system will implement a multi-step process in order to determine what the target SII is, whether to change the system and, if so, how to change the system to achieve this target SII. In some examples, this process corresponds to block 165 of Figure 1C. Accordingly, this section describes various actions that may be performed according to various implementations of block 165.
[0085] In some examples, block 165 may calculate and apply the one or more changes to the system such that the SII calculated in block 160 of Figure 1C is within a target SII range. Such changes may include, but are not limited to, the following:
• A change in the hardware/software gain of the speaker (e.g., volume control);
• An increase in the gain would map to an increase in SII; similar logic applies for a decrease;
• A change in the equalization (EQ);
• A change in the volume leveler effect of Dolby audio processing (DAP).
[0086] In some examples, such control(s) will only be implemented if the wideband confidence indication is high (for example, 1), in which case there is a high confidence level that the estimated noise spectrum is indeed from background noise. If the confidence indicator is low (for example, 0) and any of the above controls are not at the user setting, in some examples the control system will cause the user setting(s) to be restored.
[0087] Following are various examples of how to ensure that the SII stays within a target SII range. Figure 3B shows a plot of estimated SII over time according to one example. The estimated SII may, for example, correspond to the output of block 160 of Figure 1C. Figure 3B illustrates how SII can fluctuate over a period of time. During this period of time, in this example block 165 does not suggest any system change(s), because the SII remains within the high and low targets, which are shown in Figure 3B as “high_target_sii” and “low_target_sii.” In this case, block 165 is in what is referred to herein as a “ref_gain_adj = NONE” state, indicating that no gain adjustments will be made to the audio being played back in a local system that is providing a teleconference to a local teleconference participant.
[0088] Figure 4 shows a plot of estimated SII over time according to another example. Figure 4 shows what happens when the SII increases beyond the high_target_sii. Between the start of time and t1, block 165 does not suggest any volume change (ref_gain_adj = NONE state). At t1, the SII value has breached the range, going beyond the high_target_sii. In this case, block 165 will suggest a system change, and will continue to do so in subsequent frames (with optional pauses between suggestions as described below) until the SII is between dec_aim_target_sii and low_target_sii. Until this is achieved, block 165 is in a ref_gain_adj = INC_REQ state.
[0089] The dec_aim_target_sii value may be the same as the target_sii or may differ from the target_sii based on knowledge of how the system will react and how the SII will tend to fluctuate. For example, in some systems it may be known that the SII may be underestimated compared to the true SII due to a slow noise estimate adaptation. In these cases, a dec_aim_target_sii that is slightly lower than the target_sii (but higher than the low_target_sii) may be preferable. Once the SII is between dec_aim_target_sii and low_target_sii (at t2), in this example the system state will go back to the behavior described with reference to Figure 3B. In some implementations, the dec_aim_target_sii is the same as the target_sii.
[0090] One example of a change to the system that may be implemented is to decrease the volume. The decrease in volume could be specified such that it is proportional to the difference between the current SII and the SII being aimed for. In one implementation, the control system may determine a possible gain change in dB, for example a possible gain change such that: target_gain_change = gain_dec_delta * (1 + dec_speed * (sii - dec_aim_target_sii))
[0091] In the foregoing equation, gain_dec_delta represents a first tuning parameter, which in one example is 0.6, and dec_speed represents a second tuning parameter, which in one example is 1.0. Other examples may involve different tuning parameters. In some alternative examples, the first tuning parameter may be 0.5, 0.55, 0.65, 0.7, etc., and the second tuning parameter may be 0.9, 0.95, etc.
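The decrease-direction gain-change equation above can be transcribed directly, using the example tuning values from the text.

```python
GAIN_DEC_DELTA = 0.6   # example value of the first tuning parameter
DEC_SPEED = 1.0        # example value of the second tuning parameter


def target_gain_change_dec(sii, dec_aim_target_sii):
    """Suggested volume decrease in dB, growing with how far the current
    SII sits above the aim target (per the equation above)."""
    return GAIN_DEC_DELTA * (1.0 + DEC_SPEED * (sii - dec_aim_target_sii))
```

For example, with an SII of 0.8 and a dec_aim_target_sii of 0.6, the suggested change is 0.6 * (1 + 0.2) = 0.72 dB.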
[0092] In some implementations, whenever the control system determines a possible change to the system, the control system may be configured to optionally wait for a time interval, which may be measured in input audio frames, to allow the system to stabilize before checking the current SII and suggesting a new change. The suggested change may or may not be implemented. For example, if a feature corresponding to one or more disclosed methods has been switched off or disabled — for example, according to user input — in some examples, the suggested change will not be implemented. In some such implementations, the amount of time that the control system is configured to wait between checking the current SII and suggesting a new change may be proportional to how close the SII is to the dec_aim_target_sii. In some implementations, the control system may first calculate:
Diff_to_target = sii - dec_aim_target_sii
Diff_range = (high_target_sii - low_target_sii) / 2
[0093] Based on the value of Diff_to_target, the control system may calculate a wait factor, for example as follows:
Wait_factor = 1, if 0 >= Diff_to_target or Diff_to_target >= Diff_range
Wait_factor = 1, if Diff_to_target > Diff_range/3
Wait_factor = 2, if Diff_to_target > Diff_range/5
Wait_factor = 3, if Diff_to_target > Diff_range/7
Wait_factor = 4, in all other cases
[0094] In some such examples, the number of input audio frames corresponding to the waiting time interval is equal to:
Wait_frames = gain_adj_holdon_frames * wait_factor
[0095] In the foregoing equation, gain_adj_holdon_frames represents a tuning parameter. In one implementation the tuning parameter is 96 for a 20ms block length. In other implementations, the tuning parameter may be 90, 92, 94, 98 or 100 for a 20ms block length.
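The wait-interval logic above can be sketched as follows. The cascade of conditions is read here as a series of checks evaluated in order from largest to smallest difference, which reproduces the listed values; this reading, and the function name, are assumptions.

```python
GAIN_ADJ_HOLDON_FRAMES = 96   # example tuning value for 20 ms blocks


def wait_frames(sii, dec_aim_target_sii, high_target_sii, low_target_sii):
    """Number of input audio frames to wait before re-checking the SII."""
    diff = sii - dec_aim_target_sii                     # Diff_to_target
    diff_range = (high_target_sii - low_target_sii) / 2.0
    if diff <= 0 or diff >= diff_range:
        factor = 1
    elif diff > diff_range / 3:
        factor = 1
    elif diff > diff_range / 5:
        factor = 2
    elif diff > diff_range / 7:
        factor = 3
    else:
        factor = 4
    return GAIN_ADJ_HOLDON_FRAMES * factor
```

The effect is that the control system waits longer between suggestions as the SII approaches the aim target, so smaller residual errors are corrected more gently.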
[0096] In some instances, the following may occur:
• At t0, the SII is within the high_target_sii and low_target_sii.
• At t1, the SII goes beyond high_target_sii. The control system is now trying to compensate by suggesting a change.
• The suggested change in the system occurs, and the control system waits.
• At t2, the SII is between high_target_sii and dec_aim_target_sii. The control system again suggests a new change.
• This suggested change occurs, and the control system waits.
• At t3, the SII is now below low_target_sii.
[0097] In the foregoing case, the last change in the system has made the system overshoot the target by time t3. To rectify this situation, in some examples block 165 will involve reacting and suggesting changes in the same way as if the SII had breached the low_target_sii from a NONE state. Therefore, in some such examples block 165 will involve transitioning to ref_gain_adj = DEC_REQ state.
[0098] The above example described a case in which the SII went from a NONE state to an INC_REQ state. A similar logic can be followed for cases in which the SII falls below the low_target_sii. In some such examples, block 165 will involve suggesting a system change. In some such examples, the control system will continue suggesting a system change in subsequent frames (with optional pauses between suggestions as described previously) until the SII is between inc_aim_target_sii and high_target_sii. Until this condition is achieved, block 165 may correspond to a ref_gain_adj = DEC_REQ state.
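The state transitions described in this section can be gathered into one sketch. This is a hypothetical reading of the ref_gain_adj state machine: the state names follow the text, but the exact transition ordering is an assumption.

```python
NONE, INC_REQ, DEC_REQ = "NONE", "INC_REQ", "DEC_REQ"


def next_state(state, sii, targets):
    """One frame of the ref_gain_adj state machine sketched from the text."""
    hi, lo = targets["high_target_sii"], targets["low_target_sii"]
    dec_aim = targets["dec_aim_target_sii"]
    inc_aim = targets["inc_aim_target_sii"]
    if state == NONE:
        if sii > hi:
            return INC_REQ        # SII too high: start suggesting changes
        if sii < lo:
            return DEC_REQ        # SII too low: start suggesting changes
        return NONE
    if state == INC_REQ:
        if sii < lo:
            return DEC_REQ        # overshot the target downward
        if lo <= sii <= dec_aim:
            return NONE           # back within the aim band
        return INC_REQ
    # state == DEC_REQ
    if sii > hi:
        return INC_REQ            # overshot the target upward
    if inc_aim <= sii <= hi:
        return NONE
    return DEC_REQ
```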
[0099] Figure 5 shows another plot of estimated SII over time. Figure 5 shows how the control system may implement block 165 when, and after, the SII falls below the low target SII according to some examples. The inc_aim_target_sii value shown in Figure 5 may be the same as the target_sii or may differ based on knowledge of how the system will react and how the SII will tend to fluctuate. For example, in some systems it may be known that the SII may be overestimated compared to the true SII due to a slow noise estimate adaptation.
In these cases, an inc_aim_target_sii that is slightly higher than the target_sii (but lower than the high_target_sii) may be preferable. Once the SII is between inc_aim_target_sii and high_target_sii (at t2), in some examples the system state will go back to the behavior described with reference to Figure 3B. In some implementations, the inc_aim_target_sii may be the same as the target_sii.
[0100] One example of a change to the system that the control system may suggest is to increase the volume. The increase in volume may be specified such that it is proportional to the difference between the current SII and a target SII. In one implementation, the control system may suggest a gain change in dB such that: target_gain_change = gain_inc_delta * (1 + inc_speed *( inc_aim_target_sii - sii))
[0101] In the foregoing equation, gain_inc_delta represents a tuning parameter. In one implementation this tuning parameter is 0.8, but in alternative implementations this tuning parameter may be 0.7, 0.75, 0.85, 0.9, etc. In the foregoing equation, inc_speed represents another tuning parameter. In one implementation this tuning parameter is 1.0, but in alternative implementations this tuning parameter may be 0.9, 0.95, 1.05, 1.1, etc.
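The increase-direction equation mirrors the decrease case and can be transcribed directly, again using the example tuning values from the text.

```python
GAIN_INC_DELTA = 0.8   # example tuning value
INC_SPEED = 1.0        # example tuning value


def target_gain_change_inc(sii, inc_aim_target_sii):
    """Suggested volume increase in dB, growing with how far the current
    SII sits below the aim target (per the equation above)."""
    return GAIN_INC_DELTA * (1.0 + INC_SPEED * (inc_aim_target_sii - sii))
```

For example, with an SII of 0.4 and an inc_aim_target_sii of 0.6, the suggested change is 0.8 * (1 + 0.2) = 0.96 dB.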
[0102] Whenever the control system suggests a change to the system, the control system may optionally wait a few frames to allow the system to stabilize before checking the current SII and suggesting a new change. The amount of time the control system waits between suggestions may be proportional to how close the SII is to the inc_aim_target_sii. In one implementation, the wait time may be determined as follows:
Wait_frames = gain_adj_holdon_frames
[0103] In the foregoing equation, gain_adj_holdon_frames represents a tuning parameter. In one implementation the tuning parameter is 96 for a 20ms block length. In other implementations, the tuning parameter may be 90, 92, 94, 98 or 100 for a 20ms block length.
[0104] In some cases, the control system may determine that the following occurs:
• At t0, the SII is within the high_target_sii and low_target_sii.
• At t1, the SII goes below the low_target_sii. The control system is now trying to compensate by suggesting a change.
• The suggested change in the system occurs, and the control system waits.
• At t2, the SII is between low_target_sii and inc_aim_target_sii. The control system again suggests a new change.
• This suggested change occurs, and the control system waits.
• At t3, the SII is now above the high_target_sii.
[0105] In this case, the last change in the system has made the system overshoot the target SII. To rectify this, block 165 may involve reacting and suggesting changes in the same way as if the SII had breached the high_target_sii from a NONE state. In some examples, block 165 will involve transitioning to a ref_gain_adj = INC_REQ state.
[0106] The description so far of how block 165 may be implemented has not addressed how the target_sii, high_target_sii and low_target_sii may be determined. According to some examples, the control system may implement an initialization process during which these values are determined. In some instances the initialization process may be initiated by a user, whereas in other instances the initialization process may be initiated when a device used to provide a teleconference is powering up. In some examples, during the initialization step, the other operations described above are not performed, and only when initialization is complete will the above operations be performed.
[0107] According to some examples, the control system may start the initialization process by determining the average SII over a time interval, which may correspond to a number of audio frames or blocks. (As used herein, the terms “audio frame” and “audio block” have the same meaning.) This period of blocks may be considered as a tuning parameter, referred to herein as “sii_max_init_counter.” In one example, this tuning parameter may be 50 blocks, whereas in other examples this tuning parameter may be 40 blocks, 45 blocks, 55 blocks, 60 blocks, etc. After determining the average SII over a time interval, in some implementations the control system will set this average SII to be the target_sii. In some such implementations, the high target SII and the low target SII may be determined as follows:
high_target_sii = target_sii + target_sii_range
low_target_sii = target_sii - target_sii_range
[0108] In the foregoing equations, target_sii_range represents a tuning parameter. In one example, this tuning parameter may be 0.15, whereas in other examples this tuning parameter may be 0.1, 0.2, etc.
[0109] During the initialization phase, there may be audio blocks in which the loudspeaker has insignificant energy. In these blocks, the SII calculation will not be indicative of the true SII of the system and therefore should not be taken into account during the averaging SII procedure. In some implementations, the control system may determine whether the energy is insignificant by checking whether the following condition is met: mono_ref_level > ref_th_alpha * ref_th
[0110] In the foregoing equation, mono_ref_level represents the energy of the loudspeaker for the current block in dB, which is an input into the control system, and ref_th_alpha represents a tuning parameter. In one example, this tuning parameter may be 4.0, whereas in other examples this tuning parameter may be 3.0, 3.5, 4.5, 5.0, etc. In the foregoing equation, ref_th represents a tuning parameter. In one example, this tuning parameter may be -30, whereas in other examples this tuning parameter may be -20, -25, -35, -40, etc.
[0111] When making the compensation, there may be a goal to ensure that the system does not go below the user’s initial volume. To support this goal, during initialization, the control system may determine the current volume of the system and then set the current volume as the minimum volume. This current volume may, for example, be stored as “usr_ref_gain.” When a decrease in the volume is suggested in block 165, in some implementations the control system may determine whether to implement this suggestion by ensuring that implementing the suggestion would not cause the volume to go below the minimum volume.
[0112] During the initialization phase, in some systems the estimated SII may ramp up and may be invalid for the first few blocks of the audio session. This phenomenon may be caused by the instability of the speech level or the noise estimate during the first few blocks of the audio session. In some implementations, the control system may ensure that the SII averaging process does not take into account the first few blocks of the audio session. In some such implementations, the control system may ignore the first 20 blocks (assuming 20ms block length), the first 22 blocks, the first 24 blocks, the first 26 blocks, the first 28 blocks, the first 30 blocks, etc.
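The initialization of the target SII parameters described above can be sketched as follows, combining the averaging window, the derived target band, and the warm-up skip. The example tuning values come from the text; the function name and the simple slicing approach are assumptions.

```python
SII_MAX_INIT_COUNTER = 50   # example number of blocks to average
TARGET_SII_RANGE = 0.15     # example target_sii_range value
WARMUP_BLOCKS = 20          # initial blocks ignored (assuming 20 ms blocks)


def init_targets(sii_per_block):
    """Average the SII over the init window, skipping the warm-up blocks,
    and derive (target_sii, high_target_sii, low_target_sii) from it."""
    usable = sii_per_block[WARMUP_BLOCKS:WARMUP_BLOCKS + SII_MAX_INIT_COUNTER]
    target_sii = sum(usable) / len(usable)
    return (target_sii,
            target_sii + TARGET_SII_RANGE,   # high_target_sii
            target_sii - TARGET_SII_RANGE)   # low_target_sii
```

A fuller sketch would also drop blocks failing the loudspeaker-energy check of paragraph [0109] before averaging; that filtering is omitted here for brevity.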
[0113] In some instances, during the implementation of block 165, the user may attempt to change the volume manually. If this occurs, the control system may optionally take the user’s attempted volume change into account. In some such examples, the control system may interpret the user’s selected volume as a new minimum volume (a new usr_ref_gain) and a new target_sii. According to some implementations, when the user interacts with the volume gain of the system, the control system may restart the initialization phase.
[0114] Figure 6 is a flow diagram that outlines another example of a method that may be performed by an apparatus or system such as those disclosed herein. The blocks of method 600, like other methods described herein, are not necessarily performed in the order indicated. In some implementations, one or more of the blocks of method 600 may be performed concurrently. Moreover, some implementations of method 600 may include more or fewer blocks than shown and/or described. The blocks of method 600 may be performed by one or more devices, which may be (or may include) a control system such as the control system 110 that is shown in Figure 1B and described above.
[0115] In this example, method 600 is a method of compensating for environmental noise during a teleconference. According to this example, block 605 involves estimating, by a control system, a current speech spectrum corresponding to speech of remote teleconference participants. In some examples, block 605 may correspond to block 150 of Figure 1C and may be performed according to the descriptions herein of block 150.
[0116] According to this example, block 610 involves estimating, by the control system, a current noise spectrum corresponding to environmental noise in a local environment in which a local teleconference participant is located. According to some examples, block 610 may correspond to block 155 of Figure 1C and may be performed according to the descriptions herein of block 155.
[0117] In this example, block 615 involves calculating, by the control system, a current speech intelligibility index (SII) based, at least in part, on the current speech spectrum and the current noise spectrum. In some examples, block 615 may correspond to block 160 of Figure 1C and may be performed according to the descriptions herein of block 160.
[0118] According to this example, block 620 involves determining, by the control system and based at least in part on the current SII, whether to make an adjustment of a local audio system used by the local teleconference participant. According to some examples, the determining may involve evaluating the current SII according to one or more target SII parameters. In some examples, the determining may involve determining whether the current SII is within a target SII range. In some examples, method 600 may involve, responsive to determining that the adjustment should be made, adjusting, by the control system, at least a portion of the local audio system. According to some examples, block 620 may correspond to block 165 of Figure 1C and may be performed according to the descriptions herein of block 165. In some such examples, block 620 may involve one or more of the example implementations of block 165 that are described with reference to Figures 3B-5.
[0119] In this example, block 625 involves updating, by the control system, at least one of the one or more target SII parameters responsive to user input corresponding to a playback volume change.
[0120] In some examples, method 600 may involve determining a confidence value corresponding to the current noise spectrum. The confidence value may indicate the likelihood of a current input audio frame corresponding mainly to ambient noise. The confidence value may correspond with the confidence metric 207 that is described herein with reference to Figures 2A and 2B. In some examples, method 600 may involve determining whether to update one or more noise statistics based, at least in part, on the confidence value.
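Confidence-gated noise tracking of this kind can be sketched as follows; the threshold and the exponential smoothing factor are illustrative assumptions, not values from the disclosure.

```python
def update_noise_stats(noise_est, frame_power, confidence,
                       threshold=0.7, alpha=0.9):
    """Sketch of confidence-gated noise statistics: the long-term
    noise estimate is updated only when the confidence that the
    current frame is mainly ambient noise exceeds a threshold."""
    if confidence >= threshold:
        # exponential smoothing toward the current frame power
        noise_est = alpha * noise_est + (1.0 - alpha) * frame_power
    return noise_est
```

The same gate can be applied per frequency band with band-based confidence values, as described in paragraph [0122].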
[0121] According to some examples, the confidence value may be a broadband confidence value. Determining whether to make the adjustment of the local audio system may be based, at least in part, on the broadband confidence value.

[0122] However, in some examples method 600 may involve determining a band-based confidence value for each frequency band of a plurality of frequency bands. Determining whether to make the adjustment of the local audio system may involve determining whether to update one or more noise statistics for each frequency band based, at least in part, on the band-based confidence value.
[0123] In some examples, estimating the current noise spectrum may involve estimating an echo coupling gain corresponding to local loudspeaker playback captured by a local microphone system. In some such examples, the echo coupling gain may be estimated by the echo coupling gain estimator 202 of Figures 2B and 3A. According to some examples, estimating the echo coupling gain may involve determining a maximum band loudspeaker reference power for each frequency band of a current audio frame and previous N-1 audio frames, where N is an integer corresponding to a number of audio frames in a delay line. In some examples, estimating the echo coupling gain may involve tracking a minimum power for each frequency band of an input microphone signal. In some examples, estimating the echo coupling gain may involve estimating an echo coupling gain for each frequency band of the input microphone signal based, at least in part, on the maximum band loudspeaker reference power and the minimum power. Some disclosed examples involve determining whether to update an estimated echo coupling gain based, at least in part, on whether a current estimated echo coupling gain value has changed more than a threshold amount from a most recent estimated echo coupling gain value, on whether an estimated echo coupling gain has been updated within a threshold time interval, or on combinations thereof. In some such examples the threshold time interval may be measured in audio frames or audio blocks.
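The echo coupling gain estimate described above can be sketched as follows. The class structure, the linear power units, and the handling of a zero reference power are assumptions for illustration; the disclosure's update gating (change thresholds and time intervals) is omitted for brevity.

```python
from collections import deque

class EchoCouplingEstimator:
    """Per-band echo coupling gain sketch: track the maximum
    loudspeaker reference power over the last N frames (the delay
    line) and the minimum microphone power per band, and take their
    ratio as the coupling gain from loudspeaker to microphone."""

    def __init__(self, num_bands, n_frames):
        self.ref_history = deque(maxlen=n_frames)        # last N reference frames
        self.min_mic_power = [float("inf")] * num_bands  # per-band minimum tracker

    def update(self, ref_band_power, mic_band_power):
        self.ref_history.append(ref_band_power)
        gains = []
        for b, mic_p in enumerate(mic_band_power):
            # maximum reference power in band b over the delay line
            max_ref = max(frame[b] for frame in self.ref_history)
            # track the minimum microphone power seen in band b
            self.min_mic_power[b] = min(self.min_mic_power[b], mic_p)
            # coupling gain: fraction of reference power reaching the mic
            gains.append(self.min_mic_power[b] / max_ref if max_ref > 0 else 0.0)
        return gains
```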
[0124] In some examples, method 600 may involve adjusting, by the control system and responsive to determining that the adjustment should be made, at least a portion of the local audio system to maintain the current SII within a target SII range. According to some examples, maintaining the current SII within a target SII range may involve increasing a loudspeaker playback volume of one or more audio frames until the current SII is greater than a first target SII and less than a high target SII. In some examples, the first target SII may be greater than a median target SII and less than the high target SII. According to some examples, maintaining the current SII within a target SII range involves decreasing a loudspeaker playback volume of one or more audio frames until the current SII is less than a second target SII and greater than a low target SII. In some such examples, the second target SII may be less than a median target SII and greater than the low target SII.
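The target-range behavior above amounts to a hysteresis controller around the SII targets. A minimal per-frame sketch, with an assumed fixed step size and a simplified trigger (adjust only when the SII leaves the low/high bounds):

```python
def adjust_volume(volume_db, current_sii, targets, step_db=1.0):
    """Sketch of SII-range volume control: raise the playback volume
    when the current SII falls to or below the low target, lower it
    when the SII reaches or exceeds the high target, and leave it
    unchanged in between. `targets` holds (low, second, median,
    first, high) target SII values, ordered as in paragraph [0124];
    the 1 dB step per frame is an assumption."""
    low, second, median, first, high = targets
    if current_sii <= low:
        return volume_db + step_db   # intelligibility too low: turn up
    if current_sii >= high:
        return volume_db - step_db   # louder than needed: turn down
    return volume_db                 # within range: leave the volume alone
```

Applied over successive audio frames, the increase continues until the SII settles between the first and high targets, and the decrease until it settles between the low and second targets.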
[0125] Following are some enumerated example embodiments (EEEs):

[0126] EEE1. A method of compensating for environmental noise during a teleconference, the method comprising: estimating, by a control system, a current speech spectrum corresponding to speech of remote teleconference participants; estimating, by the control system, a current noise spectrum corresponding to environmental noise in a local environment in which a local teleconference participant is located; calculating, by the control system, a current speech intelligibility index (SII) based, at least in part, on the current speech spectrum and the current noise spectrum; determining a confidence value corresponding to the current noise spectrum, the confidence value indicating a likelihood of a current input audio frame corresponding mainly to ambient noise; and determining, by the control system and based at least in part on the current SII and the confidence value, whether to make an adjustment of a local audio system used by the local teleconference participant, wherein the determining involves evaluating the current SII according to one or more target SII parameters.
[0127] EEE2. The method of EEE1, wherein the determining involves determining whether the current SII is within a target SII range.
[0128] EEE3. The method of EEE1 or EEE2, further comprising, responsive to determining that the adjustment should be made, adjusting, by the control system, at least a portion of the local audio system.
[0129] EEE4. The method of any one of EEEs 1-3, further comprising updating at least one of the one or more target SII parameters responsive to user input corresponding to a playback volume change.
[0130] EEE5. The method of EEE1, wherein the confidence value is a broadband confidence value and wherein determining whether to make the adjustment of the local audio system is based, at least in part, on the broadband confidence value.

[0131] EEE6. The method of EEE4 or EEE5, further comprising determining whether to update one or more noise statistics based, at least in part, on the confidence value.
[0132] EEE7. The method of EEE1, further comprising determining a band-based confidence value for each frequency band of a plurality of frequency bands and wherein determining whether to make the adjustment of the local audio system involves determining whether to update one or more noise statistics for each frequency band based, at least in part, on the band-based confidence value.

[0133] EEE8. The method of any one of EEEs 1-7, wherein estimating the current noise spectrum involves estimating an echo coupling gain corresponding to local loudspeaker playback captured by a local microphone system.
[0134] EEE9. The method of EEE8, wherein estimating the echo coupling gain involves: determining a maximum band loudspeaker reference power for each frequency band of a current audio frame and previous N-1 audio frames, where N is an integer corresponding to a number of audio frames in a delay line; tracking a minimum power for each frequency band of an input microphone signal; and estimating an echo coupling gain for each frequency band of the input microphone signal based, at least in part, on the maximum band loudspeaker reference power and the minimum power.
[0135] EEE10. The method of EEE9, further comprising determining whether to update an estimated echo coupling gain based, at least in part, on whether a current estimated echo coupling gain value has changed more than a threshold amount from a most recent estimated echo coupling gain value, on whether an estimated echo coupling gain has been updated within a threshold time interval, or on combinations thereof.
[0136] EEE11. The method of any one of EEEs 1-10, further comprising, responsive to determining that the adjustment should be made, adjusting, by the control system, at least a portion of the local audio system to maintain the current SII within a target SII range.
[0137] EEE12. The method of EEE11, wherein maintaining the current SII within a target SII range involves increasing a loudspeaker playback volume of one or more audio frames until the current SII is greater than a first target SII and less than a high target SII.
[0138] EEE13. The method of EEE12, wherein the first target SII is greater than a median target SII and less than the high target SII.
[0139] EEE14. The method of EEE12 or EEE13, wherein maintaining the current SII within a target SII range involves decreasing a loudspeaker playback volume of one or more audio frames until the current SII is less than a second target SII and greater than a low target SII.
[0140] EEE15. The method of EEE14, wherein the second target SII is less than a median target SII and greater than the low target SII.
[0141] EEE16. An apparatus configured to perform the method of any one of EEEs 1-15.
[0142] EEE17. A system configured to perform the method of any one of EEEs 1-15.
[0143] EEE18. One or more non-transitory media having instructions stored thereon for controlling one or more devices to perform the method of any one of EEEs 1-15.
[0144] Some aspects of the present disclosure include a system or device configured (e.g., programmed) to perform one or more examples of the disclosed methods, and a tangible computer readable medium (e.g., a disc) which stores code for implementing one or more examples of the disclosed methods or steps thereof. For example, some disclosed systems can be or include a programmable general purpose processor, digital signal processor, or microprocessor, programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including an embodiment of disclosed methods or steps thereof. Such a general purpose processor may be or include a computer system including an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform one or more examples of the disclosed methods (or steps thereof) in response to data asserted thereto.
[0145] Some embodiments may be implemented as a configurable (e.g., programmable) digital signal processor (DSP) that is configured (e.g., programmed and otherwise configured) to perform required processing on audio signal(s), including performance of one or more examples of the disclosed methods. Alternatively, embodiments of the disclosed systems (or elements thereof) may be implemented as a general purpose processor (e.g., a personal computer (PC) or other computer system or microprocessor, which may include an input device and a memory) which is programmed with software or firmware and/or otherwise configured to perform any of a variety of operations including one or more examples of the disclosed methods. Alternatively, elements of some embodiments of the inventive system are implemented as a general purpose processor or DSP configured (e.g., programmed) to perform one or more examples of the disclosed methods, and the system also includes other elements (e.g., one or more loudspeakers and/or one or more microphones). A general purpose processor configured to perform one or more examples of the disclosed methods may be coupled to an input device (e.g., a mouse and/or a keyboard), a memory, and a display device.
[0146] Another aspect of the present disclosure is a computer readable medium (for example, a disc or other tangible storage medium) which stores code for performing (e.g., code executable to perform) one or more examples of the disclosed methods or steps thereof.
[0147] While specific embodiments and applications of the disclosure have been described herein, it will be apparent to those of ordinary skill in the art that many variations on the embodiments and applications described herein are possible without departing from the scope of this disclosure.

CLAIMS

What Is Claimed Is:
1. A method of compensating for environmental noise during a teleconference, the method comprising: estimating, by a control system, a current speech spectrum corresponding to speech of remote teleconference participants; estimating, by the control system, a current noise spectrum corresponding to environmental noise in a local environment in which a local teleconference participant is located; calculating, by the control system, a current speech intelligibility index (SII) based, at least in part, on the current speech spectrum and the current noise spectrum; determining, by the control system and based at least in part on the current SII, whether to make an adjustment of a local audio system used by the local teleconference participant, wherein the determining involves evaluating the current SII according to one or more target SII parameters; and updating, by the control system, at least one of the one or more target SII parameters responsive to user input corresponding to a playback volume change.
2. The method of claim 1, wherein the determining involves determining whether the current SII is within a target SII range.
3. The method of claim 1 or claim 2, further comprising, responsive to determining that the adjustment should be made, adjusting, by the control system, at least a portion of the local audio system.
4. The method of any one of claims 1-3, further comprising determining a confidence value corresponding to the current noise spectrum, the confidence value indicating a likelihood of a current input audio frame corresponding mainly to ambient noise.
5. The method of claim 4, wherein the confidence value is a broadband confidence value and wherein determining whether to make the adjustment of the local audio system is based, at least in part, on the broadband confidence value.
6. The method of claim 4 or claim 5, further comprising determining whether to update one or more noise statistics based, at least in part, on the confidence value.
7. The method of claim 4, further comprising determining a band-based confidence value for each frequency band of a plurality of frequency bands and wherein determining whether to make the adjustment of the local audio system involves determining whether to update one or more noise statistics for each frequency band based, at least in part, on the band-based confidence value.
8. The method of any one of claims 1-7, wherein estimating the current noise spectrum involves estimating an echo coupling gain corresponding to local loudspeaker playback captured by a local microphone system.
9. The method of claim 8, wherein estimating the echo coupling gain involves: determining a maximum band loudspeaker reference power for each frequency band of a current audio frame and previous N-1 audio frames, where N is an integer corresponding to a number of audio frames in a delay line; tracking a minimum power for each frequency band of an input microphone signal; and estimating an echo coupling gain for each frequency band of the input microphone signal based, at least in part, on the maximum band loudspeaker reference power and the minimum power.
10. The method of claim 9, further comprising determining whether to update an estimated echo coupling gain based, at least in part, on whether a current estimated echo coupling gain value has changed more than a threshold amount from a most recent estimated echo coupling gain value, on whether an estimated echo coupling gain has been updated within a threshold time interval, or on combinations thereof.
11. The method of any one of claims 1-10, further comprising, responsive to determining that the adjustment should be made, adjusting, by the control system, at least a portion of the local audio system to maintain the current SII within a target SII range.
12. The method of claim 11, wherein maintaining the current SII within a target SII range involves increasing a loudspeaker playback volume of one or more audio frames until the current SII is greater than a first target SII and less than a high target SII.
13. The method of claim 12, wherein the first target SII is greater than a median target SII and less than the high target SII.
14. The method of claim 12 or claim 13, wherein maintaining the current SII within a target SII range involves decreasing a loudspeaker playback volume of one or more audio frames until the current SII is less than a second target SII and greater than a low target SII.
15. The method of claim 14, wherein the second target SII is less than a median target SII and greater than the low target SII.
16. An apparatus configured to perform the method of any one of claims 1-15.
17. A system configured to perform the method of any one of claims 1-15.
18. One or more non-transitory media having instructions stored thereon for controlling one or more devices to perform the method of any one of claims 1-15.
PCT/US2024/036469 2023-07-07 2024-07-01 Environmental noise compensation in teleconferencing Pending WO2025014685A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202363512424P 2023-07-07 2023-07-07
US63/512,424 2023-07-07
US202463635570P 2024-04-17 2024-04-17
US63/635,570 2024-04-17

Publications (1)

Publication Number Publication Date
WO2025014685A1 true WO2025014685A1 (en) 2025-01-16

Family

ID=91953787


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120121096A1 (en) * 2010-11-12 2012-05-17 Apple Inc. Intelligibility control using ambient noise detection
US20170140772A1 (en) * 2015-11-18 2017-05-18 Gwangju Institute Of Science And Technology Method of enhancing speech using variable power budget
US20210174821A1 (en) * 2019-12-09 2021-06-10 Samsung Electronics Co., Ltd. Electronic apparatus and controlling method thereof



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24743629

Country of ref document: EP

Kind code of ref document: A1