
WO2021114779A1 - Echo cancellation method, apparatus and system using double-talk detection - Google Patents


Info

Publication number
WO2021114779A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
input sound
utterance
sound signal
double
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2020/114168
Other languages
English (en)
Chinese (zh)
Inventor
潘思伟
罗本彪
雍雅琴
董斐
林福辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Spreadtrum Communications Shanghai Co Ltd
Original Assignee
Spreadtrum Communications Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Spreadtrum Communications Shanghai Co Ltd
Publication of WO2021114779A1
Anticipated expiration
Ceased

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M9/00 Arrangements for interconnection not involving centralised switching
    • H04M9/08 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • H04M9/082 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic using echo cancellers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M9/00 Arrangements for interconnection not involving centralised switching
    • H04M9/08 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech

Definitions

  • This application relates to the field of voice communication, and in particular to an echo cancellation method, device and system based on double-ended voice detection.
  • Acoustic echo arises from coupling between the loudspeaker and the terminal microphone, so that the microphone signal contains echo in addition to the useful voice signal. If the microphone signal is not processed, the echo is transmitted together with the near-end voice to the far-end speaker, and the far-end caller hears a delayed copy of their own voice, which is uncomfortable and degrades the call. When the echo is loud, the call may not be able to proceed normally. Effective measures must therefore be taken to suppress the echo in order to improve the quality of voice communication.
  • Echo cancellation has been an engineering problem ever since Bell invented the telephone.
  • As communication methods and application scenarios have diversified and communication terminals have become more compact, the coupling between speakers and microphones has grown stronger and the echo channel has become more complex and changeable. This poses a great challenge for acoustic echo cancellation in voice communication systems.
  • Acoustic echo is generally produced in hands-free communication systems. It results from the propagation of sound waves and can generally be divided into two cases: direct echo and indirect echo.
  • Direct echo means that the sound played by the speaker enters the microphone directly, without any reflection, and is picked up. This echo has the shortest delay and depends on factors such as the voice energy of the far-end talker, the distance and angle between the speaker and the microphone, the playback volume of the speaker, and the pickup sensitivity of the microphone.
  • Indirect echo refers to the set of echoes produced when the sound played by the speaker enters the microphone after being reflected one or more times along different paths. This echo is characterized by a long delay, large delay jitter, and an echo level that is strongly affected by the environment.
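  • The distinction between direct and indirect echo can be pictured as a set of delayed, attenuated copies of the loudspeaker signal. The following is a minimal sketch only: the function name `simulate_echo` and all delays and gains are hypothetical values chosen for illustration, not figures from this application.

```python
def simulate_echo(far_end, direct=(2, 0.6), reflections=((15, 0.25), (40, 0.1))):
    """Sum delayed, scaled copies of the far-end signal.

    Each (delay_in_samples, gain) pair models one acoustic path:
    `direct` is the short unreflected path, `reflections` are longer,
    weaker paths that reach the microphone after bouncing off the room.
    """
    echo = [0.0] * len(far_end)
    for delay, gain in (direct, *reflections):
        for n in range(delay, len(far_end)):
            echo[n] += gain * far_end[n - delay]
    return echo


# A single click from the loudspeaker shows each path as a separate arrival.
impulse = [1.0] + [0.0] * 49
echo = simulate_echo(impulse)
```

  • In this toy model the direct path arrives first and strongest, while the reflected paths arrive later and weaker, which matches the qualitative description above.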
  • An adaptive acoustic echo canceller (AEC) is usually used to cancel the echo.
  • The basic principle of AEC can be summarized as adaptively estimating the echo and subtracting the estimated echo from the signal picked up by the microphone.
  • AEC can prevent echo from disturbing the callers; in a hands-free phone, AEC can minimize the echo.
  • The echo cancellation effect of AEC can meet current needs; however, when there is obvious near-end sound, the performance of AEC based on the various existing adaptive filtering algorithms deteriorates, and convergence of the adaptive filtering algorithm cannot even be guaranteed.
  • A typical application of a double-talk detector (DTD) is to freeze AEC updates during double-talk periods to prevent the adaptive filtering algorithm from diverging.
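  • The principle of estimating the echo adaptively, subtracting it from the microphone signal, and freezing updates on request can be sketched with a normalized LMS (NLMS) filter. NLMS is a common textbook choice and is used here as an assumption; the application does not name a specific adaptive algorithm. The function name, tap count, step size, and the simulated echo path are all hypothetical.

```python
import random


def nlms_echo_canceller(far_end, mic, taps=8, mu=0.5, eps=1e-8, freeze=None):
    """Return e[n] = mic[n] - estimated_echo[n] for every sample.

    far_end: reference samples played by the loudspeaker
    mic:     samples picked up by the microphone (echo plus any near-end speech)
    freeze:  optional list of booleans; True pauses adaptation at that sample,
             as a double-talk detector might request
    """
    w = [0.0] * taps      # adaptive estimate of the echo path
    buf = [0.0] * taps    # most recent far-end samples, newest first
    out = []
    for n, (x, d) in enumerate(zip(far_end, mic)):
        buf = [x] + buf[:-1]
        echo_hat = sum(wi * bi for wi, bi in zip(w, buf))
        e = d - echo_hat  # near-end speech estimate
        out.append(e)
        if not (freeze and freeze[n]):
            norm = sum(b * b for b in buf) + eps
            step = mu * e / norm
            w = [wi + step * bi for wi, bi in zip(w, buf)]
    return out


random.seed(0)
far = [random.uniform(-1.0, 1.0) for _ in range(4000)]
path = [0.6, 0.3, -0.1]   # hypothetical echo path
mic = [sum(h * far[n - i] for i, h in enumerate(path) if n - i >= 0)
       for n in range(len(far))]
residual = nlms_echo_canceller(far, mic)
```

  • With no near-end speech the residual shrinks as the filter converges. Passing `freeze` flags for frames judged to contain double talk keeps near-end speech from pulling the coefficients away from the echo path.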
  • Double-ended utterance detection algorithms may specifically include energy-based algorithms, algorithms based on signal correlation characteristics, and algorithms based on spectral characteristics.
  • These double-ended utterance detection algorithms all rely on the selection of a fixed threshold, and the utterance state is judged by comparing the calculated statistic with the threshold.
  • The fixed-threshold method cannot accurately detect the double-ended utterance state. This not only affects the robustness of echo cancellation, but also causes severe clipping of the sound during subsequent processing, that is, the sound transmitted to the remote user becomes intermittent.
  • The main influencing factor in hands-free communication equipment is the signal-to-echo ratio (SER) of the signal received by the microphone, that is, the amplitude (power) ratio of the near-end voice received by the microphone to the echo signal received from the speaker.
  • The SER at the microphone is usually low during hands-free calls, and it changes with the distance between the microphone and the near-end talker, the volume of the near-end talker, and the level of the echo. As a result, the traditional double-ended utterance detection algorithm based on a fixed threshold often fails, and it is difficult to balance duplex performance and echo removal in hands-free calling.
  • The echo cancellation technology in the prior art cannot accurately filter out echo interference under double-ended utterance, especially in hands-free calls and conference calls, and call quality is easily affected.
  • The technical problem solved by this application is how to better eliminate echo and improve the duplex call experience of a hands-free voice communication terminal.
  • Embodiments of the present application provide an echo cancellation method, device, and system based on double-ended utterance detection. The method may include: acquiring an input sound signal from a sound collection device; performing adaptive filtering on the input sound signal to obtain a near-end speech estimation signal; determining the current utterance state according to the near-end speech estimation signal; obtaining a preset mapping relationship between utterance states and processing modes, and obtaining from the mapping relationship the processing mode corresponding to the current utterance state; processing the near-end speech estimation signal according to the processing mode; and outputting the processed near-end speech estimation signal to obtain an output signal.
  • Determining the current utterance state according to the near-end speech estimation signal includes: calculating the average value of the double-ended utterance state statistic of the current frame in the input sound signal according to the input sound signal and the near-end speech estimation signal; obtaining the double-talk judgment threshold corresponding to the current frame, the double-talk judgment threshold being obtained according to the signal-to-echo ratio (SER) of the input sound signal and the near-end interference signal; and determining the current utterance state according to the magnitude relationship between the average value of the double-ended utterance state statistic of the current frame and the double-talk judgment threshold.
  • Calculating the average value of the double-ended utterance state statistic of the current frame in the input sound signal according to the input sound signal and the near-end speech estimation signal includes computing it by a formula whose terms are: the average value of the double-ended utterance state statistic of the current frame; the power of the near-end speech estimation signal at the n-th sample point of the k-th frame; the power of the input sound signal at the n-th sample point of the k-th frame; and an operator denoting the average of the values in brackets.
  • Obtaining the double-talk judgment threshold corresponding to the current frame includes: estimating the SER of the input sound signal in real time to obtain the average SER of the current frame in the input sound signal; obtaining multiple preset SER thresholds and constructing multiple SER intervals from them; and determining the SER interval to which the average SER of the current frame belongs, and taking the double-talk judgment threshold corresponding to that interval as the double-talk judgment threshold of the current frame.
  • Estimating the SER of the input sound signal in real time to obtain the average SER of the current frame in the input sound signal includes: acquiring a near-end interference signal, the near-end interference signal being the sound signal generated by the sound-producing device at the same end as the sound collection device; and calculating the average SER of the current frame in the input sound signal by a formula whose terms are: the estimated average SER of the k-th frame, in dB; the power of the input sound signal at the n-th sample point of the k-th frame; the power of the near-end interference signal at the n-th sample point of the k-th frame; and an operator denoting the average of the values in brackets.
  • Obtaining multiple preset SER thresholds and constructing multiple SER intervals from them includes: using each pair of adjacent thresholds among the acquired SER thresholds as the boundary values of one SER interval, to obtain multiple SER intervals.
  • The utterance state includes two states: far-end utterance only, and not only far-end utterance. The preset mapping relationship between utterance states and processing modes includes: when the utterance state is far-end utterance only, zeroing the near-end speech estimation signal or suppressing it until it is inaudible; when the utterance state is not only far-end utterance, retaining the near-end speech estimation signal.
  • The state "not only far-end utterance" covers two states: near-end utterance only and double-ended utterance.
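  • The preset mapping between utterance states and processing modes lends itself to a lookup table. The following is a minimal sketch; the names `UtteranceState`, `suppress`, `retain`, and `PROCESSING_BY_STATE` are hypothetical, and zeroing is used for the far-end-only case although suppression to an inaudible level would serve equally.

```python
from enum import Enum


class UtteranceState(Enum):
    FAR_END_ONLY = "far_end_only"
    NOT_FAR_END_ONLY = "not_far_end_only"   # near-end only or double-ended


def suppress(frame):
    """Zero the near-end speech estimate (far-end utterance only)."""
    return [0.0] * len(frame)


def retain(frame):
    """Keep the near-end speech estimate unchanged."""
    return list(frame)


# Preset mapping: utterance state -> processing mode.
PROCESSING_BY_STATE = {
    UtteranceState.FAR_END_ONLY: suppress,
    UtteranceState.NOT_FAR_END_ONLY: retain,
}


def process_frame(frame, state):
    """Look up the processing mode for `state` and apply it to `frame`."""
    return PROCESSING_BY_STATE[state](frame)
```

  • Keeping the mapping in a table means further utterance states or processing modes can be added without touching the dispatch logic.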
  • Performing adaptive filtering on the input sound signal to obtain the near-end speech estimation signal includes: performing linear filtering and non-linear filtering on the input sound signal to obtain the near-end speech estimation signal.
  • An embodiment of the present application also provides an echo cancellation device based on double-ended utterance detection.
  • The device includes: an input sound signal acquisition module for acquiring an input sound signal from a sound collection device; a filtering module for adaptively filtering the input sound signal to obtain the near-end speech estimation signal; a current utterance state determination module for determining the current utterance state according to the near-end speech estimation signal; a processing mode acquisition module for obtaining the preset mapping relationship between utterance states and processing modes and obtaining from it the processing mode corresponding to the current utterance state; a near-end processing module for processing the near-end speech estimation signal according to the processing mode; and an output module for outputting the processed near-end speech estimation signal to obtain an output signal.
  • An embodiment of the present application also provides an echo cancellation system based on double-ended utterance detection, including a sound collection device, a same-end sound-producing device, and an echo cancellation device, where the echo cancellation device executes the steps of any one of the above methods.
  • An embodiment of the present application provides an echo cancellation method based on double-ended utterance detection.
  • The method includes: acquiring an input sound signal from a sound collection device; adaptively filtering the input sound signal to obtain a near-end speech estimation signal; determining the current utterance state according to the near-end speech estimation signal; obtaining the preset mapping relationship between utterance states and processing modes, and obtaining from it the processing mode corresponding to the current utterance state; processing the near-end speech estimation signal according to the processing mode; and outputting the processed near-end speech estimation signal to obtain an output signal.
  • Unlike existing communication schemes, in which the input sound signal of a voice call such as a telephone call is transmitted to the peer device directly or after adaptive echo cancellation only, this technical solution customizes different processing modes for the different utterance states of the input sound signal and, by exploiting the characteristics of double-ended utterance, accurately filters out the echo in the input sound signal. Call quality can be significantly improved, especially in call systems that are strongly affected by double-ended utterance, such as hands-free calls and voice conferences.
  • The utterance state is judged in real time for each frame of the input sound signal so that the processing mode of the near-end speech estimation signal is updated in real time. The input sound signal can thus be echo cancelled accurately and completely, and the stability of the call is ensured.
  • The signal-to-echo ratio (SER) of the input sound signal, with the near-end interference signal as the echo source, is calculated in real time by sampling, and different double-talk judgment thresholds are set according to how strongly the near-end interference signal affects the input sound signal. The current utterance state can thus be determined more accurately, improving the accuracy of echo cancellation for the input sound signal.
  • Two utterance states are defined, and the processing modes corresponding to them are specified, which basically meets the requirements of real-time echo cancellation in common voice calls.
  • The adaptive filtering of the input sound signal includes both linear filtering and non-linear filtering, which further suppresses the echo in the input sound signal.
  • The echo cancellation system based on double-ended utterance detection provided by the embodiments of the present application can detect in real time the acoustic echo generated during communication and eliminate it according to the detection result, so that the echo cancellation effect is improved when the voice communication terminal is in hands-free mode, improving call quality.
  • The echo cancellation method, device, and system based on double-ended utterance detection provided in the embodiments of the present application can distinguish in real time between far-end utterance only and near-end utterance only or double-ended utterance. In the far-end-only case the time-domain output is zeroed or suppressed until it is inaudible, so that echo is eliminated to the greatest extent while duplex call performance is preserved, improving echo cancellation and duplex performance at the same time.
  • The purpose is to improve the duplex call experience of the hands-free voice communication terminal.
  • FIG. 1 is a schematic flowchart of an echo cancellation method based on double-ended utterance detection according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of the application of an echo cancellation method based on double-ended utterance detection according to an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of step S103 in FIG. 1 in an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of step S302 in FIG. 3 in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of signal-to-echo ratio (SER) intervals according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of an echo cancellation device based on double-ended utterance detection according to an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an echo cancellation system based on double-ended utterance detection according to an embodiment of the present application.
  • As noted in the background, the echo cancellation technology in the prior art cannot accurately filter out echo interference under double-ended utterance, especially in hands-free calls and conference calls, and call quality is easily affected.
  • To address this, an embodiment of the present application provides an echo cancellation method based on double-ended utterance detection.
  • The method includes: acquiring an input sound signal from a sound collection device; performing adaptive filtering on the input sound signal to obtain a near-end speech estimation signal; determining the current utterance state according to the near-end speech estimation signal; obtaining the preset mapping relationship between utterance states and processing modes, and obtaining from it the processing mode corresponding to the current utterance state; processing the near-end speech estimation signal according to the processing mode; and outputting the processed near-end speech estimation signal to obtain an output signal.
  • The solution described in this embodiment can filter out the interference signal under double-ended utterance and significantly improve call quality.
  • Figure 1 provides a schematic flow chart of an echo cancellation method based on double-ended vocalization detection; the method may specifically include the following steps:
  • S101 Obtain an input sound signal from a sound collection device.
  • The input sound signal is the sound signal collected by the sound collection device.
  • The sound collection device may be a microphone or another device; for a telephone or telephone-like call, it is the sound collection device built into a terminal such as a mobile phone, a landline, or a computer.
  • A terminal such as a telephone collects local sound in real time through the sound collection device and transmits it to the opposite end of the call over the communication line.
  • After the sound collection device at the local end collects the input sound signal, the signal is not transmitted directly to the call peer. Instead, it is echo cancelled through the following steps S102 to S106 to improve the quality of the voice call.
  • S102 Perform adaptive filtering on the input sound signal to obtain a near-end speech estimation signal.
  • After the input sound signal is acquired from the sound collection device, it is filtered to remove the echo signal generated at the local end that interferes with the normal call, yielding the near-end speech estimation signal.
  • The adaptive filtering can use an adaptive echo canceller (AEC) to filter the input sound signal and obtain the near-end speech estimation signal.
  • S103 Determine the current utterance state according to the near-end speech estimation signal.
  • The utterance state can include states such as far-end utterance only, double-ended utterance, and near-end utterance only.
  • Each utterance state corresponds to a different processing mode for the obtained near-end speech estimation signal and can be set as needed; the utterance states are not limited to the examples above.
  • Determining the current utterance state means determining, in real time, the utterance state of the near-end speech estimation signal obtained this time.
  • The corresponding utterance state can be determined from attributes of the speech signal such as its waveform and channel.
  • S104 Obtain a preset mapping relationship between the utterance state and the processing mode, and obtain the processing mode corresponding to the current utterance state according to the mapping relationship.
  • The processing mode is the treatment applied to the near-end speech estimation signal in each utterance state, and may include setting the near-end speech estimation signal to zero (0), retaining it in full, or retaining part of it.
  • The mapping relationship between utterance states and processing modes can be set in advance. After the current utterance state is determined, the corresponding processing mode can be obtained automatically from the mapping relationship.
  • S105 Process the near-end speech estimation signal according to the processing mode.
  • Once the processing mode is obtained, the near-end speech estimation signal is processed according to it.
  • S106 Output the processed near-end speech estimation signal to obtain an output signal.
  • The processed near-end speech estimation signal correctly reflects the call information of the local end, and this output signal can be transmitted to the call peer through the communication link.
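  • Steps S102 to S106 can be read as a per-frame pipeline. The following is a minimal sketch in which every stage is passed in as a callable; `cancel_echo_frame` and the toy stand-ins below it are hypothetical and exist only to show how the stages connect, not how any stage is actually implemented.

```python
def cancel_echo_frame(mic_frame, adaptive_filter, detect_state, processing_by_state):
    """Run one frame through steps S102 to S106."""
    near_end_estimate = adaptive_filter(mic_frame)        # S102: adaptive filtering
    state = detect_state(mic_frame, near_end_estimate)    # S103: utterance state
    process = processing_by_state[state]                  # S104: look up processing mode
    return process(near_end_estimate)                     # S105/S106: process and output


# Toy stand-ins so the pipeline can be exercised end to end.
def toy_filter(frame):
    # Pretend the filter removes half of each sample as echo.
    return [0.5 * s for s in frame]


def toy_detector(mic_frame, estimate):
    # Treat a very quiet estimate as "only the far end is talking".
    quiet = max(abs(s) for s in estimate) < 0.1
    return "far_end_only" if quiet else "not_far_end_only"


toy_modes = {
    "far_end_only": lambda est: [0.0] * len(est),
    "not_far_end_only": lambda est: list(est),
}

quiet_out = cancel_echo_frame([0.1, -0.1], toy_filter, toy_detector, toy_modes)
loud_out = cancel_echo_frame([1.0, -1.0], toy_filter, toy_detector, toy_modes)
```

  • Passing the stages as arguments keeps the frame loop independent of the particular filter and detector, which mirrors the way the method separates filtering, state detection, and processing.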
  • Unlike existing communication schemes, in which the input sound signal of a voice call such as a telephone call is transmitted to the opposite device directly or after adaptive echo cancellation only, different processing modes can be customized for the different utterance states of the input sound signal, and the interference or echo in the input sound signal can be accurately filtered out by exploiting the characteristics of double-ended utterance. Call quality can be significantly improved.
  • FIG. 2 provides a schematic diagram of the application of an echo cancellation method based on double-ended utterance detection. In the application scenario shown in FIG. 2, the parties to the call are a far-end device 200 and a near-end device 210, where the far-end device 200 includes a far-end microphone 201 and a far-end speaker 202, and the near-end device 210 includes a near-end speaker 203 and a near-end microphone 204.
  • The far-end microphone 201 sends the downlink signal S1 to the near-end speaker 203. The direct echo S2 is the sound signal emitted by the near-end speaker 203 and picked up directly by the near-end microphone 204. The indirect echo S3 is the sound signal emitted by the near-end speaker 203 that is reflected by the environment and picked up indirectly by the near-end microphone 204.
  • While the near-end microphone 204 is picking up the echoes (direct echo S2 and indirect echo S3), a person (not shown) speaks into it (marked "voice" in the figure). The near-end microphone 204 picks up the voice and generates an uplink signal S4, which is sent to the far-end speaker 202 to be played.
  • The echo cancellation method based on double-ended utterance detection in FIG. 1 can be applied on the near-end microphone 204 side in FIG. 2: before the input sound signal obtained by the near-end microphone 204 (that is, the sound signal obtained from the voice in FIG. 2) is sent to the far-end device 200, it is first processed by the echo cancellation method of FIG. 1.
  • Step S103 in FIG. 1, determining the current utterance state according to the near-end speech estimation signal, may specifically include steps S301 to S303 in FIG. 3.
  • S301 Calculate the average value of the double-ended utterance state statistic of the current frame in the input sound signal according to the input sound signal and the near-end speech estimation signal.
  • The double-ended utterance state statistic of the current frame takes the current frame of the input sound signal as the reference point: the input sound signal and the near-end speech estimation signal before the reference point are each sampled, and the samples are compared to compute a value that reflects the current utterance state of the input sound signal.
  • The average value is the average of the double-ended utterance state statistic over several sampling points.
  • The average value of the double-ended utterance state statistic of the current frame may be obtained by feeding the input sound signal and the near-end speech estimation signal into the double-ended utterance detector.
  • S302 Acquire a double-talk judgment threshold corresponding to the current frame, where the double-talk judgment threshold is obtained according to the signal-to-echo ratio (SER) of the input sound signal and the near-end interference signal.
  • The SER of the input sound signal is the energy ratio between the signal and the echo in the input sound signal, and it is obtained by calculation from the input sound signal.
  • The near-end interference signal is the interference that the sound produced by the same-end sound-producing device corresponding to the sound collection device causes at the microphone; it can be obtained from that sound-producing device.
  • The sound-producing device can be, for example, the speaker corresponding to the local microphone in telephone communication.
  • The double-talk judgment threshold is the threshold used to determine which utterance state the average value of the double-ended utterance state statistic of the current frame corresponds to. Multiple utterance-state thresholds, that is, double-talk judgment thresholds, are set for the average value of the double-ended utterance state statistic.
  • The double-talk judgment threshold is set based on two factors: the SER of the input sound signal and the near-end interference signal.
  • S303 Determine the current utterance state according to the magnitude relationship between the average value of the double-ended utterance state statistic of the current frame and the double-talk judgment threshold.
  • According to the relationship between the average value of the double-ended utterance state statistic of the current frame in the input sound signal and the double-talk judgment threshold obtained in step S302, it is determined within which utterance state's threshold interval the average value lies, and thereby the current utterance state.
  • The utterance state is judged in real time for each frame of the input sound signal so that the processing mode of the near-end speech estimation signal is updated in real time. The input sound signal can thus be echo cancelled accurately and completely, ensuring the stability of the call.
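  • Steps S301 and S303 can be illustrated with a small sketch. It rests on two loudly stated assumptions that this text does not confirm: that the statistic is the per-sample ratio of near-end estimate power to input power, averaged over the frame, and that a value below the threshold means only the far end is talking. All function names are hypothetical.

```python
def frame_power(samples):
    """Instantaneous power of each sample in a frame."""
    return [s * s for s in samples]


def dtd_statistic_mean(p_est, p_mic, eps=1e-12):
    """Assumed form of S301: mean over the frame of p_est / p_mic.

    p_est: power of the near-end speech estimate at each sample point
    p_mic: power of the input sound signal at each sample point
    eps:   small constant that avoids division by zero on silent samples
    """
    ratios = [pe / (pm + eps) for pe, pm in zip(p_est, p_mic)]
    return sum(ratios) / len(ratios)


def classify_frame(stat_mean, dtd_threshold):
    """Assumed form of S303: compare the averaged statistic with the threshold."""
    if stat_mean < dtd_threshold:
        return "far_end_only"
    return "not_far_end_only"


mic = [0.5, -0.4, 0.3, -0.2]
echo_only_estimate = [0.01 * s for s in mic]   # filter removed almost everything
near_end_estimate = [0.9 * s for s in mic]     # most of the input survived filtering

echo_stat = dtd_statistic_mean(frame_power(echo_only_estimate), frame_power(mic))
speech_stat = dtd_statistic_mean(frame_power(near_end_estimate), frame_power(mic))
```

  • Under these assumptions, a frame in which the adaptive filter removes almost all of the input yields a small statistic, while a frame that keeps most of its energy after filtering yields a value close to one.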
  • In step S301, the average value of the double-ended utterance state statistic of the current frame in the input sound signal is calculated by a formula over the power of the near-end speech estimation signal and the power of the input sound signal at each sample point of the current frame.
  • Step S302 in FIG. 3, obtaining the double-talk judgment threshold corresponding to the current frame, where the threshold is obtained according to the signal-to-echo ratio (SER) of the input sound signal and the near-end interference signal, may include steps S401 to S403 in FIG. 4, where:
  • S401 Estimate the signal-to-echo ratio (SER) of the input sound signal in real time to obtain the average SER of the current frame in the input sound signal.
  • S402 Acquire multiple preset SER thresholds, and construct multiple SER intervals according to the multiple thresholds.
  • The preset thresholds are values determined from experience or by technical personnel.
  • The thresholds serve as boundary values that delimit multiple SER intervals.
  • A corresponding double-talk judgment threshold is set for each SER interval.
  • S403 Determine the SER interval to which the average SER of the current frame belongs, and take the double-talk judgment threshold corresponding to that SER interval as the double-talk judgment threshold of the current frame.
  • The SER of the input sound signal, with the near-end interference signal as the echo source, is calculated in real time by sampling. Different double-talk judgment thresholds are set for different SER intervals, so that the current utterance state is determined more accurately and the accuracy of echo cancellation for the input sound signal is improved.
  • The average signal-to-echo ratio of the current frame in step S401 in FIG. 4 is calculated as follows:
  • The near-end interference signal is a sound signal generated by a sound-producing device at the same end as the sound collection device.
  • P_m(k, n) represents the power of the input sound signal at the n-th sample point of the k-th frame.
  • P_x(k, n) represents the power of the near-end interference signal at the n-th sample point of the k-th frame.
  • mean() represents the average of the values in the brackets.
  • P_m(k, n) and P_x(k, n) are the power values of the sample points obtained by sampling the input sound signal and the near-end interference signal, respectively, frame by frame.
  • The sampling process is: acquire n sample points from the input sound signal and from the near-end interference signal, where each sample point belongs to the k-th signal frame; n and k are variable count values.
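  • The formula itself is not reproduced in this text, so the following is only a minimal sketch of one plausible form. It assumes that the average signal-to-echo ratio of a frame is the mean sample power of the input sound signal divided by the mean sample power of the near-end interference signal, expressed in dB. The function name `average_ser_db`, the dB scale and the `eps` guard are assumptions made for illustration and do not come from the source.

```python
import math
from typing import Sequence


def average_ser_db(mic_frame: Sequence[float],
                   ref_frame: Sequence[float],
                   eps: float = 1e-12) -> float:
    """Hypothetical average signal-to-echo ratio of one frame, in dB.

    mic_frame: samples of the input sound signal in frame k
               (sample power P_m(k, n) is taken as sample ** 2)
    ref_frame: samples of the near-end interference signal in frame k
               (sample power P_x(k, n) is taken as sample ** 2)
    """
    if not mic_frame or len(mic_frame) != len(ref_frame):
        raise ValueError("frames must be non-empty and of equal length")
    p_m = sum(s * s for s in mic_frame) / len(mic_frame)  # mean(P_m(k, n))
    p_x = sum(s * s for s in ref_frame) / len(ref_frame)  # mean(P_x(k, n))
    # eps (an assumption) avoids division by zero and log of zero on silence.
    return 10.0 * math.log10((p_m + eps) / (p_x + eps))
```

  • Under these assumptions, a frame whose input power is 100 times the interference power gives roughly 20 dB, and equal powers give 0 dB.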
  • Step S402 in FIG. 4, which acquires multiple preset signal-to-echo ratio thresholds and constructs multiple signal-to-echo ratio intervals from them, may include: using each pair of adjacent thresholds among the acquired thresholds as the boundary values of one signal-to-echo ratio interval, thereby obtaining multiple signal-to-echo ratio intervals.
  • The preset signal-to-echo ratio thresholds are SER_thr_1, SER_thr_2, SER_thr_3, ..., SER_thr_k, and the signal-to-echo ratio intervals are formed between these thresholds.
  • The intervals can be denoted signal-to-echo ratio interval 501, signal-to-echo ratio interval 502, ..., signal-to-echo ratio interval 50k, where k is a variable that indexes the k-th interval; k+1 signal-to-echo ratio thresholds yield k intervals.
  • A corresponding double-talk judgment threshold is set for each signal-to-echo ratio interval, namely double-talk judgment thresholds m1, m2, ..., mk in FIG. 5.
  • Once the interval containing the average signal-to-echo ratio of the current frame is identified, the corresponding double-talk judgment threshold is obtained; this is step S403.
  • The signal-to-echo ratio intervals are thus constructed automatically, with the preset signal-to-echo ratio thresholds as the interval boundary values.
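  • The interval construction and lookup of steps S402 and S403 can be sketched as a sorted-boundary search. This is a hedged illustration only: the function name `double_talk_threshold`, the half-open interval convention, and the clamping of out-of-range values to the nearest interval are assumptions, and the numeric values in the usage note are made up.

```python
from bisect import bisect_right
from typing import Sequence


def double_talk_threshold(avg_ser: float,
                          ser_thresholds: Sequence[float],
                          dt_thresholds: Sequence[float]) -> float:
    """Return the double-talk judgment threshold of the interval containing avg_ser.

    ser_thresholds: k+1 boundary values in ascending order
    dt_thresholds:  k values (m1 ... mk), one per interval
    """
    if len(ser_thresholds) != len(dt_thresholds) + 1:
        raise ValueError("k+1 SER thresholds are needed for k intervals")
    # Assumed convention: interval i covers
    # [ser_thresholds[i], ser_thresholds[i + 1]).
    i = bisect_right(ser_thresholds, avg_ser) - 1
    # Assumption: values outside the outermost boundaries fall back to the
    # nearest interval instead of raising an error.
    i = min(max(i, 0), len(dt_thresholds) - 1)
    return dt_thresholds[i]
```

  • For example, with hypothetical boundaries [-10, 0, 10, 20] and thresholds [0.2, 0.4, 0.6], an average signal-to-echo ratio of 5 falls in the second interval and selects 0.4.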
  • The utterance state includes two states: far-end utterance only and not far-end utterance only. The preset mapping relationship between utterance state and processing mode includes: when the utterance state is far-end utterance only, the near-end speech estimation signal is zeroed or suppressed to inaudible; when the utterance state is not far-end utterance only, the near-end speech estimation signal is retained.
  • Two utterance states can be set, namely far-end utterance only and not far-end utterance only.
  • When the current utterance state is determined to be far-end utterance only, the near-end speech estimation signal needs to be zeroed or suppressed to inaudible; that is, the near-end speech estimation signal is filtered out, and a mute signal is transmitted to the peer device of the call as the transmission signal of the local end.
  • When it is determined based on the near-end speech estimation signal that the current utterance state is not far-end utterance only, the near-end speech estimation signal needs to be retained, and it is transmitted to the peer device of the call as the transmission signal of the local end.
  • Defining two utterance states and specifying a processing mode for each can basically meet the requirements of real-time echo cancellation in common voice calls.
  • The state of not far-end utterance only itself includes two states: near-end utterance only and double-ended utterance.
  • Near-end utterance only means that the sound collection device collects only the transmission signal of the local end and no near-end interference signal; double-ended utterance means that the sound collection device collects both the local transmission signal and the near-end interference signal.
  • A processing mode can be further specified for each of these two states. For example, for near-end utterance only, the sound signal is transmitted to the peer end directly without processing.
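  • The preset mapping between utterance state and processing mode can be illustrated with a small lookup table. This is a sketch under assumptions: the names `UtteranceState`, `mute`, `keep` and `process_near_end_estimate` are hypothetical, and only the zeroing variant of "zeroed or suppressed to inaudible" is shown.

```python
from enum import Enum
from typing import Callable, Dict, List


class UtteranceState(Enum):
    FAR_END_ONLY = "far_end_only"
    NOT_FAR_END_ONLY = "not_far_end_only"


def mute(frame: List[float]) -> List[float]:
    """Zero the near-end speech estimate so that only silence is transmitted."""
    return [0.0] * len(frame)


def keep(frame: List[float]) -> List[float]:
    """Retain the near-end speech estimate unchanged."""
    return list(frame)


# Preset mapping: utterance state -> processing mode.
PROCESSING_BY_STATE: Dict[UtteranceState, Callable[[List[float]], List[float]]] = {
    UtteranceState.FAR_END_ONLY: mute,
    UtteranceState.NOT_FAR_END_ONLY: keep,
}


def process_near_end_estimate(frame: List[float],
                              state: UtteranceState) -> List[float]:
    """Look up the processing mode for the current state and apply it."""
    return PROCESSING_BY_STATE[state](frame)
```

  • Keeping the mapping in a table means that finer-grained states, such as near-end utterance only and double-ended utterance, could be given their own entries without changing the lookup.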
  • Step S102 in FIG. 1, which performs adaptive filtering on the input sound signal to obtain a near-end speech estimation signal, may specifically include two filtering operations: linear filtering and nonlinear filtering.
  • The input sound signal is first processed by linear filtering in a filter such as an acoustic echo cancellation (AEC) filter to eliminate part of the echo.
  • After linear filtering, the sound signal still contains linear residual echo and nonlinear echo.
  • When there is near-end utterance, it also contains near-end speech.
  • Subsequent nonlinear processing (NLP) filtering of the sound signal containing residual echo achieves further echo suppression.
  • Because the adaptive filtering of the input sound signal includes both linear filtering and nonlinear filtering, the echo in the input sound signal is suppressed further.
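  • The source does not say which linear or nonlinear filters are used. As a hedged illustration of the two-stage idea, the sketch below substitutes two textbook techniques: a normalized least-mean-squares (NLMS) adaptive filter for the linear stage and a simple center clipper for the nonlinear stage. The function names, tap count, step size `mu` and clipping floor are all assumptions.

```python
from typing import List, Sequence


def nlms_echo_cancel(mic: Sequence[float], ref: Sequence[float],
                     taps: int = 8, mu: float = 0.5,
                     eps: float = 1e-8) -> List[float]:
    """Linear stage: NLMS filter that subtracts the estimated echo of ref from mic."""
    w = [0.0] * taps   # adaptive filter coefficients
    x = [0.0] * taps   # most recent reference samples, newest first
    out = []
    for d, r in zip(mic, ref):
        x = [r] + x[:-1]
        echo_est = sum(wi * xi for wi, xi in zip(w, x))
        e = d - echo_est                        # residual after linear cancellation
        norm = sum(xi * xi for xi in x) + eps   # reference energy (regularized)
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        out.append(e)
    return out


def nlp_suppress(residual: Sequence[float], floor: float = 0.01) -> List[float]:
    """Nonlinear stage: center clipper that zeroes samples with magnitude <= floor."""
    return [s if abs(s) > floor else 0.0 for s in residual]
```

  • In this sketch `ref` plays the role of the near-end interference signal used as the filtering reference, and the output of `nlp_suppress` corresponds to the near-end speech estimation signal. A real NLP stage is usually frequency-dependent; the center clipper is used here only because it is short.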
  • The embodiment of the application also provides an echo cancellation device based on double-ended utterance detection. Referring to FIG. 6, the device may include an input sound signal acquisition module 601, a filtering module 602, an utterance state determination module 603, a processing mode acquisition module 604, a near-end processing module 605 and an output module 606, where:
  • The input sound signal acquisition module 601 is configured to acquire the input sound signal from the sound collection device.
  • The filtering module 602 is configured to perform adaptive filtering on the input sound signal to obtain a near-end speech estimation signal.
  • The utterance state determination module 603 is configured to determine the current utterance state according to the near-end speech estimation signal.
  • The processing mode acquisition module 604 is configured to obtain a preset mapping relationship between utterance state and processing mode, and to obtain the processing mode corresponding to the current utterance state according to the mapping relationship.
  • The near-end processing module 605 is configured to process the near-end speech estimation signal according to the processing mode.
  • The output module 606 is configured to output the processed near-end speech estimation signal to obtain an output signal.
  • The utterance state determination module 603 may include:
  • a real-time utterance state acquisition unit, configured to calculate the average value of the double-ended utterance state statistic of the current frame in the input sound signal according to the input sound signal and the near-end speech estimation signal;
  • a threshold acquisition unit, configured to obtain the double-talk judgment threshold corresponding to the current frame, the double-talk judgment threshold being obtained according to the signal-to-echo ratio of the input sound signal and the near-end interference signal;
  • an utterance state determination unit, configured to determine the current utterance state according to the magnitude relationship between the average value of the double-ended utterance state statistic of the current frame and the double-talk judgment threshold.
  • The threshold acquisition unit includes:
  • a current signal-to-echo ratio acquisition subunit, configured to estimate the signal-to-echo ratio of the input sound signal in real time to obtain the average signal-to-echo ratio of the current frame in the input sound signal;
  • a signal-to-echo ratio interval construction subunit, configured to obtain multiple preset signal-to-echo ratio thresholds and to construct multiple signal-to-echo ratio intervals according to these thresholds;
  • a threshold judging subunit, configured to determine the signal-to-echo ratio interval to which the average signal-to-echo ratio of the current frame belongs, and to take the double-talk judgment threshold corresponding to that interval as the double-talk judgment threshold of the current frame.
  • The signal-to-echo ratio interval construction subunit is further configured to use each pair of adjacent thresholds among the obtained signal-to-echo ratio thresholds as the boundary values of one signal-to-echo ratio interval, thereby obtaining multiple signal-to-echo ratio intervals.
  • The filtering module 602 in FIG. 6 is further configured to perform linear filtering and nonlinear filtering on the input sound signal to obtain the near-end speech estimation signal.
  • The embodiment of the present application also provides an echo cancellation system based on double-ended utterance detection, including a sound collection device, a same-end sounding device, and an echo cancellation device.
  • The echo cancellation device performs the steps of the echo cancellation method based on double-ended utterance detection provided in FIGS. 1 to 5.
  • FIG. 7 is a schematic diagram of an echo cancellation system based on double-ended utterance detection; the system includes a sound collection device 701, an echo cancellation device 702, and a same-end sounding device 703.
  • The sound collection device 701 may be a microphone in telephone communication, used to collect the input sound signal A1.
  • The same-end sounding device 703 may be a loudspeaker at the same end as the microphone in telephone communication. The sound it generates may interfere with the input sound signal A1, so it is treated as the interference sound signal A6.
  • The echo cancellation device 702 is a device that implements the echo cancellation method based on double-ended utterance detection of FIGS. 1 to 5 of this application.
  • The functions of the echo cancellation device can be realized by physical or logic circuits, software programming, and the like.
  • The echo cancellation device 702 may include a linear AEC filter 7021, an NLP filter 7022, a double-ended utterance detector 7023, a signal-to-echo ratio estimator 7024, a threshold determiner 7025 and a processor 7026.
  • The echo cancellation device 702 processes the sound signals received from the sound collection device 701 and the same-end sounding device 703 during communication as follows:
  • The input sound signal A1 is linearly filtered by the linear AEC filter 7021 to obtain the linearly filtered sound signal A2; the NLP filter 7022 then applies nonlinear filtering to A2 to obtain the near-end speech estimation signal A3, which is used as one input signal of the double-ended utterance detector 7023.
  • The input sound signal A1 is used directly as the other input signal of the double-ended utterance detector 7023.
  • The linear AEC filter 7021 uses the interference sound signal A6 as the filtering reference when linearly filtering the input sound signal A1.
  • The input sound signal A1 is also input to the signal-to-echo ratio estimator 7024, which calculates the average signal-to-echo ratio A4 of the current frame in real time and passes it to the threshold determiner 7025. Based on the signal-to-echo ratio intervals constructed from the preset signal-to-echo ratio thresholds, the threshold determiner 7025 determines the double-talk judgment threshold A5 corresponding to the average signal-to-echo ratio A4 and sends it to the double-ended utterance detector 7023 as the basis for judging the current utterance state.
  • The signal-to-echo ratio estimator 7024 samples the input sound signal A1 and the interference sound signal A6, and calculates the average signal-to-echo ratio A4 of the current frame according to the following formula:
  • The double-ended utterance detector 7023 acquires the first input signal (the near-end speech estimation signal A3), the second input signal (the input sound signal A1) and the double-talk judgment threshold A5, and determines the current utterance state A7 in real time from this information.
  • The current utterance state is obtained from the average value of the double-ended utterance state statistic of the current frame.
  • The double-ended utterance detector 7023 samples the near-end speech estimation signal A3 and the input sound signal A1, and calculates the average value of the double-ended utterance state statistic of the current frame according to the following formula:
  • The double-ended utterance detector 7023 sends the current utterance state A7 to the processor 7026, which processes the near-end speech estimation signal A3 according to the current utterance state A7.
  • The processing mode is: when the utterance state is far-end utterance only, the near-end speech estimation signal A3 is zeroed or suppressed to inaudible; when the utterance state is not far-end utterance only, the near-end speech estimation signal A3 is retained.
  • The processed near-end speech estimation signal is output to obtain an output signal A8, which can be transmitted to the peer communication device over the communication link.
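  • The per-frame decision made by the double-ended utterance detector 7023 and the processor 7026 can be sketched as follows. The statistic formula is not reproduced in this text, so the sketch assumes a stand-in: the power ratio of A3 to A1 within the frame, with a value below the double-talk judgment threshold A5 taken to mean far-end utterance only. Both the statistic and the direction of the comparison are assumptions, as are the names `frame_power` and `process_frame`.

```python
from typing import List, Sequence


def frame_power(frame: Sequence[float]) -> float:
    """Mean sample power of one frame."""
    if not frame:
        raise ValueError("frame must be non-empty")
    return sum(s * s for s in frame) / len(frame)


def process_frame(a1: Sequence[float], a3: Sequence[float],
                  dt_threshold: float, eps: float = 1e-12) -> List[float]:
    """Return output frame A8 from input frame A1 and near-end estimate A3.

    Assumption: the statistic is power(A3) / power(A1); when only the far end
    speaks, echo cancellation leaves little energy in A3, so the ratio is small.
    """
    statistic = frame_power(a3) / (frame_power(a1) + eps)
    far_end_only = statistic < dt_threshold   # current utterance state A7
    if far_end_only:
        return [0.0] * len(a3)                # zero the near-end estimate
    return list(a3)                           # retain the near-end estimate
```

  • With a hypothetical threshold of 0.1, a frame whose estimate carries 0.01% of the input power is muted, while one carrying 64% of the input power is passed through unchanged.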
  • The above echo cancellation system based on double-ended utterance detection detects the acoustic echo generated during communication in real time and eliminates it according to the detection result, which improves the echo cancellation effect, and hence the call quality, when the voice communication terminal is in hands-free mode.
  • The echo cancellation method, device and system based on double-ended utterance detection provided in the embodiments of the present application can distinguish in real time between far-end utterance only on the one hand and near-end utterance only or double-ended utterance on the other.
  • When only the far end is speaking, the time-domain output is zeroed or suppressed to inaudible, so that the echo is eliminated to the greatest extent while duplex call performance is preserved, improving echo cancellation and duplex performance at the same time.
  • The purpose is to improve the duplex call experience of hands-free voice communication terminals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The invention relates to an echo cancellation method, device and system using double-talk detection. The method comprises the following steps: acquiring an input sound signal from a sound collection device; performing adaptive filtering on the input sound signal to obtain a near-end speech estimation signal; determining the current utterance state according to the near-end speech estimation signal; acquiring preset mappings between utterance states and processing modes, and acquiring, according to the mappings, the processing mode corresponding to the current utterance state; processing the near-end speech estimation signal according to the processing mode; and outputting the processed near-end speech estimation signal to obtain an output signal. The method improves echo cancellation and improves the duplex call experience of hands-free voice communication terminals.
PCT/CN2020/114168 2019-12-13 2020-09-09 Echo cancellation method, apparatus and system using double-talk detection Ceased WO2021114779A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911284296.3 2019-12-13
CN201911284296.3A CN110995951B (zh) 2019-12-13 2019-12-13 Echo cancellation method, device and system based on double-ended utterance detection

Publications (1)

Publication Number Publication Date
WO2021114779A1 true WO2021114779A1 (fr) 2021-06-17

Family

ID=70093348

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/114168 Ceased WO2021114779A1 (fr) 2019-12-13 2020-09-09 Echo cancellation method, apparatus and system using double-talk detection

Country Status (2)

Country Link
CN (1) CN110995951B (fr)
WO (1) WO2021114779A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
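CN115512691A (zh) * 2022-10-11 2022-12-23 四川虹微技术有限公司 Method for judging echo at the semantic level in continuous human-machine dialogue
CN115706875A (zh) * 2021-08-07 2023-02-17 深圳方位通讯科技有限公司 Intercom voice quality optimization method, apparatus, device and storage medium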

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
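CN110995951B (zh) * 2019-12-13 2021-09-03 展讯通信(上海)有限公司 Echo cancellation method, device and system based on double-ended utterance detection
CN111556210B (zh) * 2020-04-23 2021-10-22 深圳市未艾智能有限公司 Call voice processing method and apparatus, terminal device and storage medium
CN113225442B (zh) * 2021-04-16 2022-09-02 杭州网易智企科技有限公司 Method and apparatus for eliminating echo
CN113241085B (zh) * 2021-04-29 2022-07-22 北京梧桐车联科技有限责任公司 Echo cancellation method, apparatus, device and readable storage medium
CN113808609B (zh) * 2021-09-18 2024-07-19 展讯通信(上海)有限公司 Echo detection method and apparatus, computer-readable storage medium, and terminal device
CN113936682A (zh) * 2021-11-30 2022-01-14 杭州网易智企科技有限公司 Echo cancellation method and apparatus, storage medium and electronic device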

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101179294A (zh) * 2006-11-09 2008-05-14 爱普拉斯通信技术(北京)有限公司 Adaptive echo canceller and echo cancellation method thereof
CN102655558A (zh) * 2012-05-21 2012-09-05 宁波工程学院 Double-talk robust structure and method thereof for eliminating acoustic echo
US20120281603A1 (en) * 2011-05-06 2012-11-08 Futurewei Technologies, Inc. Transmit Phase Control for the Echo Cancel Based Full Duplex Transmission System
CN107635082A (zh) * 2016-07-18 2018-01-26 深圳市有信网络技术有限公司 Double-ended utterance detection system
CN110995951A (zh) * 2019-12-13 2020-04-10 展讯通信(上海)有限公司 Echo cancellation method, device and system based on double-ended utterance detection

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3922997B2 (ja) * 2002-10-30 2007-05-30 沖電気工業株式会社 Echo canceller
CN1925346A (zh) * 2006-09-05 2007-03-07 华为技术有限公司 Method for detecting double-talk state in echo cancellation
JP4411309B2 (ja) * 2006-09-21 2010-02-10 Okiセミコンダクタ株式会社 Double-talk detection method
US8345860B1 (en) * 2008-05-09 2013-01-01 Hellosoft India PVT. Ltd Method and system for detection of onset of near-end signal in an echo cancellation system
WO2010083641A1 (fr) * 2009-01-20 2010-07-29 华为技术有限公司 Method and apparatus for double-talk detection
CN101719969B (zh) * 2009-11-26 2013-10-02 美商威睿电通公司 Method and system for determining double-talk, and method and system for eliminating echo
CN102377453B (zh) * 2010-08-06 2014-02-26 联芯科技有限公司 Method and apparatus for controlling adaptive filter updating, and echo canceller
CN101917527B (zh) * 2010-09-02 2013-07-03 杭州华三通信技术有限公司 Echo cancellation method and apparatus
CN102065190B (zh) * 2010-12-31 2013-08-28 杭州华三通信技术有限公司 Echo cancellation method and apparatus thereof
CN103179296B (zh) * 2011-12-26 2017-02-15 中兴通讯股份有限公司 Echo canceller and echo cancellation method
US9100466B2 (en) * 2013-05-13 2015-08-04 Intel IP Corporation Method for processing an audio signal and audio receiving circuit
CN104519212B (zh) * 2013-09-27 2017-06-20 华为技术有限公司 Method and apparatus for eliminating echo
US20160171988A1 (en) * 2014-12-15 2016-06-16 Wire Swiss Gmbh Delay estimation for echo cancellation using ultrasonic markers
CN106533500B (zh) * 2016-11-25 2019-11-12 上海伟世通汽车电子系统有限公司 Method for optimizing the convergence characteristics of an echo canceller
CN108134863B (zh) * 2017-12-26 2020-06-19 中山大学花都产业科技研究院 Improved double-talk detection device and detection method based on dual statistics
CN108540680B (zh) * 2018-02-02 2021-03-02 广州视源电子科技股份有限公司 Method and apparatus for switching speaking state, and call system
CN108696648B (zh) * 2018-05-16 2021-08-24 上海小度技术有限公司 Method, apparatus, device and storage medium for short-time speech signal processing
CN109547655A (zh) * 2018-12-30 2019-03-29 广东大仓机器人科技有限公司 Method for echo cancellation processing in network voice calls
CN110138990A (zh) * 2019-05-14 2019-08-16 浙江工业大学 Method for eliminating echo in VoIP calls on mobile devices
CN110335618B (zh) * 2019-06-06 2021-07-30 福建星网智慧软件有限公司 Method for improving nonlinear echo suppression, and computer device

Also Published As

Publication number Publication date
CN110995951B (zh) 2021-09-03
CN110995951A (zh) 2020-04-10

Similar Documents

Publication Publication Date Title
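WO2021114779A1 (fr) Echo cancellation method, apparatus and system using double-talk detection
KR101089481B1 (ko) Double-talk detection method based on spectral acoustic characteristics
KR101444100B1 (ko) Method and apparatus for removing noise from mixed sound
CN105825864B (zh) Double-talk detection and echo cancellation method based on zero-crossing rate index
CN103428385B (zh) Method for processing audio signals and circuit arrangement for processing audio signals
US6792107B2 (en) Double-talk detector suitable for a telephone-enabled PC
US9094744B1 (en) Close talk detector for noise cancellation
CN102160296B (zh) Double-talk detection method and apparatus
CN104158990B (zh) Method for processing an audio signal and audio receiving circuit
CN102025852B (zh) Detection and suppression of returned audio at the near end
US9443528B2 (en) Method and device for eliminating echoes
JP4568439B2 (ja) Echo suppression device
CN108134863B (zh) Improved double-talk detection device and detection method based on dual statistics
WO2015043150A1 (fr) Echo cancellation method and apparatus
WO2019140755A1 (fr) Echo cancellation method and system based on a microphone array
JPH09172396A (ja) System and method for removing the influence of acoustic coupling
CN106571147B (zh) Method for acoustic echo suppression in network telephones
TWI506620B (zh) Communication device and voice processing method thereof
CN111742541B (zh) Acoustic echo cancellation method, apparatus and storage medium
US8041028B2 (en) Double-talk detection
WO2021077599A1 (fr) Crosstalk detection method and apparatus, computer device, and storage medium
CN111556210B (zh) Call voice processing method and apparatus, terminal device and storage medium
CN111970610A (zh) Echo path detection method, audio signal processing method and system, storage medium, and terminal
JP3047300B2 (ja) Telephone apparatus having a hands-free conversation function
CN100508031C (zh) Method for identifying and eliminating echo generated by far-end speech in an SCDMA mobile phone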

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20899723

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20899723

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 16.01.2023)
