WO2020252782A1 - Voice detection method, voice detection device, voice processing chip, and electronic equipment - Google Patents

Voice detection method, voice detection device, voice processing chip, and electronic equipment

Info

Publication number
WO2020252782A1
Authority
WO
WIPO (PCT)
Prior art keywords
time domain
domain signal
signal
amplitude
current time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2019/092361
Other languages
English (en)
French (fr)
Inventor
蒋斌
毛健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Goodix Technology Co Ltd
Original Assignee
Shenzhen Goodix Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Goodix Technology Co Ltd filed Critical Shenzhen Goodix Technology Co Ltd
Priority to PCT/CN2019/092361 priority Critical patent/WO2020252782A1/zh
Priority to CN201980001072.9A priority patent/CN110431625B/zh
Priority to EP19933225.5A priority patent/EP3800640B1/en
Priority to US17/034,096 priority patent/US11322174B2/en
Publication of WO2020252782A1 publication Critical patent/WO2020252782A1/zh

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/87Detection of discrete points within a voice signal
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/84Detection of presence or absence of voice signals for discriminating voice from noise
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93Discriminating between voiced and unvoiced parts of speech signals
    • G10L2025/937Signal energy in various frequency bands
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band

Definitions

  • The embodiments of the present application relate to the field of signal processing technology, and in particular to a voice detection method, a voice detection device, a voice processing chip, and electronic equipment.
  • Voice wake-up has a wide range of applications, such as robots, mobile phones, wearable devices, smart homes, and in-vehicle devices. Almost all devices with voice functions require voice wake-up technology as the start or entry point of human-machine interaction: it lets a device in a dormant state directly enter the state of waiting for instructions and begin the first step of voice interaction. Different products have different wake-up words; to wake up a device, the user speaks its specific wake-up word.
  • The realization of the above-mentioned voice wake-up mainly relies on a voice activity detection algorithm.
  • In the prior art, the voice activity detection algorithm is processed entirely in the frequency domain, which results in high algorithm complexity and high power consumption.
  • One of the technical problems solved by the embodiments of the present application is to provide a voice detection method, a voice detection device, a voice processing chip, and electronic equipment that overcome the above-mentioned defects in the prior art.
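  • An embodiment of the present application provides a voice detection method, which includes: processing a current time domain signal frame to obtain several subband time domain signals; and determining, according to the amplitudes of the several subband time domain signals of the current time domain signal frame, whether the current time domain signal frame is a valid voice signal.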
  • An embodiment of the present application provides a voice detection device, which includes: a subband generation module and a voice activity detection module.
  • The subband generation module is used to process a current time domain signal frame to obtain several subband time domain signals, and the voice activity detection module is used to determine whether the current time domain signal frame is a valid voice signal according to the amplitudes of the several subband time domain signals of the current time domain signal frame.
  • An embodiment of the present application provides a voice processing chip, which includes: a voice detection device and a processor.
  • The voice detection device includes: a subband generation module and a voice activity detection module.
  • The subband generation module is used to process the current time domain signal frame to obtain several subband time domain signals; the voice activity detection module is configured to determine whether the current time domain signal frame is a valid voice signal according to the amplitudes of the several subband time domain signals of the current time domain signal frame; and the processor is used to recognize the valid voice signal so as to perform voice control according to the recognition result.
  • An embodiment of the present application provides an electronic device, which includes the voice processing chip described in any embodiment of the present application.
  • In the embodiments of the present application, the current time domain signal frame is processed to obtain several subband time domain signals, and whether the current time domain signal frame is a valid voice signal is determined according to the amplitudes of those subband time domain signals. The whole procedure can be executed in the time domain, thereby reducing the complexity of the algorithm and the power consumption.
  • FIG. 1 is a schematic structural diagram of a voice detection device in Embodiment 1 of this application;
  • FIG. 2 is a schematic structural diagram of a voice detection device in Embodiment 2 of this application;
  • FIG. 3 is a schematic structural diagram of a voice detection device in Embodiment 3 of this application.
  • FIG. 4 is a schematic flowchart of a voice detection method in Embodiment 4 of this application.
  • FIG. 5 is a schematic flowchart of a voice detection method in Embodiment 5 of this application.
  • FIG. 6 is a schematic flowchart of a voice detection method in Embodiment 6 of this application.
  • In the embodiments below, the current time domain signal frame is processed to obtain several subband time domain signals, and whether the current time domain signal frame is a valid voice signal is determined according to the amplitudes of the several subband time domain signals of the current time domain signal frame. This can be executed in the time domain, thereby reducing the complexity of the algorithm and the power consumption, while still achieving a high voice detection accuracy.
  • FIG. 1 is a schematic structural diagram of a voice detection device in Embodiment 1 of this application. As shown in FIG. 1, the device includes: a subband generation module, an energy calculation module, a noise calculation module, and a voice activity detection (VAD) module.
  • The subband generation module is used to process the current time domain signal frame to obtain several subband time domain signals.
  • The energy calculation module is used to calculate the signal amplitude of the subband time domain signal in the current time domain signal frame according to the amplitudes of the several subband time domain signals of the current time domain signal frame.
  • The noise calculation module is configured to calculate the noise amplitude of the subband time domain signal according to the amplitudes of the several subband time domain signals in the current time domain signal frame.
  • The voice activity detection module is configured to determine whether the current time domain signal frame is a valid voice signal according to the amplitudes of the several subband time domain signals of the current time domain signal frame, specifically according to the noise amplitude and the signal amplitude of the subband time domain signal.
  • The current time domain signal frame comes from a voice acquisition module.
  • The voice acquisition module collects a segment of voice signal, which may actually include several time domain signal frames. Therefore, when judging whether this segment of voice signal comes from the user, that is, whether it is a valid voice signal, the segment is processed in units of frames: each time domain signal frame undergoes subband splitting, energy calculation, noise calculation, and voice activity detection to determine whether that time domain signal frame is a valid voice signal.
  • The voice acquisition module may be a microphone.
  • The subband generation module is a filter bank, which processes the current time domain signal frame according to set frequency thresholds to obtain several subband time domain signals.
  • The filter bank may include multiple filters, each with its own set frequency threshold; the filters each filter the current time domain signal frame, yielding multiple subband time domain signals.
  • Each subband time domain signal corresponds to a subband identifier.
  • The number of filters in the filter bank is set as required: to split the current time domain signal frame into a given number of subbands, the same number of filters is set.
  • When choosing the number of filters, performance must be balanced against complexity. For example, for power consumption and other reasons, 2 to 3 filters may be set.
  • The number of filters given here is only an example, not a limitation.
  • The filter is, for example, a finite impulse response (FIR) filter or an infinite impulse response (IIR) filter. Classified by frequency response, it can be a bandpass filter.
  • Specifically, the filter is a cascaded biquad IIR bandpass filter.
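  • The subband splitting described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the use of a center frequency and Q factor per band, and the coefficient formulas (the widely used constant-peak-gain band-pass biquad design) are all assumptions, since the text only states that each filter has a set frequency threshold.

```python
import math

def bandpass_biquad(center_hz, q, sample_rate_hz):
    """Return normalized (b0, b1, b2, a1, a2) for one band-pass biquad section.

    Assumption: constant 0 dB peak-gain band-pass design; the source does not
    specify how the coefficients are obtained.
    """
    w0 = 2.0 * math.pi * center_hz / sample_rate_hz
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    return (alpha / a0, 0.0, -alpha / a0,
            -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)

def run_biquad(coeffs, samples):
    """Filter a list of samples with one biquad section (direct form I)."""
    b0, b1, b2, a1, a2 = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def split_into_subbands(frame, bands, sample_rate_hz, sections=2):
    """Return {subband_id: filtered samples} for one time domain frame.

    `bands` is a list of hypothetical (center_hz, q) pairs; each band is
    realized as `sections` identical biquads in cascade.
    """
    subbands = {}
    for band_id, (center_hz, q) in enumerate(bands):
        coeffs = bandpass_biquad(center_hz, q, sample_rate_hz)
        signal = list(frame)
        for _ in range(sections):
            signal = run_biquad(coeffs, signal)
        subbands[band_id] = signal
    return subbands
```

  • In this sketch the filter state is reset for every frame to keep the code short; a streaming implementation would more likely carry the delay-line values over from one frame to the next.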
  • The energy calculation module includes: an average amplitude calculation unit, configured to calculate the average amplitude of the subband time domain signal in the current time domain signal frame; and an energy calculation unit, configured to calculate the signal amplitude of the subband time domain signal in the current time domain signal frame according to that average amplitude.
  • The energy calculation unit further uses the average amplitude of the subband time domain signal in the current time domain signal frame to characterize the signal amplitude of the subband time domain signal.
  • The current time domain signal frame refers to one frame of voice signal that participates in voice signal detection.
  • The filtering mentioned above processes one frame of voice signal, so several subband time domain signals are obtained by filtering one frame of voice signal.
  • When the energy calculation module performs the energy calculation, it works in units of subband time domain signals, that is, it calculates the signal amplitude of each subband time domain signal. Note that the calculation here can be regarded as an estimate.
  • The estimated amplitude of each subband time domain signal is used to express the corresponding signal amplitude.
  • For example, the root mean square of the amplitudes of all sampling points in a subband time domain signal, or the average of their absolute values, can be calculated to represent the above-mentioned amplitude.
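  • The two amplitude estimates named above can be written in a few lines. The function names are illustrative only, and the sketch assumes a subband time domain signal is given as a non-empty list of samples.

```python
import math

def mean_abs_amplitude(samples):
    """Average of the absolute values of all sampling points."""
    return sum(abs(x) for x in samples) / len(samples)

def rms_amplitude(samples):
    """Root mean square of all sampling points."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

# Example with four sampling points of one subband signal.
subband = [3.0, -4.0, 0.0, 1.0]
print(mean_abs_amplitude(subband))  # 2.0
print(rms_amplitude(subband))
```

  • The mean absolute value avoids both the multiplication per sample and the square root, which may matter on a low-power chip; the root mean square tracks signal power more closely.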
  • The energy calculation unit further calculates the signal amplitude of the subband time domain signal in the current time domain signal frame according to the average amplitude of the subband time domain signal in the current time domain signal frame and an amplitude smoothing value.
  • The energy calculation module is further configured to determine the amplitude smoothing value according to the amplitude smoothing coefficient and the signal amplitude of the previous time domain signal frame.
  • The amplitude smoothing coefficient is set flexibly according to the application scenario. The signal amplitude of the previous time domain signal frame is the signal amplitude obtained when that frame was itself processed as the current time domain signal frame in the voice signal detection described above.
  • The noise calculation module is further configured to calculate the noise amplitude of the subband time domain signal according to the signal amplitude of the subband time domain signal of the current time domain signal frame.
  • Specifically, the noise amplitude in the current time domain signal frame may be determined from the relationship between the signal amplitude of a subband time domain signal in the current time domain signal frame and the signal amplitude of the subband time domain signal with the same subband identifier in the previous time domain signal frame.
  • In one case, the noise calculation module is further configured to calculate the noise amplitude of the Nth subband time domain signal according to the signal amplitude of the Nth subband time domain signal in the current time domain signal frame and the noise smoothing value, where the Nth subband time domain signal is any one of the subband time domain signals and N is an integer greater than 0. Specifically, to prevent sudden changes in noise between two consecutive time domain signal frames, the noise calculation module is further configured to determine the noise smoothing values according to the noise smoothing coefficients and, respectively, the noise amplitude and the signal amplitude of the previous time domain signal frame.
  • In another case, the noise calculation module is further configured to use the signal amplitude of the Nth subband time domain signal in the current time domain signal frame directly as the noise amplitude of the Nth subband time domain signal, where the Nth subband time domain signal is any one of the subband time domain signals and N is an integer greater than 0.
  • FIG. 2 is a schematic structural diagram of the voice detection device in Embodiment 2 of this application. As shown in FIG. 2, the difference from the above embodiment is that in this embodiment, in addition to the subband generation module, the energy calculation module, the noise calculation module, and the voice activity detection module, the device also includes a voice acquisition module. That is, the voice acquisition module is a component of the voice detection device here, whereas in Embodiment 1 it is independent of the voice detection device and is not one of its components.
  • After the signal amplitudes of the multiple subband time domain signals included in the current time domain signal frame have been calculated by the method of Embodiment 1, the total signal amplitude and total noise amplitude of the current time domain signal frame can be further calculated. Therefore, to reduce resource consumption and save power, the energy calculation module is further configured to calculate the total signal amplitude of the current time domain signal frame according to the signal amplitudes of the subband time domain signals in the current time domain signal frame; the noise calculation module is further configured to calculate the total noise amplitude of the current time domain signal frame according to the noise amplitudes of the subband time domain signals in the current time domain signal frame; and the voice activity detection module is further configured to determine whether the current time domain signal frame is a valid voice signal according to the total noise amplitude and the total signal amplitude. In this embodiment, whether the current time domain signal frame is a valid voice signal is judged from the total noise amplitude and the total signal amplitude of the frame, which effectively reduces the technical complexity and lowers the resource requirements.
  • Among the multiple noise energy levels that are set, the smallest is called the lower limit of the noise energy level and the largest is called the upper limit of the noise energy level. Therefore, when judging whether the current time domain signal frame is a valid voice signal, the total noise amplitude and the total signal amplitude are each compared with the noise energy levels. If both the total noise amplitude and the total signal amplitude are less than the lower limit of the noise energy level, the voice activity detection module determines that the current time domain signal frame is an invalid voice signal; or, if the total noise amplitude is greater than or equal to the upper limit of the noise energy level, the voice activity detection module determines whether the current time domain signal frame is a valid voice signal according to a default configuration item.
  • The default configuration item can be set flexibly according to the application scenario. If the configuration item specifies that a frame whose total noise amplitude is greater than or equal to the upper limit of the noise energy level is to be considered a valid voice signal, then, when the total noise amplitude is greater than or equal to the upper limit of the noise energy level, the voice activity detection module determines that the current time domain signal frame is a valid voice signal. If the configuration item specifies that such a frame is to be considered an invalid voice signal, then, when the total noise amplitude is greater than or equal to the upper limit of the noise energy level, the voice activity detection module determines that the current time domain signal frame is an invalid voice signal.
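  • The decision logic of this embodiment can be sketched as a small function. All names are hypothetical. The sketch assumes that the quiet-case comparison is made against the lower limit of the noise energy level, and it returns `None` for frames that fall between the two limits, because the text of this embodiment does not say how such frames are decided.

```python
from typing import Optional

def classify_by_noise_level(total_noise: float,
                            total_signal: float,
                            lower_limit: float,
                            upper_limit: float,
                            valid_when_noisy: bool) -> Optional[bool]:
    """Return True (valid voice), False (invalid voice) or None (undecided).

    `valid_when_noisy` plays the role of the default configuration item.
    """
    # Quiet frame: noise and signal are both below the lowest level.
    if total_noise < lower_limit and total_signal < lower_limit:
        return False
    # Very noisy frame: fall back to the configured default.
    if total_noise >= upper_limit:
        return valid_when_noisy
    # Assumption: anything in between is left to a further check.
    return None

print(classify_by_noise_level(0.2, 0.3, 1.0, 10.0, valid_when_noisy=True))   # False
print(classify_by_noise_level(12.0, 15.0, 1.0, 10.0, valid_when_noisy=True)) # True
```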
  • FIG. 3 is a schematic structural diagram of the voice detection device in Embodiment 3 of this application. As shown in FIG. 3, different from the above embodiment, in this embodiment, in addition to the subband generation module, the energy calculation module, the noise calculation module, and the voice activity detection module, the device further includes a signal-to-noise ratio calculation module, which calculates the signal-to-noise ratio of the subband time domain signal according to the noise amplitude and the signal amplitude of the several subband time domain signals of the current time domain signal frame.
  • Correspondingly, the voice activity detection module is further configured to determine whether the current time domain signal frame is a valid voice signal according to the total noise amplitude of the current time domain signal frame and the signal-to-noise ratio of the subband time domain signal of the current time domain signal frame.
  • Multiple signal-to-noise ratio levels are set, so that whether the current time domain signal frame is a valid voice signal is determined based on the signal-to-noise ratio of the subband time domain signal of the current time domain signal frame and the signal-to-noise ratio levels.
  • The multiple signal-to-noise ratio levels may be set in correspondence with the multiple noise energy levels of the subband time domain signal of the current time domain signal frame.
  • The lower limit of the noise energy level corresponds to the upper limit of the signal-to-noise ratio level. If the total noise amplitude of the current time domain signal frame is less than or equal to the lower limit of the noise energy level, it is determined whether the signal-to-noise ratio of the subband time domain signal of the current time domain signal frame is greater than or equal to the upper limit of the signal-to-noise ratio level. If it is, the voice activity detection module determines that the current time domain signal frame is a valid voice signal; otherwise, it determines that the frame is an invalid voice signal.
  • The upper limit of the noise energy level corresponds to the lower limit of the signal-to-noise ratio level. If the total noise amplitude of the current time domain signal frame is greater than or equal to the upper limit of the noise energy level, it is determined whether the signal-to-noise ratio of the subband time domain signal of the current time domain signal frame is greater than or equal to the lower limit of the signal-to-noise ratio level. If it is, the voice activity detection module determines that the current time domain signal frame is a valid voice signal; otherwise, it determines that the frame is an invalid voice signal.
  • An intermediate threshold of the signal-to-noise ratio level, lying between the upper limit and the lower limit of the signal-to-noise ratio level, is set correspondingly. If the total noise amplitude of the current time domain signal frame is greater than or equal to the intermediate threshold of the noise energy level, it is determined whether the signal-to-noise ratio of the subband time domain signal of the current time domain signal frame is greater than or equal to the corresponding intermediate threshold of the signal-to-noise ratio level. If it is, the voice activity detection module determines that the current time domain signal frame is a valid voice signal; otherwise, it determines that the frame is an invalid voice signal.
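  • The tiered comparison above can be sketched as follows. Everything named here is hypothetical. The sketch assumes that the signal-to-noise ratio of a subband is the ratio of its signal amplitude to its noise amplitude, that a frame counts as valid voice when any one subband reaches the required ratio, and that total noise amplitudes below the intermediate threshold are all held to the upper limit of the signal-to-noise ratio level; the text does not fix any of these three points.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Levels:
    noise_lower: float   # lower limit of the noise energy level
    noise_mid: float     # intermediate threshold of the noise energy level
    noise_upper: float   # upper limit of the noise energy level
    snr_lower: float     # lower limit of the signal-to-noise ratio level
    snr_mid: float       # intermediate threshold of the signal-to-noise ratio level
    snr_upper: float     # upper limit of the signal-to-noise ratio level

def required_snr(total_noise: float, lv: Levels) -> float:
    """Pick the SNR threshold that corresponds to the current noise tier."""
    if total_noise >= lv.noise_upper:
        return lv.snr_lower      # loud noise: accept a low SNR
    if total_noise >= lv.noise_mid:
        return lv.snr_mid
    return lv.snr_upper          # quiet: demand a high SNR (assumed below noise_mid)

def frame_is_voice(total_noise, signal_amps, noise_amps, lv, eps=1e-12):
    """True if any subband SNR reaches the tier's threshold (assumed rule)."""
    threshold = required_snr(total_noise, lv)
    return any(s / max(n, eps) >= threshold
               for s, n in zip(signal_amps, noise_amps))

levels = Levels(noise_lower=1.0, noise_mid=5.0, noise_upper=10.0,
                snr_lower=2.0, snr_mid=4.0, snr_upper=8.0)
print(frame_is_voice(0.5, [9.0, 1.0], [1.0, 1.0], levels))  # True
print(frame_is_voice(12.0, [3.0], [2.0], levels))           # False
```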
  • The above embodiments describe the voice detection device as including an energy calculation module and a noise calculation module only as an example; this does not mean that the energy calculation module and the noise calculation module are indispensable for implementing this application.
  • FIG. 4 is a schematic flowchart of the voice detection method in Embodiment 4 of this application. As shown in FIG. 4, the method includes:
  • The subband generation module processes the current time domain signal frame to obtain several subband time domain signals.
  • A filter bank serves as the subband generation module and filters the current time domain signal frame to obtain the several subband time domain signals.
  • The current time domain signal frame comes from the voice acquisition module.
  • The energy calculation module calculates the signal amplitude of the subband time domain signal in the current time domain signal frame according to the amplitudes of the several subband time domain signals of the current time domain signal frame, and the noise calculation module calculates the noise amplitude of the subband time domain signal.
  • The signal amplitude of the subband time domain signal in the current time domain signal frame is calculated according to the average amplitude of the subband time domain signal in the current time domain signal frame. In a specific implementation, if the signal amplitude is calculated according to the average amplitude of the subband time domain signal in the current time domain signal frame and the amplitude smoothing value, the following formulas can be referred to.
  • The average amplitude calculation unit uses formula (1) to calculate the average amplitude of each subband time domain signal in the current time domain signal frame.
  • Here, x_{m,i}(n) represents the i-th sampling point of the m-th subband time domain signal of the n-th frame time domain signal, and E_m(n) is the average amplitude of the m-th subband time domain signal of the n-th frame time domain signal. The n-th frame time domain signal is the current time domain signal frame, i is the sampling point index, and N is the number of sampling points.
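  • Formula (1) itself does not appear in this text. One form consistent with the symbol definitions above, assuming the mean-absolute-value estimate and sampling points indexed from 1 to N, would be:

```latex
E_m(n) = \frac{1}{N} \sum_{i=1}^{N} \left| x_{m,i}(n) \right|
```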
  • The energy calculation unit calculates the signal amplitude of the subband time domain signal in the current time domain signal frame by formula (2), and this signal amplitude is used to characterize the corresponding subband time domain signal.
  • Here, S_m(n) represents the signal amplitude of the m-th subband time domain signal of the n-th frame time domain signal; S_m(n-1) represents the signal amplitude of the m-th subband time domain signal of the (n-1)-th frame time domain signal; E_m(n) is the average amplitude of the m-th subband time domain signal of the n-th frame time domain signal; and α1 is the amplitude smoothing coefficient, 0 < α1 < 1.
  • The signal amplitude S_m(n-1) of the m-th subband time domain signal of the (n-1)-th frame time domain signal may be a smoothed amplitude, and n is greater than or equal to 1.
  • An initial amplitude can be set according to the application scenario to represent S_m(n-1) in the above formula.
  • The smoothing is mainly intended to avoid a sudden change in amplitude between the subband time domain signals of two consecutive frames.
  • More simply, the initial amplitude can be set directly to 0.
  • The amplitude smoothing value α1*S_m(n-1) is determined according to the amplitude smoothing coefficient α1 and the signal amplitude S_m(n-1) of the previous time domain signal frame.
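  • The amplitude smoothing can be illustrated with a short sketch. It assumes that formula (2) has the common first-order form S_m(n) = α1*S_m(n-1) + (1-α1)*E_m(n). Only the term α1*S_m(n-1) is stated in the text; the weight (1-α1) on the average amplitude and all function names are assumptions.

```python
def smooth_amplitude(prev_signal_amp, avg_amp, alpha1):
    """One smoothing step: alpha1 * S_m(n-1) + (1 - alpha1) * E_m(n).

    The (1 - alpha1) weight is an assumption, not taken from the source.
    """
    if not 0.0 < alpha1 < 1.0:
        raise ValueError("alpha1 must lie strictly between 0 and 1")
    return alpha1 * prev_signal_amp + (1.0 - alpha1) * avg_amp

def track_signal_amplitude(avg_amps, alpha1, initial_amp=0.0):
    """Apply the smoothing frame by frame, starting from an initial amplitude."""
    s = initial_amp
    history = []
    for e in avg_amps:
        s = smooth_amplitude(s, e, alpha1)
        history.append(s)
    return history

# A constant average amplitude of 1.0, starting from an initial amplitude of 0.
print(track_signal_amplitude([1.0, 1.0, 1.0], alpha1=0.5))  # [0.5, 0.75, 0.875]
```

  • A coefficient closer to 1 makes the signal amplitude react more slowly to a change between frames, which is the stated purpose of the smoothing.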
  • step S402 when the noise calculation module calculates the noise amplitude of the sub-band time domain signal, if the signal amplitude of the sub-band time domain signal in the current time domain signal frame is compared with the current time domain signal in the previous time domain signal frame The relationship between the signal amplitude of the subband time domain signals with the same subband identifier in the frame determines the noise amplitude in the current time domain signal frame. Therefore, if there are the following situations:
  • the noise calculation module is further configured to smooth the noise according to the signal amplitude and noise of the Nth subband time domain signal in the current time domain signal frame Calculate the noise amplitude of the Nth subband time-domain signal; specifically, in order to prevent sudden changes in the noise amplitude of two consecutive time-domain signal frames, the noise calculation module is further used to calculate the noise amplitude according to the noise smoothing coefficient and the previous time-domain signal frame The noise amplitude and the signal amplitude respectively determine the noise smoothing value.
  • $N_m(n)$ represents the noise amplitude of the m-th subband time-domain signal of the n-th frame
  • $N_m(n-1)$ represents the noise amplitude of the m-th subband time-domain signal of the (n-1)-th frame
  • $S_m(n)$ represents the signal amplitude of the m-th subband time-domain signal of the n-th frame
  • $S_m(n-1)$ represents the signal amplitude of the m-th subband time-domain signal of the (n-1)-th frame
  • $\gamma$ and $\beta$ are noise smoothing coefficients, $0<\gamma<1$, $0<\beta<1$, and n is greater than or equal to 1.
  • when n = 1 there is no (n-1)-th frame, so initial amplitudes for $N_m(n-1)$ and $S_m(n-1)$ can be set in the above formula according to the application scenario.
  • the smoothing mainly avoids abrupt changes in amplitude between the subband time-domain signals of two consecutive frames.
  • more directly, the initial amplitudes of $N_m(n-1)$ and $S_m(n-1)$ can simply be zero.
  • when n is greater than 1, $N_m(n-1)$ and $S_m(n-1)$ represent the corresponding smoothed amplitudes.
  • the noise smoothing values are determined from the noise smoothing coefficients and from the noise amplitude and the signal amplitude of the previous time-domain signal frame, respectively.
  • $\gamma N_m(n-1)$ is one noise smoothing value, and a second term (given only as an image in the source) is the other; in brief, a first and a second noise smoothing coefficient are set, and the first noise smoothing value is obtained from the first noise smoothing coefficient and the noise amplitude of the previous time-domain signal frame.
  • the second smoothing value is obtained from the first and second noise smoothing coefficients and the signal amplitude of the previous time-domain signal frame, which avoids abrupt noise changes in the m-th subband time-domain signal of the n-th frame of the current voice signal x(i).
  • when the signal amplitude of the N-th subband time-domain signal in the current frame is less than or equal to the noise amplitude of the N-th subband time-domain signal in the previous frame, the noise calculation module directly uses that signal amplitude as the noise amplitude of the N-th subband time-domain signal.
  • in this case the noise amplitude of the m-th subband time-domain signal of the n-th frame is calculated with formula (4).
  • $N_m(n) = S_m(n)$ (4)
  • $N_m(n)$ represents the noise amplitude of the m-th subband time-domain signal of the n-th frame
  • $S_m(n)$ represents the signal amplitude of the m-th subband time-domain signal of the n-th frame
  • $S_m(n-1)$ represents the signal amplitude of the m-th subband time-domain signal of the (n-1)-th frame, which may be a smoothed amplitude.
  • as formula (3) shows, the noise amplitude of a subband time-domain signal is calculated in step S402 from the signal amplitude of that subband signal in the current frame. Further, when the signal amplitude of a subband time-domain signal in the current frame is greater than the noise amplitude of the subband time-domain signal with the same subband identifier in the previous frame, the noise amplitude in the current frame is calculated from that signal amplitude and the noise smoothing values.
  • as formula (4) shows, when the signal amplitude of a subband time-domain signal in the current frame is calculated in step S402, the average amplitude of that subband signal is calculated first, and the signal amplitude is then calculated from the average amplitude.
  • when the noise amplitude is calculated, if the signal amplitude of a subband time-domain signal in the current frame is less than or equal to the noise amplitude of the subband time-domain signal with the same subband identifier in the previous frame, that signal amplitude is used directly as the noise amplitude of the subband time-domain signal in the current frame.
  • the voice activity detection module determines whether the current time domain signal frame is a valid voice signal according to the noise amplitude of the subband time domain signal and the signal amplitude.
  • in step S403, noise energy levels and energy levels are set for the subband time-domain signals, and the voice activity detection module can compare the noise amplitude and the signal amplitude of the subband time-domain signals with these levels to determine whether the n-th frame of the current voice signal x(i) is a valid voice signal.
  • FIG. 5 is a schematic flowchart of the voice detection method in Embodiment 5 of this application; as shown in FIG. 5, it includes the following steps:
  • the subband generation module processes the current time domain signal frame to obtain several subband time domain signals.
  • the energy calculation module calculates the signal amplitude of the subband time domain signal in the current time domain signal frame, and the noise calculation module calculates the noise amplitude of the subband time domain signal in the current time domain signal frame;
  • steps S501 and S502 are respectively similar to S401 and S402 in the embodiment shown in FIG. 4.
  • S503 Calculate the total signal amplitude of the current time domain signal frame according to the signal amplitude of the subband time domain signal in the current time domain signal frame;
  • S t (n) represents the total signal amplitude of the time domain signal of the nth frame.
  • S t (n) is actually the sum of the signal amplitudes of the M subband time domain signals of the nth frame time domain signal.
  • S504 Calculate the total noise amplitude of the current time domain signal frame according to the noise amplitude of the subband time domain signal;
  • N t (n) represents the total noise amplitude of the n-th frame time domain signal, which is used to characterize the total noise amplitude.
  • N t (n) is actually the sum of the noise amplitudes of the M subband time domain signals of the nth frame time domain signal.
  • S505 Determine whether the current time domain signal frame is a valid voice signal according to the total noise amplitude and the total signal amplitude.
  • in step S505, since multiple noise energy levels are set as described above, if the total noise amplitude and the total signal amplitude are both less than the lower limit of the noise energy level, the current time-domain signal frame is determined to be an invalid voice signal.
  • the number K of noise energy levels is set according to the required decision accuracy.
  • if $N_t(n) < thn(1)$ and $S_t(n) < thn(1)$, that is, the total signal amplitude and the total noise amplitude of the n-th frame of the current voice signal x(i) are both less than the lower limit of the noise energy level, the noise intensity is very low and there is no speech, so the n-th frame is determined to be an invalid voice signal.
  • if the total noise amplitude is greater than or equal to the upper limit of the noise energy level, whether the current time-domain signal frame is a valid voice signal is determined according to a default configuration item.
  • if $N_t(n) > thn(K)$, that is, the total noise amplitude of the n-th frame is greater than the upper limit of the noise energy level, the noise intensity is very high and a decision is difficult to make.
  • FIG. 6 is a schematic flowchart of a voice detection method in Embodiment 6 of this application; as shown in FIG. 6, it includes:
  • the subband generation module processes the current time domain signal frame to obtain several subband time domain signals.
  • the energy calculation module calculates the signal amplitude of the subband time domain signal in the current time domain signal frame, and the noise calculation module calculates the noise amplitude of the subband time domain signal in the current time domain signal frame;
  • S603 Calculate the signal to noise ratio of the subband time domain signal in the current time domain signal frame according to the noise amplitude of the subband time domain signal in the current time domain signal frame and the signal amplitude;
  • the signal-to-noise ratio is calculated with reference to the following formula (7).
  • $SNR_m(n)$ in formula (7) represents the signal-to-noise ratio of the m-th subband time-domain signal of the n-th frame.
  • S604 Determine whether the current time domain signal frame is a valid speech signal according to the total noise amplitude of the current time domain signal frame and the signal-to-noise ratio of the sub-band time domain signal.
  • step S604 may specifically include determining whether the current time domain signal frame is a valid speech signal according to the signal-to-noise ratio and the signal-to-noise ratio level of the sub-band time domain signal of the current time domain signal frame.
  • the signal-to-noise ratio is closely related to the total noise amplitude, and multiple noise energy levels are set for the total noise amplitude.
  • correspondingly, multiple signal-to-noise ratio levels can be set, with a mapping between the noise energy levels and the signal-to-noise ratio levels, so as to determine whether the n-th frame is a valid voice signal.
  • the noise energy level corresponds to the signal-to-noise ratio level
  • the noise energy level thn(1) to thn(K) are sorted from the minimum to the maximum
  • thn(1) is the lower limit of the noise energy level
  • thn(K) is the upper limit of the noise energy level
  • the SNR levels can be sorted from thsnr(1) to thsnr(K) from maximum to minimum
  • thsnr(1) is the upper limit of the SNR level
  • thsnr(K) is the lower limit of the SNR level
  • a smaller noise energy level corresponds to a larger signal-to-noise ratio level
  • a larger noise energy level corresponds to a smaller signal-to-noise ratio level
  • the number of noise energy levels is equal to the number of signal-to-noise ratio levels.
  • the lower limit of the noise energy level corresponds to the upper limit of the SNR level: if the total noise amplitude of the current time-domain signal frame is less than or equal to the lower limit of the noise energy level, it is determined whether the SNR of a subband time-domain signal of the current frame is greater than or equal to the upper limit of the SNR level; if so, the current frame is determined to be a valid voice signal, otherwise an invalid voice signal.
  • that is, if $N_t(n) \le thn(1)$, the subband SNR is compared with the upper limit thsnr(1): if $SNR_m(n)$ is greater than or equal to thsnr(1), the current frame is determined to be a valid voice signal; otherwise it is determined to be an invalid voice signal.
  • the upper limit of the noise energy level corresponds to the lower limit of the SNR level: if the total noise amplitude of the current frame is greater than or equal to the upper limit of the noise energy level, it is determined whether the SNR of a subband time-domain signal of the current frame is greater than or equal to the lower limit of the SNR level; if so, the current frame is determined to be a valid voice signal, otherwise an invalid voice signal.
  • that is, if $N_t(n) > thn(K)$, the subband SNR is compared with the lower limit thsnr(K): if $SNR_m(n)$ is greater than or equal to thsnr(K), the current frame is determined to be a valid voice signal; otherwise it is determined to be an invalid voice signal.
  • between the two limits, intermediate thresholds of the noise energy level correspond to intermediate thresholds of the SNR level: if the total noise amplitude of the current frame is greater than or equal to an intermediate threshold of the noise energy level, it is determined whether the subband SNR is greater than or equal to the corresponding intermediate threshold of the SNR level; if so, the current frame is determined to be a valid voice signal, otherwise an invalid voice signal.
  • the intermediate threshold of the noise energy level is thn(q), where the index q lies between 1 and K
  • thn(q) can be any noise energy level between thn(1) and thn(K); if $N_t(n)$ lies between thn(q-1) and thn(q), the subband SNR is compared with the intermediate threshold thsnr(q-1), which corresponds to the noise energy level thn(q-1): if $SNR_m(n)$ is greater than or equal to thsnr(q-1), the current frame is determined to be a valid voice signal; otherwise it is determined to be an invalid voice signal.
  • the intermediate threshold of the noise energy level can be regarded as any one of the noise energy levels. In this embodiment, if $N_t(n)$ lies between thn(q-1) and thn(q), the subband SNR can alternatively be compared with the intermediate threshold thsnr(q), which corresponds to the noise energy level thn(q). In low noise a larger SNR level is selected for the comparison, and in higher noise a smaller SNR level is selected, which allows a more accurate decision on whether the frame is a valid voice signal.
  • in effect, the above process first determines the noise energy level corresponding to $N_t(n)$ and then, from that comparison, the SNR level thsnr(q) corresponding to that noise energy level;
  • $SNR_m(n)$ is then compared with thsnr(q), and if the SNR of any subband time-domain signal in the n-th frame is greater than the corresponding SNR level thsnr(q), the n-th frame is determined to be a valid voice signal.
  • a portion of the historical voice signal can be buffered ahead of the next stage of voice signal transmission.
  • when a valid voice signal is detected, the historical voice signal can be fetched from the buffer and transmitted, which effectively moves the detection time earlier and ensures that the low-amplitude speech at the start of an utterance is not missed.
  • the size of the buffer area can be flexibly configured according to the application scenario. That is, when it is determined that a valid voice signal is detected, the detected valid voice is buffered.
  • FIG. 5 is a schematic structural diagram of the voice processing chip in Embodiment 5 of the application; as shown in Figure 5, it includes: a voice detection device and a processor.
  • the voice detection device includes: a subband generation module, an energy calculation module, a noise calculation module, and a voice activity detection module.
  • the subband generation module is used to process the current time-domain signal frame to obtain several subband time-domain signals, and the energy calculation module is used to calculate the signal amplitude of the subband time-domain signals in the current frame.
  • the noise calculation module is used to calculate the noise amplitude of the subband time-domain signals, and the voice activity detection module is used to determine, according to the amplitudes of the several subband time-domain signals of the current frame, whether the current frame is a valid voice signal.
  • the processor is configured to recognize the valid voice signal and to perform voice control according to the recognition result.
  • the technical solution may also be configured to handle only one of the situations. For example, the total signal amplitude and the total noise amplitude are used to determine whether the current time-domain signal frame is a valid voice signal: if a decision can be made from them, it is made directly; if not, processing jumps directly to the next time-domain signal frame, or simple handling is performed with reference to the default configuration item, to save power and reduce the complexity of the solution.
  • a determination that the voice signal is valid may indicate that there is a voice signal from the signal source of interest, and a determination that it is invalid may indicate that there is none.
  • An embodiment of the application further provides an electronic device, which includes the voice processing chip described in any embodiment of the application.
  • Mobile communication equipment: this type of equipment is characterized by mobile communication functions, and its main goal is to provide voice and data communications.
  • Such terminals include: smart phones (such as iPhone), multimedia phones, feature phones, and low-end phones.
  • Ultra-mobile personal computer equipment: this type of equipment belongs to the category of personal computers, has calculation and processing functions, and generally also has mobile Internet features.
  • Such terminals include: PDA, MID and UMPC devices, such as iPad.
  • Portable entertainment equipment: this type of equipment can display and play multimedia content.
  • Such devices include: audio and video players (such as iPod), handheld game consoles, e-books, as well as smart toys and portable car navigation devices.
  • a typical implementation device is a computer.
  • the computer may be, for example, a personal computer, a laptop computer, a cell phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or Any combination of these devices.
  • the embodiments of the present application can be provided as methods, systems, or computer program products. Therefore, the present application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • These computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including the instruction device.
  • the device implements the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.
  • These computer program instructions can also be loaded on a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, so as to execute on the computer or other programmable equipment.
  • the instructions provide steps for implementing functions specified in a flow or multiple flows in the flowchart and/or a block or multiple blocks in the block diagram.
  • this application can be provided as methods, systems, or computer program products. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • This application may be described in the general context of computer-executable instructions executed by a computer, such as program modules.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific transactions or implement specific abstract data types.
  • This application can also be practiced in distributed computing environments. In these distributed computing environments, remote processing devices connected through a communication network execute transactions.
  • program modules can be located in local and remote computer storage media including storage devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Telephone Function (AREA)

Abstract

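A voice detection method, a voice detection device, a voice processing chip, and an electronic device. The voice detection device includes a subband generation module and a voice activity detection module. The subband generation module processes a current time-domain signal frame to obtain several subband time-domain signals, and the voice activity detection module determines, according to the amplitudes of the several subband time-domain signals of the current time-domain signal frame, whether the current time-domain signal frame is a valid voice signal. The voice detection device can operate entirely in the time domain, which reduces algorithm complexity and power consumption.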

Description

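Voice detection method, voice detection device, voice processing chip, and electronic device
Technical Field
The embodiments of this application relate to the field of signal processing technology, and in particular to a voice detection method, a voice detection device, a voice processing chip, and an electronic device.
Background
Voice wake-up has a wide range of applications, for example in robots, mobile phones, wearable devices, smart homes, and in-vehicle devices. Almost every device with a voice function needs voice wake-up as the starting point or entry of human-machine interaction: it brings a device in the sleep state directly into a state of waiting for commands, which is the first step of voice interaction. Different products use different wake-up words, and a user who wants to wake up a device must say its specific wake-up word.
Voice wake-up relies mainly on a voice activity detection algorithm. In the prior art, however, voice activity detection algorithms all operate in the frequency domain, which makes the algorithm complex and in turn leads to high power consumption.
Summary of the Invention
In view of this, one of the technical problems solved by the embodiments of this application is to provide a voice detection method, a voice detection device, a voice processing chip, and an electronic device that overcome the above defects in the prior art.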
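An embodiment of this application provides a voice detection method, which includes:
processing a current time-domain signal frame to obtain several subband time-domain signals;
determining, according to the amplitudes of the several subband time-domain signals of the current time-domain signal frame, whether the current time-domain signal frame is a valid voice signal.
An embodiment of this application provides a voice detection device, which includes a subband generation module and a voice activity detection module. The subband generation module is used to process a current time-domain signal frame to obtain several subband time-domain signals, and the voice activity detection module is used to determine, according to the amplitudes of the several subband time-domain signals of the current time-domain signal frame, whether the current time-domain signal frame is a valid voice signal.
An embodiment of this application provides a voice processing chip, which includes a voice detection device and a processor. The voice detection device includes a subband generation module and a voice activity detection module. The subband generation module is used to process a current time-domain signal frame to obtain several subband time-domain signals, and the voice activity detection module is used to determine, according to the amplitudes of the several subband time-domain signals of the current time-domain signal frame, whether the current time-domain signal frame is a valid voice signal. The processor is used to recognize the valid voice signal and to perform voice control according to the recognition result.
An embodiment of this application provides an electronic device, which includes the voice processing chip described in any embodiment of this application.
In the solution provided by the embodiments of this application, a current time-domain signal frame is processed to obtain several subband time-domain signals, and whether the current time-domain signal frame is a valid voice signal is determined according to the amplitudes of those subband time-domain signals. The detection can therefore run entirely in the time domain, which reduces algorithm complexity and power consumption.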
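Brief Description of the Drawings
Some specific embodiments of this application are described in detail below, by way of example and not limitation, with reference to the accompanying drawings. The same reference numerals in the drawings denote the same or similar components or parts. Those skilled in the art should understand that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a schematic structural diagram of the voice detection device in Embodiment 1 of this application;
FIG. 2 is a schematic structural diagram of the voice detection device in Embodiment 2 of this application;
FIG. 3 is a schematic structural diagram of the voice detection device in Embodiment 3 of this application;
FIG. 4 is a schematic flowchart of the voice detection method in Embodiment 4 of this application;
FIG. 5 is a schematic flowchart of the voice detection method in Embodiment 5 of this application;
FIG. 6 is a schematic flowchart of the voice detection method in Embodiment 6 of this application.
Detailed Description
Implementing any one technical solution of the embodiments of this application does not necessarily require achieving all of the above advantages at the same time.
The specific implementation of the embodiments of this application is further described below with reference to the accompanying drawings.
In the embodiments of this application, a current time-domain signal frame is processed to obtain several subband time-domain signals, and whether the current time-domain signal frame is a valid voice signal is determined according to the amplitudes of those subband time-domain signals. The detection can therefore run entirely in the time domain, which reduces algorithm complexity and power consumption while keeping a high voice detection accuracy.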
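FIG. 1 is a schematic structural diagram of the voice detection device in Embodiment 1 of this application. As shown in FIG. 1, the device includes a subband generation module, an energy calculation module, a noise calculation module, and a voice activity detection (VAD) module. The subband generation module processes the current time-domain signal frame to obtain several subband time-domain signals. The energy calculation module calculates the signal amplitude of each subband time-domain signal in the current frame from the amplitudes of those subband signals, and the noise calculation module calculates the noise amplitude of each subband time-domain signal from the same amplitudes. When determining, according to the amplitudes of the several subband time-domain signals of the current frame, whether the current frame is a valid voice signal, the voice activity detection module specifically bases the decision on the noise amplitude and the signal amplitude of the subband time-domain signals.
In this embodiment, the current time-domain signal frame comes from a voice acquisition module. For example, within one sampling period the voice acquisition module captures a segment of voice signal, which may in fact contain several time-domain signal frames. Determining whether this segment comes from a user, that is, whether it is a valid voice signal, is therefore done frame by frame: each time-domain signal frame undergoes subband splitting, energy calculation, noise calculation, and voice activity detection, which determines whether that frame is a valid voice signal. In one specific application scenario, the voice acquisition module may be a microphone.
Specifically, the subband generation module is a filter bank that processes the current time-domain signal frame according to set frequency thresholds to obtain several subband time-domain signals. The filter bank may include multiple filters, each with its own set frequency threshold; the filters each filter the current time-domain signal frame, yielding multiple subband time-domain signals. Each subband time-domain signal has a corresponding subband identifier.
In this embodiment, the number of sub-filters in the filter bank is set as needed: the current frame is split into as many subbands as there are sub-filters. When choosing the number of filters, performance must be balanced against complexity; for example, for power-consumption reasons, 2 to 3 filters may be used. This number is only an example and not a limitation.
Further, in one specific application scenario, each filter is, for example, a finite impulse response (FIR) filter or an infinite impulse response (IIR) filter. Classified by frequency response, it may be a band-pass filter, for example a band-pass filter built from cascaded biquad (second-order) IIR sections.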
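The filter bank described above can be sketched as follows. This is only an illustration under stated assumptions: the coefficient formulas are the widely used "Audio EQ Cookbook" band-pass design, and the function names, center frequencies, Q values, and the choice of two cascaded sections are hypothetical, none of them taken from the application.

```python
import math


def bandpass_biquad(f0, q, fs):
    """Normalized (b, a) coefficients of one second-order band-pass section.

    Assumption: Audio EQ Cookbook design with unity gain at the center
    frequency f0; the application does not specify a design method.
    """
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a


def biquad_filter(b, a, x):
    """Filter the sequence x with one biquad (direct form I, zero initial state)."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


def split_into_subbands(frame, bands, fs, sections=2):
    """Return one subband time-domain signal per (f0, q) pair in bands."""
    subbands = []
    for f0, q in bands:
        b, a = bandpass_biquad(f0, q, fs)
        y = list(frame)
        for _ in range(sections):  # cascade identical sections
            y = biquad_filter(b, a, y)
        subbands.append(y)
    return subbands
```

For simplicity the sketch resets the filter state for every call; a streaming implementation would normally carry the state over from one frame to the next.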
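In this embodiment, the energy calculation module includes: an average amplitude calculation unit, which calculates the average amplitude of each subband time-domain signal in the current frame; and an energy calculation unit, which calculates the signal amplitude of each subband time-domain signal in the current frame from that average amplitude. The energy calculation unit further uses the average amplitude of a subband time-domain signal in the current frame to characterize that subband's signal amplitude. As noted above, a captured segment of voice signal may include several frames, and the current time-domain signal frame is the one frame currently undergoing voice detection. Because the filtering described above is applied to a single frame, filtering one frame yields several subband time-domain signals. The energy calculation module works per subband time-domain signal, that is, it calculates the signal amplitude of every subband time-domain signal; this calculation can be regarded as an estimate.
Further, in one application scenario, the signal amplitude of each subband time-domain signal is represented by its estimated amplitude, which can be characterized, for example, by the root mean square of the amplitudes of all sampling points in the subband time-domain signal or by the mean of their absolute values.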
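The two amplitude estimates named above, the mean of absolute values and the root mean square, can be sketched as below. The function names are hypothetical and chosen only for illustration.

```python
import math


def mean_abs_amplitude(samples):
    """Mean of the absolute sample values of one subband signal."""
    return sum(abs(v) for v in samples) / len(samples)


def rms_amplitude(samples):
    """Root mean square of the sample values of one subband signal."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))
```

The mean absolute value avoids the multiplications and the square root, which may matter on a low-power chip; the root mean square weights large samples more heavily.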
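Further, to prevent abrupt changes in signal amplitude between two consecutive time-domain signal frames, the energy calculation unit calculates the signal amplitude of each subband time-domain signal in the current frame from the average amplitude of that subband signal in the current frame and an amplitude smoothing value.
Specifically, the energy calculation module determines the amplitude smoothing value from an amplitude smoothing coefficient and the signal amplitude of the previous time-domain signal frame. The amplitude smoothing coefficient is set flexibly according to the application scenario, and the signal amplitude of the previous frame is simply the signal amplitude obtained when that frame was itself processed as the current time-domain signal frame.
From a signal processing point of view, the influence of noise is reflected in the signal amplitude of the current frame. In this embodiment, therefore, the noise calculation module calculates the noise amplitude of a subband time-domain signal from the signal amplitude of that subband signal in the current frame. Because the subband signal belongs to the current frame and the signal amplitude of the previous frame is already known, the latter serves as an effective reference for determining the noise amplitude in the current frame. In a specific implementation, the noise amplitude in the current frame can be determined from the relationship between the signal amplitude of a subband time-domain signal in the current frame and the signal amplitude of the subband time-domain signal with the same subband identifier in the previous frame. The following cases can arise:
(1) When the signal amplitude of the N-th subband time-domain signal in the current frame is greater than the noise amplitude of the N-th subband time-domain signal in the previous frame, the noise calculation module calculates the noise amplitude of the N-th subband time-domain signal from the signal amplitude of the N-th subband time-domain signal in the current frame and the noise smoothing values, where the N-th subband time-domain signal is any one of the subband time-domain signals and N is an integer greater than 0. Specifically, to prevent abrupt changes in noise between two consecutive frames, the noise calculation module determines the noise smoothing values from the noise smoothing coefficients and from the noise amplitude and the signal amplitude of the previous frame, respectively.
(2) When the signal amplitude of the N-th subband time-domain signal in the current frame is less than or equal to the noise amplitude of the N-th subband time-domain signal in the previous frame, the noise calculation module directly uses the signal amplitude of the N-th subband time-domain signal in the current frame as its noise amplitude, where the N-th subband time-domain signal is any one of the subband time-domain signals and N is an integer greater than 0.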
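FIG. 2 is a schematic structural diagram of the voice detection device in Embodiment 2 of this application. As shown in FIG. 2, unlike the above embodiment, the device in this embodiment includes a voice acquisition module in addition to the subband generation module, the energy calculation module, the noise calculation module, and the voice activity detection module. In other words, the voice acquisition module is part of the voice detection device here, whereas in Embodiment 1 it is independent of the voice detection device.
In this embodiment, the signal amplitudes of the subband time-domain signals of the current frame are calculated as in Embodiment 1, from which the total signal amplitude and the total noise amplitude of the current frame can further be calculated. To reduce resource consumption and save power, the energy calculation module therefore calculates the total signal amplitude of the current frame from the signal amplitudes of its subband time-domain signals, the noise calculation module calculates the total noise amplitude of the current frame from the noise amplitudes of its subband time-domain signals, and the voice activity detection module determines from the total noise amplitude and the total signal amplitude whether the current frame is a valid voice signal. In other words, this embodiment judges the frame from its total noise amplitude and total signal amplitude, which reduces the complexity of the solution and its resource requirements.
Further, multiple noise energy levels are set in this embodiment; the smallest is called the lower limit of the noise energy level and the largest the upper limit. When determining whether the current frame is a valid voice signal, the total noise amplitude and the total signal amplitude are each compared with the noise energy levels. If both are less than the lower limit of the noise energy level, the voice activity detection module determines that the current frame is an invalid voice signal. Alternatively, if the total noise amplitude is greater than or equal to the upper limit of the noise energy level, whether the current frame is a valid voice signal is determined according to a default configuration item. The default configuration item can be set flexibly according to the application scenario. If it specifies that the frame is to be treated as a valid voice signal when the total noise amplitude is greater than or equal to the upper limit, the voice activity detection module determines the frame to be valid in that case; if it specifies that the frame is to be treated as an invalid voice signal, the module determines the frame to be invalid.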
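FIG. 3 is a schematic structural diagram of the voice detection device in Embodiment 3 of this application. As shown in FIG. 3, unlike the above embodiments, the device in this embodiment includes, in addition to the subband generation module, the energy calculation module, the noise calculation module, and the voice activity detection module, a signal-to-noise ratio (SNR) calculation module, which calculates the SNR of the subband time-domain signals from the noise amplitudes and the signal amplitudes of the several subband time-domain signals of the current frame. The voice activity detection module determines whether the current frame is a valid voice signal from the total noise amplitude of the current frame and the SNR of its subband time-domain signals.
In this embodiment, multiple SNR levels are set, so that whether the current frame is a valid voice signal is determined from the SNR of its subband time-domain signals and the SNR levels.
Specifically, in one application scenario, multiple SNR levels can be set in correspondence with the multiple noise energy levels of the subband time-domain signals of the current frame.
Specifically, the following cases can arise:
(1) The lower limit of the noise energy level corresponds to the upper limit of the SNR level. If the total noise amplitude of the current frame is less than or equal to the lower limit of the noise energy level, it is determined whether the SNR of the subband time-domain signal of the current frame is greater than or equal to the upper limit of the SNR level; if it is, the voice activity detection module determines that the current frame is a valid voice signal, and otherwise that it is an invalid voice signal.
(2) The upper limit of the noise energy level corresponds to the lower limit of the SNR level. If the total noise amplitude of the current frame is greater than or equal to the upper limit of the noise energy level, it is determined whether the SNR of the subband time-domain signal of the current frame is greater than or equal to the lower limit of the SNR level; if it is, the voice activity detection module determines that the current frame is a valid voice signal, and otherwise that it is an invalid voice signal.
(3) Between the lower and upper limits of the noise energy level, intermediate thresholds of the SNR level are set correspondingly between the upper and lower limits of the SNR level. If the total noise amplitude of the current frame is greater than or equal to an intermediate threshold of the noise energy level, it is determined whether the SNR of the subband time-domain signal of the current frame is greater than or equal to the corresponding intermediate threshold of the SNR level; if it is, the voice activity detection module determines that the current frame is a valid voice signal, and otherwise that it is an invalid voice signal.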
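Cases (1) to (3) amount to a threshold lookup followed by a comparison, which can be sketched as below. Several details are assumptions rather than statements of the application: the function names are hypothetical, the interval rule thn(q-1) ≤ N_t < thn(q) → thsnr(q-1) follows one of the two variants mentioned in the method description, and treating the frame as valid when any one subband passes the threshold follows the closing remark of that description.

```python
def select_snr_threshold(n_t, thn, thsnr):
    """Pick the SNR level that corresponds to the total noise amplitude n_t.

    thn is sorted ascending (thn[0] is the lower limit), thsnr descending
    (thsnr[0] is the upper limit); both lists have the same length K.
    """
    if len(thn) != len(thsnr):
        raise ValueError("noise and SNR level lists must have equal length")
    if n_t <= thn[0]:            # case (1): very low noise
        return thsnr[0]
    if n_t >= thn[-1]:           # case (2): very high noise
        return thsnr[-1]
    for q in range(1, len(thn)):  # case (3): thn[q-1] <= n_t < thn[q]
        if n_t < thn[q]:
            return thsnr[q - 1]   # assumed variant: level of the lower bound
    return thsnr[-1]


def is_valid_voice(n_t, subband_snrs, thn, thsnr):
    """True if any subband SNR reaches the threshold selected for n_t."""
    threshold = select_snr_threshold(n_t, thn, thsnr)
    return any(snr >= threshold for snr in subband_snrs)
```

With the hypothetical levels `thn = [1.0, 5.0, 10.0]` and `thsnr = [8.0, 4.0, 2.0]`, a quiet frame must reach an SNR of 8 to count as speech, while a very noisy frame only needs an SNR of 2.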
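It should be noted that the above embodiments merely take as an example a voice detection device that includes an energy calculation module and a noise calculation module; this does not mean that these two modules are indispensable for implementing this application.
FIG. 4 is a schematic flowchart of the voice detection method in Embodiment 4 of this application. As shown in FIG. 4, it includes:
S401: The subband generation module processes the current time-domain signal frame to obtain several subband time-domain signals.
In this embodiment, referring to the example shown in FIG. 1, a filter bank serves as the subband generation module and filters the current time-domain signal frame to obtain several subband time-domain signals.
In this embodiment, the current time-domain signal frame comes from the voice acquisition module. For example, within one sampling period, the voice acquisition module captures the current voice signal x(i) at the current sampling instant i and obtains it through analog-to-digital conversion. Every N samples x(i) form one time-domain signal frame; the n-th frame is denoted x(n) and serves as the current time-domain signal frame. Further, if filtering the n-th frame x(n) yields M subband time-domain signals in total, the m-th of them is denoted $x_m(n)$, m = 1, …, M.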
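The framing step, in which every N samples form one time-domain signal frame, can be sketched as follows. The function name is hypothetical, and dropping an incomplete trailing frame is an assumption; the application does not say how a partial frame is handled.

```python
def split_into_frames(samples, frame_len):
    """Group consecutive samples into frames of frame_len samples each.

    Assumption: an incomplete trailing frame is dropped.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    count = len(samples) // frame_len
    return [samples[k * frame_len:(k + 1) * frame_len] for k in range(count)]
```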
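S402: The energy calculation module calculates the signal amplitude of each subband time-domain signal in the current frame from the amplitudes of the several subband time-domain signals of the current frame, and the noise calculation module calculates the noise amplitude of each subband time-domain signal.
Specifically, as in the above embodiments, the signal amplitude of a subband time-domain signal in the current frame is calculated from the average amplitude of that subband signal in the current frame. In a specific implementation, if the signal amplitude is calculated from the average amplitude together with an amplitude smoothing value, formulas (1) and (2) below may be used.
Specifically, in this embodiment, the average amplitude calculation unit calculates the average amplitude of each subband time-domain signal in the current frame using formula (1):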
Figure PCTCN2019092361-appb-000001
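In formula (1), $x_{m,i}(n)$ denotes the m-th subband time-domain signal of the n-th frame, $E_m(n)$ is the average amplitude of the m-th subband time-domain signal of the n-th frame, the n-th frame is the current time-domain signal frame, i indexes the sampling points, and N is the number of sampling points.
Further, the energy calculation unit calculates the signal amplitude of each subband time-domain signal in the current frame using formula (2); this signal amplitude characterizes the signal amplitude of the subband time-domain signal.
$$S_m(n) = \alpha_1 S_m(n-1) + (1-\alpha_1) E_m(n) \tag{2}$$
$S_m(n)$ is the signal amplitude of the m-th subband time-domain signal of the n-th frame, $S_m(n-1)$ is the signal amplitude of the m-th subband time-domain signal of the (n-1)-th frame, $E_m(n)$ is the average amplitude of the m-th subband time-domain signal of the n-th frame, and $\alpha_1$ is the amplitude smoothing coefficient, $0<\alpha_1<1$. Note that $S_m(n-1)$ may itself be a smoothed amplitude, and that n is greater than or equal to 1.
As a special case, when n = 1 there is no (n-1)-th frame, so an initial amplitude can be set according to the application scenario to stand in for $S_m(n-1)$ in the formula. Since the smoothing mainly serves to avoid abrupt amplitude changes between the subband time-domain signals of two frames, the initial amplitude can, more directly, simply be 0.
As formula (2) shows, the amplitude smoothing value $\alpha_1 S_m(n-1)$ is determined from the amplitude smoothing coefficient $\alpha_1$ and the signal amplitude $S_m(n-1)$ of the previous time-domain signal frame.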
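The recursion of formula (2) can be sketched as below for one subband. The function names are hypothetical; the default initial amplitude of 0 follows the special case described for n = 1.

```python
def smooth_amplitude(prev_s, avg_e, alpha1):
    """Formula (2): S_m(n) = alpha1 * S_m(n-1) + (1 - alpha1) * E_m(n)."""
    if not 0.0 < alpha1 < 1.0:
        raise ValueError("alpha1 must lie strictly between 0 and 1")
    return alpha1 * prev_s + (1.0 - alpha1) * avg_e


def track_amplitude(avg_amplitudes, alpha1, initial=0.0):
    """Apply formula (2) frame by frame to the average amplitudes E_m(1..n)."""
    s = initial  # stands in for S_m(0), which does not exist
    smoothed = []
    for e in avg_amplitudes:
        s = smooth_amplitude(s, e, alpha1)
        smoothed.append(s)
    return smoothed
```

A coefficient close to 1 reacts slowly to changes in the average amplitude, while a coefficient close to 0 follows it almost immediately.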
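When the noise calculation module calculates the noise amplitude of a subband time-domain signal in step S402, the noise amplitude in the current frame is determined from the relationship between the signal amplitude of the subband time-domain signal in the current frame and that of the subband time-domain signal with the same subband identifier in the previous frame. The following cases can arise:
(1) When the signal amplitude of the N-th subband time-domain signal in the current frame is greater than the noise amplitude of the N-th subband time-domain signal in the previous frame, where the N-th subband time-domain signal is any one of the subband time-domain signals and N is an integer greater than 0, the noise calculation module calculates the noise amplitude of the N-th subband time-domain signal from the signal amplitude of the N-th subband time-domain signal in the current frame and the noise smoothing values. Specifically, to prevent abrupt changes in noise amplitude between two consecutive frames, the noise calculation module determines the noise smoothing values from the noise smoothing coefficients and from the noise amplitude and the signal amplitude of the previous frame, respectively.
For this case, to keep noise tracking continuous while it has not yet been determined whether the frame is a valid voice signal, the noise amplitude of the m-th subband time-domain signal of the n-th frame is calculated with formula (3):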
Figure PCTCN2019092361-appb-000002
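In formula (3), $N_m(n)$ is the noise amplitude of the m-th subband time-domain signal of the n-th frame, $N_m(n-1)$ is the noise amplitude of the m-th subband time-domain signal of the (n-1)-th frame, $S_m(n)$ is the signal amplitude of the m-th subband time-domain signal of the n-th frame, $S_m(n-1)$ is the signal amplitude of the m-th subband time-domain signal of the (n-1)-th frame, and $\gamma$ and $\beta$ are noise smoothing coefficients, $0<\gamma<1$, $0<\beta<1$, with n greater than or equal to 1.
As a special case, when n = 1 there is no (n-1)-th frame, so initial amplitudes can be set for $N_m(n-1)$ and $S_m(n-1)$ according to the application scenario. Since the smoothing mainly serves to avoid abrupt amplitude changes between the subband time-domain signals of two frames, the initial amplitudes of $N_m(n-1)$ and $S_m(n-1)$ can, more directly, simply be 0. When n is greater than 1, $N_m(n-1)$ and $S_m(n-1)$ are the corresponding smoothed amplitudes.
In this embodiment, when the noise of a subband time-domain signal is calculated, the noise smoothing values are determined from the noise smoothing coefficients and from the noise amplitude and the signal amplitude of the previous frame, respectively. As formula (3) shows, $\gamma N_m(n-1)$ is one noise smoothing value, and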
Figure PCTCN2019092361-appb-000003
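is the other noise smoothing value. In brief: a first and a second noise smoothing coefficient are set; the first noise smoothing value is obtained from the first noise smoothing coefficient and the noise amplitude of the previous frame, and the second smoothing value is obtained from the first and second noise smoothing coefficients and the signal amplitude of the previous frame. This avoids abrupt noise changes in the m-th subband time-domain signal of the n-th frame of the current voice signal x(i).
(2) When the signal amplitude of the N-th subband time-domain signal in the current frame is less than or equal to the noise amplitude of the N-th subband time-domain signal in the previous frame, where the N-th subband time-domain signal is any one of the subband time-domain signals and N is an integer greater than 0, the noise calculation module directly uses the signal amplitude of the N-th subband time-domain signal in the current frame as its noise amplitude.
For this case, the noise amplitude of the m-th subband time-domain signal of the n-th frame is calculated with formula (4):
$$N_m(n) = S_m(n) \tag{4}$$
In formula (4), $N_m(n)$ is the noise amplitude of the m-th subband time-domain signal of the n-th frame and $S_m(n)$ is the signal amplitude of the m-th subband time-domain signal of the n-th frame; $S_m(n-1)$ is the signal amplitude of the m-th subband time-domain signal of the (n-1)-th frame, which may be a smoothed amplitude.
As formula (3) shows, the noise amplitude of a subband time-domain signal is calculated in step S402 from the signal amplitude of that subband signal in the current frame. Further, when the signal amplitude of a subband time-domain signal in the current frame is greater than the noise amplitude of the subband time-domain signal with the same subband identifier in the previous frame, the noise amplitude in the current frame is calculated from that signal amplitude and the noise smoothing values.
As formula (4) shows, when the signal amplitude of a subband time-domain signal in the current frame is calculated in step S402, the average amplitude of that subband signal is calculated first, and the signal amplitude is then calculated from the average amplitude. When the noise amplitude is calculated, if the signal amplitude of a subband time-domain signal in the current frame is less than or equal to the noise amplitude of the subband time-domain signal with the same subband identifier in the previous frame, that signal amplitude is used directly as the noise amplitude of the subband time-domain signal in the current frame.
Note that the cases covered by formulas (3) and (4) need not both appear in the same embodiment. In a specific implementation, depending on the application scenario, only formula (3) or only formula (4) may be used to calculate the noise amplitude.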
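The two-case noise update can be sketched as below. Formula (3) is available here only as an image, so the rising branch is an assumption: it uses the well-known continuous minimum-tracking recursion, which is consistent with the description (one term $\gamma N_m(n-1)$, and a second term built from both coefficients and the signal amplitudes) but is not confirmed to be the application's exact formula. The function name is hypothetical.

```python
def update_noise(prev_noise, prev_signal, signal, gamma, beta):
    """Return N_m(n) from N_m(n-1), S_m(n-1) and S_m(n).

    Falling or flat branch: formula (4), N_m(n) = S_m(n).
    Rising branch: ASSUMED form of formula (3),
        N_m(n) = gamma * N_m(n-1)
                 + (1 - gamma) / (1 - beta) * (S_m(n) - beta * S_m(n-1)).
    """
    if not (0.0 < gamma < 1.0 and 0.0 < beta < 1.0):
        raise ValueError("gamma and beta must lie strictly between 0 and 1")
    if signal > prev_noise:
        return (gamma * prev_noise
                + (1.0 - gamma) / (1.0 - beta) * (signal - beta * prev_signal))
    return signal
```

The design intent is that the noise estimate drops immediately when the signal falls below it and rises only gradually when the signal grows, so short bursts of speech do not inflate the noise floor.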
S403、语音活动检测模块根据所述子带时域信号的噪声幅度以及所述信号幅度判断所述当前时域信号帧是否是有效语音信号。
在步骤S403中,针对所述子带时域信号设置多个子带时序信号的噪声能量等级、能量等级,语音活动检测模块具体可以根据所述子带时域信号的噪声幅度以及所述信号幅度与噪声能量等级、能量等级进行比对,从而确定当前语音信号x(i)中第n帧时域信号是否是有效语音信号。
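FIG. 5 is a schematic flowchart of the voice detection method in Embodiment 5 of this application. As shown in FIG. 5, the method includes the following steps:
S501. The sub-band generation module processes the current time-domain signal frame to obtain a plurality of sub-band time-domain signals.
S502. The energy calculation module calculates the signal amplitude of the sub-band time-domain signals in the current time-domain signal frame, and the noise calculation module calculates the noise amplitude of the sub-band time-domain signals in the current time-domain signal frame.
In this embodiment, steps S501 and S502 are similar to S401 and S402, respectively, in the embodiment shown in FIG. 4.
S503. The total signal amplitude of the current time-domain signal frame is calculated from the signal amplitudes of the sub-band time-domain signals in the current time-domain signal frame.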
S_t(n) = \sum_{m=1}^{M} S_m(n)    (5)
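S_t(n) denotes the total signal amplitude of the n-th time-domain signal frame.
As formula (5) shows, S_t(n) is the sum of the signal amplitudes of the M sub-band time-domain signals of the n-th time-domain signal frame.
S504. The total noise amplitude of the current time-domain signal frame is calculated from the noise amplitudes of the sub-band time-domain signals.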
N_t(n) = \sum_{m=1}^{M} N_m(n)    (6)
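N_t(n) denotes the total noise amplitude of the n-th time-domain signal frame.
As formula (6) shows, N_t(n) is the sum of the noise amplitudes of the M sub-band time-domain signals of the n-th time-domain signal frame.
S505. Whether the current time-domain signal frame is a valid voice signal is determined from the total noise amplitude and the total signal amplitude.
In this embodiment, when step S505 determines whether the current time-domain signal frame is a valid voice signal, multiple noise energy levels have been set as described above. If the total noise amplitude and the total signal amplitude are both below the lower limit of the noise energy levels, the current time-domain signal frame is determined to be an invalid voice signal.
For example, in one application scenario, noise energy levels thn(k), k = 1, …, K, are defined. thn(1) is the lower limit of the noise energy levels, also called the lowest noise energy level, and thn(K) is the upper limit, also called the highest noise energy level. As k increases, thn(k) increases, indicating stronger noise. The number of noise energy levels K is set according to the required decision accuracy.
If N_t(n) < thn(1) && S_t(n) < thn(1), that is, both the total signal amplitude and the total noise amplitude of the n-th time-domain signal frame of the current voice signal x(i) are below the lower limit of the noise energy levels, the noise intensity is very low and there is no speech, so the n-th time-domain signal frame is determined to be an invalid voice signal.
In this case the voice activity detection module produces the output signal VAD(n) = 0, indicating that the n-th time-domain signal frame is an invalid voice signal.
In another application scenario, if the total noise amplitude is greater than or equal to the upper limit of the noise energy levels, it is difficult to decide whether the frame is a valid voice signal, so the decision is made according to a default configuration item.
If N_t(n) > thn(K), that is, the total noise amplitude of the n-th time-domain signal frame exceeds the upper limit of the noise energy levels, the noise intensity is very high and a decision is hard to make. If a default configuration item D_highnoise has been set, the voice activity detection module produces the output signal VAD(n) = D_highnoise: if D_highnoise = 0, the n-th time-domain signal frame is determined to be an invalid voice signal, and if D_highnoise = 1, it is determined to be a valid voice signal.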
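Steps S503 to S505 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name, the list-based representation of the levels thn(1)…thn(K) in ascending order, and the `None` return value for frames that the totals alone do not decide are all assumptions made for the example.

```python
def vad_from_totals(signal_amps, noise_amps, thn, d_highnoise=0):
    """Decide VAD(n) from the total signal and total noise amplitude.

    signal_amps -- S_m(n) for the M sub-bands of the current frame
    noise_amps  -- N_m(n) for the same sub-bands
    thn         -- noise energy levels in ascending order, thn[0] = thn(1)
    d_highnoise -- default configuration item used when noise is very high

    Returns 0 or 1, or None when neither rule applies (hypothetical choice).
    """
    if not thn:
        raise ValueError("at least one noise energy level is required")
    s_total = sum(signal_amps)  # total signal amplitude S_t(n), formula (5)
    n_total = sum(noise_amps)   # total noise amplitude N_t(n), formula (6)
    if n_total < thn[0] and s_total < thn[0]:
        # Very low noise and no speech: invalid voice signal.
        return 0
    if n_total >= thn[-1]:
        # Noise at or above the upper limit: fall back to the default item.
        return d_highnoise
    return None
```

A caller could treat `None` as a cue to apply a finer test, or to skip to the next frame when a single-rule configuration is preferred.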
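FIG. 6 is a schematic flowchart of the voice detection method in Embodiment 6 of this application. As shown in FIG. 6, the method includes:
S601. The sub-band generation module processes the current time-domain signal frame to obtain a plurality of sub-band time-domain signals.
S602. The energy calculation module calculates the signal amplitude of the sub-band time-domain signals in the current time-domain signal frame, and the noise calculation module calculates the noise amplitude of the sub-band time-domain signals in the current time-domain signal frame.
S603. The signal-to-noise ratio (SNR) of the sub-band time-domain signals in the current time-domain signal frame is calculated from their noise amplitude and signal amplitude.
In this embodiment, the SNR is calculated with formula (7) below.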
SNR_m(n) = S_m(n) / N_m(n)    (7)
In formula (7), SNR_m(n) denotes the SNR of the m-th sub-band time-domain signal of the n-th time-domain signal frame.
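S604. Whether the current time-domain signal frame is a valid voice signal is determined from the total noise amplitude of the current time-domain signal frame and the SNR of the sub-band time-domain signals.
In this embodiment, step S604 may specifically include determining whether the current time-domain signal frame is a valid voice signal from the SNR of its sub-band time-domain signals and the SNR levels.
In this embodiment, as formula (7) shows, the SNR of the n-th time-domain signal frame is closely related to the total noise amplitude. Multiple noise energy levels are set for the total noise amplitude; correspondingly, multiple SNR levels can also be set, with a mapping between the noise energy levels and the SNR levels, so as to determine whether the n-th time-domain signal frame is a valid voice signal.
Illustratively, in a specific application scenario, SNR levels thsnr(k), k = 1, …, K, are defined for the sub-band SNR SNR_m in correspondence with the noise energy levels thn(k), where K is the number of levels. In this embodiment the noise energy levels correspond to the SNR levels. For example, if the noise energy levels thn(1) to thn(K) are ordered from smallest to largest, with thn(1) the lower limit and thn(K) the upper limit, then the SNR levels thsnr(1) to thsnr(K) may be ordered from largest to smallest, with thsnr(1) the upper limit and thsnr(K) the lower limit. A smaller noise energy level thus corresponds to a larger SNR level, and a larger noise energy level to a smaller SNR level. In other words, the number of noise energy levels equals the number of SNR levels, and the higher the noise energy level (the larger the index k), the smaller the value of the corresponding SNR level. The actual values of the SNR levels can be set flexibly according to the application scenario so as to avoid misjudging valid voice signals. Specifically, there are the following cases: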
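(1) If the total noise amplitude of the current time-domain signal frame is less than or equal to the lower limit of the noise energy levels, it is determined whether the SNR of the sub-band time-domain signals of the current time-domain signal frame is greater than or equal to the upper limit of the SNR levels. If it is, the current time-domain signal frame is determined to be a valid voice signal; otherwise, it is determined to be an invalid voice signal.
In a specific implementation, for example, if N_t(n) < thn(1), it is determined whether the SNR of the sub-band time-domain signals of the current time-domain signal frame is greater than or equal to the upper limit of the SNR levels. If the SNR SNR_m(n) of the n-th time-domain signal frame is greater than or equal to thsnr(1), the current time-domain signal frame is determined to be a valid voice signal; otherwise, it is determined to be an invalid voice signal.
(2) If the total noise amplitude of the current time-domain signal frame is greater than or equal to the upper limit of the noise energy levels, it is determined whether the SNR of the sub-band time-domain signals of the current time-domain signal frame is greater than or equal to the lower limit of the SNR levels. If it is, the current time-domain signal frame is determined to be a valid voice signal; otherwise, it is determined to be an invalid voice signal.
In a specific implementation, for example, if N_t(n) > thn(K), it is determined whether the SNR of the sub-band time-domain signals of the current time-domain signal frame is greater than or equal to the lower limit of the SNR levels, thsnr(K). If the SNR SNR_m(n) of the n-th time-domain signal frame is greater than or equal to thsnr(K), the current time-domain signal frame is determined to be a valid voice signal; otherwise, it is determined to be an invalid voice signal.
(3) If the total noise amplitude of the current time-domain signal frame is greater than or equal to a middle threshold of the noise energy levels, it is determined whether the SNR of the sub-band time-domain signals of the current time-domain signal frame is greater than or equal to the corresponding middle threshold of the SNR levels. If it is, the current time-domain signal frame is determined to be a valid voice signal; otherwise, it is determined to be an invalid voice signal.
In a specific implementation, let thn(q), 1 < q < K, be a middle threshold of the noise energy levels; thn(q) may be any noise energy level between thn(1) and thn(K). If thn(q-1) < N_t(n) ≤ thn(q), 1 < q < K, it is determined whether the SNR of the sub-band time-domain signals of the current time-domain signal frame is greater than or equal to the corresponding middle threshold of the SNR levels, thsnr(q-1), which corresponds to the noise energy level thn(q-1). If the SNR SNR_m(n) of the n-th time-domain signal frame is greater than or equal to thsnr(q-1), the current time-domain signal frame is determined to be a valid voice signal; otherwise, it is determined to be an invalid voice signal. In this embodiment, a middle threshold of the noise energy levels can be regarded as any one of the thresholds among the noise energy levels. Alternatively, in this embodiment, if thn(q-1) < N_t(n) ≤ thn(q), 1 < q < K, the SNR of the sub-band time-domain signals of the current time-domain signal frame may instead be compared with the middle threshold thsnr(q), which corresponds to the noise energy level thn(q). Comparing the SNR against a larger SNR level when the noise is low, and against a smaller SNR level when the noise is high, allows a more accurate decision on whether the frame is a valid voice signal.
In effect, the above process first determines the noise energy level that N_t(n) falls into, then uses that comparison result to select the SNR level thsnr(q) corresponding to the noise energy level, and compares the sub-band SNRs SNR_m(n) of the frame against thsnr(q). If the SNR SNR_m(n) of any sub-band time-domain signal in the n-th time-domain signal frame is greater than the corresponding SNR level thsnr(q), the n-th time-domain signal frame is determined to be a valid voice signal.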
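The level-selection logic of cases (1) to (3) can be sketched as follows. Several details are assumptions for illustration only: the function name, the `eps` guard against a zero noise amplitude (which can occur when the initial amplitudes are 0), the use of thsnr(q-1) rather than thsnr(q) in the middle case, and the rule that a single sub-band reaching the threshold is enough, following the summary in the preceding paragraph.

```python
from bisect import bisect_left


def vad_from_snr(signal_amps, noise_amps, thn, thsnr, eps=1e-12):
    """Decide VAD(n) from sub-band SNRs and the total noise amplitude.

    thn   -- noise energy levels, ascending:  thn[0] = thn(1), thn[-1] = thn(K)
    thsnr -- SNR levels, descending:          thsnr[0] = thsnr(1), the upper limit
    eps   -- hypothetical floor that avoids division by zero in formula (7)
    """
    if not thn or len(thn) != len(thsnr):
        raise ValueError("thn and thsnr must be non-empty and of equal length")
    n_total = sum(noise_amps)  # total noise amplitude N_t(n)
    if n_total <= thn[0]:
        threshold = thsnr[0]    # case (1): low noise, strictest SNR level
    elif n_total >= thn[-1]:
        threshold = thsnr[-1]   # case (2): high noise, most lenient SNR level
    else:
        # Case (3): find i with thn[i-1] < N_t(n) <= thn[i] (0-based),
        # then use the SNR level paired with the lower bound, thsnr(q-1).
        i = bisect_left(thn, n_total)
        threshold = thsnr[i - 1]
    # Formula (7): SNR_m(n) = S_m(n) / N_m(n), one value per sub-band.
    snrs = [s / max(n, eps) for s, n in zip(signal_amps, noise_amps)]
    return 1 if any(snr >= threshold for snr in snrs) else 0
```

Switching the middle case to the alternative pairing described above only requires replacing `thsnr[i - 1]` with `thsnr[i]`.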
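On the basis of the above embodiments, if VAD(n-1) = 0 and VAD(n) = 1, the start of a valid voice signal has been detected, and the captured voice signal can be transmitted. To pass the voice signal to the next stage more completely, a portion of the historical voice signal may be buffered; when the start of speech is detected, the historical voice signal is fetched from the buffer and transmitted. This effectively moves the detection instant earlier and ensures that the low-amplitude portion at the very beginning of speech is not lost. The buffer size can be configured flexibly according to the application scenario. That is, once the start of a valid voice signal has been detected, the detected valid voice is buffered.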
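The history buffering described above can be sketched with a fixed-length queue. The class name, the `push` interface, and the policy of clearing the history once it has been forwarded are assumptions made for this example; the description only states that a configurable amount of history is kept and sent when VAD goes from 0 to 1.

```python
from collections import deque


class PreRollBuffer:
    """Keep the most recent non-speech frames and release them at speech onset."""

    def __init__(self, max_frames):
        if max_frames < 1:
            raise ValueError("max_frames must be at least 1")
        # deque(maxlen=...) silently drops the oldest frame when full.
        self._history = deque(maxlen=max_frames)
        self._prev_vad = 0

    def push(self, frame, vad):
        """Return the frames to forward downstream for this step."""
        if vad == 1 and self._prev_vad == 0:
            # Onset: VAD(n-1) = 0 and VAD(n) = 1, so send history first.
            out = list(self._history) + [frame]
            self._history.clear()
        elif vad == 1:
            out = [frame]
        else:
            self._history.append(frame)
            out = []
        self._prev_vad = vad
        return out
```

With `max_frames=2`, three silent frames followed by a speech frame forward the last two silent frames plus the speech frame, which is the "earlier detection instant" effect described in the text.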
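FIG. 5 is a schematic structural diagram of the voice processing chip in Embodiment 5 of this application. As shown in FIG. 5, the chip includes a voice detection device and a processor. The voice detection device includes a sub-band generation module, an energy calculation module, a noise calculation module, and a voice activity detection module. The sub-band generation module is configured to process the current time-domain signal frame to obtain a plurality of sub-band time-domain signals. The energy calculation module is configured to calculate the signal amplitude of the sub-band time-domain signals in the current time-domain signal frame. The noise calculation module is configured to calculate the noise of the sub-band time-domain signals. The voice activity detection module, when determining from the amplitudes of the plurality of sub-band time-domain signals of the current time-domain signal frame whether the current time-domain signal frame is a valid voice signal, specifically makes the determination from the noise and the signal amplitude of the sub-band time-domain signals. The processor is configured to recognize the valid voice signal and perform voice control according to the recognition result. For further illustrative explanation of the voice detection device in this embodiment, refer to the embodiments above.
It should be noted here that the various specific detection approaches, conditions, and branches described in the above embodiments need not all appear in the same embodiment. Depending on the application scenario, the technical solution may be configured for only one of them. For example, when the total signal amplitude and the total noise amplitude are used to determine whether the current time-domain signal frame is a valid voice signal: if a decision can be made from them, it is made directly; if not, processing jumps directly to the next time-domain signal frame, or the frame is handled simply by means of the default configuration item described above. This saves power and reduces technical complexity.
For a detailed description of each structural unit in the voice detection device, refer to the embodiments of FIG. 1 to FIG. 4 above.
In addition, in the above embodiments, a determination of a valid voice signal may indicate that a voice signal from a signal source of interest is present, and a determination of an invalid voice signal may indicate that no voice signal from a signal source of interest is present.
An embodiment of this application further provides an electronic apparatus that includes the voice processing chip of any embodiment of this application.
In addition, the specific formulas given in the above embodiments are merely examples and not exclusive limitations; a person of ordinary skill in the art may vary them without departing from the idea of this application.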
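The above technical solutions of the embodiments of this application can be used on various types of electronic apparatus, which exist in many forms, including but not limited to:
(1) Mobile communication devices: these devices have mobile communication functions and mainly aim to provide voice and data communication. Such terminals include smartphones (for example, iPhone), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing functions, and generally also support mobile Internet access. Such terminals include PDA, MID, and UMPC devices, for example iPad.
(3) Portable entertainment devices: these devices can display and play multimedia content. They include audio and video players (for example, iPod), handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.
(4) Other electronic devices with data interaction functions.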
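Specific embodiments of the subject matter have now been described. Other embodiments fall within the scope of the appended claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing may be advantageous.
The systems, devices, modules, or units described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as divided into various units by function. When this application is implemented, the functions of the units may of course be implemented in one or more pieces of software and/or hardware.
Those skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.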
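This application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. This application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The above are merely embodiments of this application and are not intended to limit it. Various modifications and changes to this application are possible for those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within the scope of the claims of this application.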

Claims (39)

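  1. A voice detection method, characterized by comprising:
    processing a current time-domain signal frame to obtain a plurality of sub-band time-domain signals; and
    determining, according to amplitudes of the plurality of sub-band time-domain signals of the current time-domain signal frame, whether the current time-domain signal frame is a valid voice signal.
  2. The method according to claim 1, wherein processing the current time-domain signal frame to obtain the plurality of sub-band time-domain signals comprises: filtering the current time-domain signal frame through a filter bank to obtain the plurality of sub-band time-domain signals.
  3. The method according to claim 1, wherein determining, according to the amplitudes of the plurality of sub-band time-domain signals of the current time-domain signal frame, whether the current time-domain signal frame is a valid voice signal comprises:
    calculating, according to the amplitudes of the plurality of sub-band time-domain signals of the current time-domain signal frame, a signal amplitude and a noise amplitude of the sub-band time-domain signals in the current time-domain signal frame; and
    determining, according to the noise amplitude and the signal amplitude of the sub-band time-domain signals in the current time-domain signal frame, whether the current time-domain signal frame is a valid voice signal.
  4. The method according to claim 3, wherein calculating, according to the amplitudes of the plurality of sub-band time-domain signals of the current time-domain signal frame, the signal amplitude of the sub-band time-domain signals in the current time-domain signal frame comprises: calculating, according to the plurality of sub-band time-domain signals of the current time-domain signal frame, an average amplitude of the sub-band time-domain signals in the current time-domain signal frame; and calculating the signal amplitude of the sub-band time-domain signals in the current time-domain signal frame according to the average amplitude of the sub-band time-domain signals in the current time-domain signal frame.
  5. The method according to claim 4, wherein calculating the signal amplitude of the sub-band time-domain signals in the current time-domain signal frame according to the average amplitude comprises: using the average amplitude of the sub-band time-domain signals in the current time-domain signal frame to represent the signal amplitude of the sub-band time-domain signals.
  6. The method according to claim 4, wherein calculating the signal amplitude of the sub-band time-domain signals in the current time-domain signal frame according to the average amplitude comprises: calculating the signal amplitude of the sub-band time-domain signals in the current time-domain signal frame according to the average amplitude of the sub-band time-domain signals in the current time-domain signal frame and an amplitude smoothing value.
  7. The method according to claim 6, wherein calculating the signal amplitude of the sub-band time-domain signals in the current time-domain signal frame comprises: determining the amplitude smoothing value according to an amplitude smoothing coefficient and the signal amplitude of a previous time-domain signal frame.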
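  8. The method according to any one of claims 3 to 7, wherein calculating the noise amplitude of the sub-band time-domain signals comprises: calculating the noise amplitude of the sub-band time-domain signals in the current time-domain signal frame according to the signal amplitude of the sub-band time-domain signals in the current time-domain signal frame.
  9. The method according to claim 8, wherein calculating the noise amplitude of the sub-band time-domain signals comprises: when the signal amplitude of an N-th sub-band time-domain signal in the current time-domain signal frame is greater than the noise amplitude of the N-th sub-band time-domain signal in the previous time-domain signal frame, calculating the noise amplitude of the N-th sub-band time-domain signal according to the signal amplitude of the N-th sub-band time-domain signal in the current time-domain signal frame and noise smoothing values, wherein the N-th sub-band time-domain signal is any one of the sub-band time-domain signals, and N is an integer greater than 0.
  10. The method according to claim 9, wherein calculating the noise amplitude of the sub-band time-domain signals comprises: determining the noise smoothing values according to noise smoothing coefficients together with the noise amplitude and the signal amplitude of the previous time-domain signal frame, respectively.
  11. The method according to claim 8, wherein calculating the noise amplitude of the sub-band time-domain signals comprises: when the signal amplitude of an N-th sub-band time-domain signal in the current time-domain signal frame is less than or equal to the noise amplitude of the N-th sub-band time-domain signal in the previous time-domain signal frame, using the signal amplitude of the N-th sub-band time-domain signal in the current time-domain signal frame directly as the noise amplitude of the N-th sub-band time-domain signal, wherein the N-th sub-band time-domain signal is any one of the sub-band time-domain signals, and N is an integer greater than 0.
  12. The method according to any one of claims 3 to 11, wherein calculating the signal amplitude of the sub-band time-domain signals in the current time-domain signal frame comprises: calculating a total signal amplitude of the current time-domain signal frame according to the signal amplitudes of the sub-band time-domain signals in the current time-domain signal frame; calculating the noise amplitude of the sub-band time-domain signals comprises: calculating a total noise amplitude of the current time-domain signal frame according to the noise amplitudes of the sub-band time-domain signals; and determining, according to the noise amplitude and the signal amplitude of the sub-band time-domain signals, whether the current time-domain signal frame is a valid voice signal comprises: determining, according to the total noise amplitude and the total signal amplitude, whether the current time-domain signal frame is a valid voice signal.
  13. The method according to claim 12, wherein determining, according to the noise amplitude and the signal amplitude of the sub-band time-domain signals, whether the current time-domain signal frame is a valid voice signal comprises: if the total noise amplitude and the total signal amplitude are both less than a lower limit of noise energy levels, determining that the current time-domain signal frame is an invalid voice signal.
  14. The method according to claim 12, wherein determining, according to the noise amplitude and the signal amplitude of the sub-band time-domain signals, whether the current time-domain signal frame is a valid voice signal comprises: if the total noise amplitude is greater than or equal to an upper limit of the noise energy levels, determining according to a default configuration item whether the current time-domain signal frame is a valid voice signal.
  15. The method according to claim 13 or 14, further comprising: calculating a signal-to-noise ratio (SNR) of the sub-band time-domain signals of the current time-domain signal frame according to the noise amplitudes and the signal amplitudes of the plurality of sub-band time-domain signals of the current time-domain signal frame; wherein determining, according to the amplitudes of the plurality of sub-band time-domain signals of the current time-domain signal frame, whether the current time-domain signal frame is a valid voice signal comprises: determining, according to the total noise amplitude of the current time-domain signal frame and the SNR of the sub-band time-domain signals of the current time-domain signal frame, whether the current time-domain signal frame is a valid voice signal.
  16. The method according to claim 15, wherein determining, according to the total noise amplitude of the current time-domain signal frame and the SNR of the sub-band time-domain signals of the current time-domain signal frame, whether the current time-domain signal frame is a valid voice signal comprises: if the total noise amplitude of the current time-domain signal frame is less than or equal to the lower limit of the noise energy levels, determining whether the SNR of the sub-band time-domain signals of the current time-domain signal frame is greater than or equal to an upper limit of SNR levels; and if the SNR of the sub-band time-domain signals of the current time-domain signal frame is greater than or equal to the upper limit of the SNR levels, determining that the current time-domain signal frame is a valid voice signal, and otherwise determining that it is an invalid voice signal.
  17. The method according to claim 15, wherein determining, according to the total noise amplitude of the current time-domain signal frame and the SNR of the sub-band time-domain signals of the current time-domain signal frame, whether the current time-domain signal frame is a valid voice signal comprises: if the total noise amplitude of the current time-domain signal frame is greater than or equal to the upper limit of the noise energy levels, determining whether the SNR of the sub-band time-domain signals of the current time-domain signal frame is greater than or equal to a lower limit of the SNR levels; and if the SNR of the sub-band time-domain signals of the current time-domain signal frame is greater than or equal to the lower limit of the SNR levels, determining that the current time-domain signal frame is a valid voice signal, and otherwise determining that it is an invalid voice signal.
  18. The method according to claim 15, wherein determining, according to the total noise amplitude of the current time-domain signal frame and the SNR of the sub-band time-domain signals of the current time-domain signal frame, whether the current time-domain signal frame is a valid voice signal comprises: if the total noise amplitude of the current time-domain signal frame is greater than or equal to a middle threshold of the noise energy levels, determining whether the SNR of the sub-band time-domain signals of the current time-domain signal frame is greater than or equal to a corresponding middle threshold of the SNR levels; and if the SNR of the sub-band time-domain signals of the current time-domain signal frame is greater than or equal to the middle threshold of the SNR levels, determining that the current time-domain signal frame is a valid voice signal, and otherwise determining that it is an invalid voice signal.
  19. The method according to any one of claims 1 to 18, further comprising: after it is determined that a valid voice signal has started to be detected, buffering the detected valid voice.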
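  20. A voice detection device, characterized by comprising a sub-band generation module and a voice activity detection module, wherein the sub-band generation module is configured to process a current time-domain signal frame to obtain a plurality of sub-band time-domain signals, and the voice activity detection module is configured to determine, according to amplitudes of the plurality of sub-band time-domain signals of the current time-domain signal frame, whether the current time-domain signal frame is a valid voice signal.
  21. The device according to claim 20, wherein the sub-band generation module is a filter bank.
  22. The device according to claim 20, further comprising an energy calculation module and a noise calculation module, wherein the energy calculation module is configured to calculate, according to the amplitudes of the plurality of sub-band time-domain signals of the current time-domain signal frame, a signal amplitude of the sub-band time-domain signals in the current time-domain signal frame; and the noise calculation module is configured to calculate, according to the amplitudes of the plurality of sub-band time-domain signals of the current time-domain signal frame, a noise amplitude of the sub-band time-domain signals in the current time-domain signal frame, so that whether the current time-domain signal frame is a valid voice signal is determined according to the noise amplitude and the signal amplitude of the sub-band time-domain signals in the current time-domain signal frame.
  23. The device according to claim 22, wherein the energy calculation module comprises an energy calculation unit configured to calculate, according to the plurality of sub-band time-domain signals of the current time-domain signal frame, an average amplitude of the sub-band time-domain signals in the current time-domain signal frame, and to calculate the signal amplitude of the sub-band time-domain signals in the current time-domain signal frame according to the average amplitude of the sub-band time-domain signals in the current time-domain signal frame.
  24. The device according to claim 23, wherein the energy calculation unit is further configured to use the average amplitude of the sub-band time-domain signals in the current time-domain signal frame to represent the signal amplitude of the sub-band time-domain signals.
  25. The device according to claim 23, wherein the energy calculation unit is further configured to calculate the signal amplitude of the sub-band time-domain signals in the current time-domain signal frame according to the average amplitude of the sub-band time-domain signals in the current time-domain signal frame and an amplitude smoothing value.
  26. The device according to claim 25, wherein the energy calculation unit is further configured to determine the amplitude smoothing value according to an amplitude smoothing coefficient and the signal amplitude of a previous time-domain signal frame.
  27. The device according to any one of claims 22 to 26, wherein the noise calculation module is further configured to calculate the noise amplitude of the sub-band time-domain signals in the current time-domain signal frame according to the signal amplitude of the sub-band time-domain signals in the current time-domain signal frame.
  28. The device according to claim 27, wherein the noise calculation module is further configured to, when the signal amplitude of an N-th sub-band time-domain signal in the current time-domain signal frame is greater than the noise amplitude of the N-th sub-band time-domain signal in the previous time-domain signal frame, calculate the noise amplitude of the N-th sub-band time-domain signal according to the signal amplitude of the N-th sub-band time-domain signal in the current time-domain signal frame and noise smoothing values, wherein the N-th sub-band time-domain signal is any one of the sub-band time-domain signals, and N is an integer greater than 0.
  29. The device according to claim 28, wherein the noise calculation module is further configured to determine the noise smoothing values according to noise smoothing coefficients together with the noise amplitude and the signal amplitude of the previous time-domain signal frame, respectively.
  30. The device according to claim 27, wherein the noise calculation module is further configured to, when the signal amplitude of an N-th sub-band time-domain signal in the current time-domain signal frame is less than or equal to the noise amplitude of the N-th sub-band time-domain signal in the previous time-domain signal frame, use the signal amplitude of the N-th sub-band time-domain signal in the current time-domain signal frame directly as the noise amplitude of the N-th sub-band time-domain signal, wherein the N-th sub-band time-domain signal is any one of the sub-band time-domain signals, and N is an integer greater than 0.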
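  31. The device according to any one of claims 22 to 30, wherein the energy calculation module is further configured to calculate a total signal amplitude of the current time-domain signal frame according to the signal amplitudes of the sub-band time-domain signals in the current time-domain signal frame, the noise calculation module is further configured to calculate a total noise amplitude of the current time-domain signal frame according to the noise amplitudes of the sub-band time-domain signals, and the voice activity detection module is further configured to determine, according to the total noise amplitude and the total signal amplitude, whether the current time-domain signal frame is a valid voice signal.
  32. The device according to claim 31, wherein the voice activity detection module is further configured to determine that the current time-domain signal frame is an invalid voice signal if the total noise amplitude and the total signal amplitude are both less than a lower limit of noise energy levels.
  33. The device according to claim 31, wherein the voice activity detection module is further configured to determine, according to a default configuration item, whether the current time-domain signal frame is a valid voice signal if the total noise amplitude is greater than or equal to an upper limit of the noise energy levels.
  34. The device according to claim 32 or 33, further comprising a signal-to-noise ratio (SNR) calculation module configured to calculate an SNR of the sub-band time-domain signals of the current time-domain signal frame according to the noise amplitudes of the plurality of sub-band time-domain signals of the current time-domain signal frame, wherein the voice activity detection module is further configured to determine, according to the total noise amplitude of the current time-domain signal frame and the SNR of the sub-band time-domain signals of the current time-domain signal frame, whether the current time-domain signal frame is a valid voice signal.
  35. The device according to claim 34, wherein if the total noise amplitude of the current time-domain signal frame is less than or equal to the lower limit of the noise energy levels, it is determined whether the SNR of the sub-band time-domain signals of the current time-domain signal frame is greater than or equal to an upper limit of SNR levels; and if the SNR of the sub-band time-domain signals of the current time-domain signal frame is greater than or equal to the upper limit of the SNR levels, the voice activity detection module determines that the current time-domain signal frame is a valid voice signal, and otherwise determines that it is an invalid voice signal.
  36. The device according to claim 34, wherein if the total noise amplitude of the current time-domain signal frame is greater than or equal to the upper limit of the noise energy levels, it is determined whether the SNR of the sub-band time-domain signals of the current time-domain signal frame is greater than or equal to a lower limit of the SNR levels; and if the SNR of the sub-band time-domain signals of the current time-domain signal frame is greater than or equal to the lower limit of the SNR levels, the voice activity detection module determines that the current time-domain signal frame is a valid voice signal, and otherwise determines that it is an invalid voice signal.
  37. The device according to claim 34, wherein if the total noise amplitude of the current time-domain signal frame is greater than or equal to a middle threshold of the noise energy levels, it is determined whether the SNR of the sub-band time-domain signals of the current time-domain signal frame is greater than or equal to a corresponding middle threshold of the SNR levels; and if the SNR of the sub-band time-domain signals of the current time-domain signal frame is greater than or equal to the corresponding middle threshold of the SNR levels, the voice activity detection module determines that the current time-domain signal frame is a valid voice signal, and otherwise determines that it is an invalid voice signal.
  38. A voice processing chip, characterized by comprising a voice detection device and a processor, wherein the voice detection device comprises a sub-band generation module and a voice activity detection module, the sub-band generation module is configured to process a current time-domain signal frame to obtain a plurality of sub-band time-domain signals, the voice activity detection module is configured to determine, according to amplitudes of the plurality of sub-band time-domain signals of the current time-domain signal frame, whether the current time-domain signal frame is a valid voice signal, and the processor is configured to recognize the valid voice signal and perform voice control according to the recognition result.
  39. An electronic apparatus, characterized by comprising the voice processing chip according to claim 19.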
PCT/CN2019/092361 2019-06-21 2019-06-21 Voice detection method, voice detection device, voice processing chip and electronic apparatus Ceased WO2020252782A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
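PCT/CN2019/092361 WO2020252782A1 (zh) 2019-06-21 2019-06-21 Voice detection method, voice detection device, voice processing chip and electronic apparatus
CN201980001072.9A CN110431625B (zh) 2019-06-21 2019-06-21 Voice detection method, voice detection device, voice processing chip and electronic apparatus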
EP19933225.5A EP3800640B1 (en) 2019-06-21 2019-06-21 Voice detection method, voice detection device, voice processing chip and electronic apparatus
US17/034,096 US11322174B2 (en) 2019-06-21 2020-09-28 Voice detection from sub-band time-domain signals

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/092361 WO2020252782A1 (zh) 2019-06-21 2019-06-21 Voice detection method, voice detection device, voice processing chip and electronic apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/034,096 Continuation US11322174B2 (en) 2019-06-21 2020-09-28 Voice detection from sub-band time-domain signals

Publications (1)

Publication Number Publication Date
WO2020252782A1 true WO2020252782A1 (zh) 2020-12-24

Family

ID=68419103

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/092361 Ceased WO2020252782A1 (zh) 2019-06-21 2019-06-21 Voice detection method, voice detection device, voice processing chip and electronic apparatus

Country Status (4)

Country Link
US (1) US11322174B2 (zh)
EP (1) EP3800640B1 (zh)
CN (1) CN110431625B (zh)
WO (1) WO2020252782A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103903634A (zh) * 2012-12-25 2014-07-02 中兴通讯股份有限公司 激活音检测及用于激活音检测的方法和装置
CN106098076A (zh) * 2016-06-06 2016-11-09 成都启英泰伦科技有限公司 一种基于动态噪声估计时频域自适应语音检测方法
US20170206908A1 (en) * 2014-10-06 2017-07-20 Conexant Systems, Inc. System and method for suppressing transient noise in a multichannel system

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE19716862A1 (de) * 1997-04-22 1998-10-29 Deutsche Telekom Ag Sprachaktivitätserkennung
US6718301B1 (en) * 1998-11-11 2004-04-06 Starkey Laboratories, Inc. System for measuring speech content in sound
EP1729287A1 (en) * 1999-01-07 2006-12-06 Tellabs Operations, Inc. Method and apparatus for adaptively suppressing noise
US6453291B1 (en) * 1999-02-04 2002-09-17 Motorola, Inc. Apparatus and method for voice activity detection in a communication system
JP2005520211A (ja) * 2002-03-05 2005-07-07 アリフコム ノイズ抑制システムと共に用いるための発声活動検出(vad)デバイスおよび方法
US8326620B2 (en) * 2008-04-30 2012-12-04 Qnx Software Systems Limited Robust downlink speech and noise detector
KR101437830B1 (ko) * 2007-11-13 2014-11-03 삼성전자주식회사 음성 구간 검출 방법 및 장치
CN101599269B (zh) * 2009-07-02 2011-07-20 中国农业大学 语音端点检测方法及装置
CN102117618B (zh) * 2009-12-30 2012-09-05 华为技术有限公司 一种消除音乐噪声的方法、装置及系统
EP2561508A1 (en) * 2010-04-22 2013-02-27 Qualcomm Incorporated Voice activity detection
US9047878B2 (en) * 2010-11-24 2015-06-02 JVC Kenwood Corporation Speech determination apparatus and speech determination method
US20120265526A1 (en) * 2011-04-13 2012-10-18 Continental Automotive Systems, Inc. Apparatus and method for voice activity detection
CN104424956B9 (zh) * 2013-08-30 2022-11-25 中兴通讯股份有限公司 激活音检测方法和装置
US9524735B2 (en) * 2014-01-31 2016-12-20 Apple Inc. Threshold adaptation in two-channel noise estimation and voice activity detection
US10360926B2 (en) * 2014-07-10 2019-07-23 Analog Devices Global Unlimited Company Low-complexity voice activity detection
CN105261375B (zh) * 2014-07-18 2018-08-31 中兴通讯股份有限公司 激活音检测的方法及装置
US9672841B2 (en) * 2015-06-30 2017-06-06 Zte Corporation Voice activity detection method and method used for voice activity detection and apparatus thereof
US10090005B2 (en) * 2016-03-10 2018-10-02 Aspinity, Inc. Analog voice activity detection

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103903634A (zh) * 2012-12-25 2014-07-02 中兴通讯股份有限公司 激活音检测及用于激活音检测的方法和装置
US20170206908A1 (en) * 2014-10-06 2017-07-20 Conexant Systems, Inc. System and method for suppressing transient noise in a multichannel system
CN106098076A (zh) * 2016-06-06 2016-11-09 成都启英泰伦科技有限公司 一种基于动态噪声估计时频域自适应语音检测方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3800640A4 *

Also Published As

Publication number Publication date
EP3800640B1 (en) 2024-10-16
CN110431625B (zh) 2023-06-23
US11322174B2 (en) 2022-05-03
US20210012792A1 (en) 2021-01-14
EP3800640A4 (en) 2021-09-29
EP3800640A1 (en) 2021-04-07
CN110431625A (zh) 2019-11-08


Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2019933225

Country of ref document: EP

Effective date: 20201229

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19933225

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE