US20190259400A1 - Microphone unit comprising integrated speech analysis
- Publication number
- US20190259400A1 (application US16/380,106)
- Authority
- US
- United States
- Prior art keywords
- microphone unit
- data
- digital
- speech feature
- audio signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
- G10L19/0019—
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers, loudspeakers or microphones
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/02—Casings; Cabinets ; Supports therefor; Mountings therein
- H04R1/04—Structural association of microphone with electric circuitry therefor
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R19/00—Electrostatic transducers
- H04R19/04—Microphones
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers, loudspeakers or microphones
- H04R3/005—Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R2201/00—Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
- H04R2201/003—Mems transducers or their use
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R2499/00—Aspects covered by H04R or H04S not otherwise provided for in their subgroups
- H04R2499/10—General applications
- H04R2499/11—Transducers incorporated or for use in hand-held devices, e.g. mobile phones, PDA's, camera's
Description
- This disclosure relates to reducing the data bit rate on the interface of digital microphones, for example to minimise power consumption in Always-On Voice modes, while still passing enough information to allow downstream keyword detection or speech recognition functions.
- Audio functionality is becoming increasingly prevalent in portable devices. Such functionality is present not only in devices such as phones that are reliant on audio technology, but also in other wearable equipment or devices that may be controlled by voice, for instance voice-responsive toys such as listening-talking teddy bears. Such devices, including phones, will spend little of their time actually transmitting speech, yet one or possibly more microphones may be permanently enabled, listening out for some voice command. Even a wearable accessory may be continuously on, awaiting a voice command, and will have little space for a battery, or may rely on some solar or mechanical energy harvesting, and so has severe power consumption requirements in a continuous standby mode as well as in a low-duty-cycle operating mode.
- Microphone transducer and amplifier technology has improved, but generally a microphone package needs to drive its output signal some distance.
- Digital transmission offers advantages including noise immunity, but the usual formats for transmission of digital data from microphones are not particularly efficient in terms of signal line activity and the consequent power consumed in charging parasitic capacitances through a supply voltage at every logic level transition.
- In a portable device such as a phone or tablet, containing one or more digital microphones, the digital microphone signal may have some distance to travel from the microphone to a centralised smart codec chip or the like, along a ribbon cable, or flex, or even across a densely populated printed circuit board. Even worse are applications where the microphone may be in a headset or earbuds or some acoustically desirable position on the user's clothing, distant from the handset or the main module of a distributed device.
- However, even when the device is largely otherwise inactive, there may be sophisticated signal processing to be performed, for example speaker recognition during voice-triggered wake-up, so solutions such as grossly degrading the resolution of the ADC therein may lead to unacceptable downstream processing results.
- There is thus a requirement to reduce the power consumed in sending digital microphone data across a wired digital transmission link, while still conveying enough useful information in the transmitted signal to allow downstream functions such as speech recognition.
- FIG. 1 illustrates a conventional digital microphone 10 communicating with a smart codec 22 in a host device 20, for example a phone.
- FIG. 2 illustrates the operating waveforms in a conventional digital microphone interface.
- A host device 20 transmits a clock CLK, typically at a frequency such as 3 MHz, to the microphone 10, which uses this to clock an ADC 12 and to clock out from a digital buffer interface Dout 14 a 1-bit oversampled delta-sigma stream DAT representing the acoustic signal input Px to the microphone transducer 16 providing the ADC input. Power is consumed in the system by the host 20 transmitting this clock signal CLK, and in particular by the microphone in sending a data stream DAT with an average 1.5 MHz transition rate.
- Power may be reduced by operating at a lower clock rate, say 768 kHz, but this greatly increases the in-band quantisation noise or, conversely, restricts the usable bandwidth for a particular noise level. Even this only reduces the power by a factor of 4, so the power consumption is still significant, particularly in larger form factor devices or with long cable runs.
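- As a rough illustration of the scale of the saving at stake, the following sketch compares average line-transition rates for the formats discussed here and below. Dynamic power on a wire scales roughly with the transition rate, the 0.5 transitions-per-bit figure follows from the 1.5 MHz average transition rate quoted above for a 3 MHz PDM stream, and the feature-frame numbers anticipate the MFCC example given later in this disclosure.

```python
# Illustrative comparison of average line activity for the interface formats
# discussed in this document. Dynamic power on a wire scales roughly with
# C * V^2 * f_transition, so the transition rate is a reasonable proxy.

def transitions_per_second(bit_rate_hz: float, transitions_per_bit: float) -> float:
    """Average logic-level transition rate on a serial line."""
    return bit_rate_hz * transitions_per_bit

# 1-bit delta-sigma (PDM) at a 3 MHz clock: the text quotes an average
# 1.5 MHz transition rate, i.e. about 0.5 transitions per bit.
pdm_3m = transitions_per_second(3e6, 0.5)
pdm_768k = transitions_per_second(768e3, 0.5)   # reduced-rate PDM

# A feature stream of the kind described later: 15 channels x 12 bits every
# 30 ms = 180 bits per frame, i.e. 6 kb/s.
feature_rate = 15 * 12 / 30e-3
features = transitions_per_second(feature_rate, 0.5)

print(f"PDM @ 3 MHz   : {pdm_3m / 1e6:.2f} M transitions/s")
print(f"PDM @ 768 kHz : {pdm_768k / 1e6:.2f} M transitions/s")
print(f"Feature frames: {features / 1e3:.2f} k transitions/s")  # ~500x below 3 MHz PDM
```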
- Transmitting a delta-sigma stream is notably less efficient in terms of data bit rate and transition rate than transmitting a serial multi-bit pulse-code-modulated stream, but the latter generally requires an additional clock wire to transmit clocks to mark the start of each multi-bit word.
- Secondly, we note that an unfortunate side effect of reducing the delta-sigma sample clock rate may be to limit the bandwidth usable, in terms of background quantisation noise, to say 8 kHz rather than say 20 kHz. This may increase the word error rate (WER) for Voice Key Word detection (VKD). This may in turn lead to a higher incidence of false positives, and the system may spend more time in its awake mode, thus significantly affecting the average complete system power consumption.
- Additionally, there is a prevalent requirement for functions requiring even more accurate input audio data streams, such as speaker identification, as part of a voice-triggered wake-up function. It is known that using a wider bandwidth for speaker identification captures more speech signal components and thus relaxes the need for a high signal-to-noise ratio (SNR) (e.g. relaxes the need for low acoustic background noise, or carefully optimised microphone placement) to achieve high enough accuracy for biometric purposes. Even in a high SNR environment, a relatively wide signal bandwidth may improve speaker verification accuracy. This is at odds with the concept of reducing the frequency of the digital microphone clock to reduce the power consumption.
- According to a first aspect of the invention, there is provided a microphone unit, comprising:
- a transducer, for generating an electrical audio signal from a received acoustic signal;
- a speech coder, for obtaining compressed speech data from the audio signal; and
- a digital output, for supplying digital signals representing said compressed speech data.
- In embodiments of the invention, the microphone unit comprises a packaged microphone, for example a MEMS microphone, with on-chip or co-packaged integrated speech coder circuitry.
- This circuitry transmits data out of the package on a PCB trace, or possibly a headset cable, to downstream circuitry that may perform more complex functions such as speech recognition, the transmitted data representing speech information coded in a speech-compressed format at a low bit rate, to reduce the power consumed in physically transmitting the data.
- In this disclosure, uncompressed data can be regarded as a numeric representation of samples in a uniformly sampled system, where the in-band signal is an approximation, in the audio band, of the audio input waveform, whereas compressed data is typically derived from uncompressed data in such a way that the digital stream no longer directly represents the above, and has a lower bit rate.
- Speech coding is an application of data compression of digital audio signals containing speech. Speech coding uses speech-specific parameter estimation using audio signal processing techniques to model the speech signal, and may be combined with generic data compression algorithms to represent the resulting modelled parameters in a compact bitstream.
- Thus, compressed speech data may be (usually digital) data representing an audio signal in terms of speech-specific parameters calculated from the signal. For example, this may be the signal energy in a set of non-uniformly spaced frequency bins, or may use sub-band coding, via say ADPCM of each sub-band. Data compression techniques may then be applied to these time-varying parameters, for example recoding scalars or vectors according to some codebook.
- As examples, embodiments of the invention may use any speech compression standard, for example one using MDCT, MDCT-Hybrid subband, CELP, ACELP, Two-Stage Noise Feedback Coding (TSNFC), VSELP, RPE-LTP, LPC, Transform coding, or MLT, with suitable examples being AAC, AC-3, ALAC, ALS, AMBE, AMR, AMR-WB, AMR-WB+, apt-X, ATRAC, BroadVoice, CELT, Codec2, Enhanced AC-3, FLAC, any of the group of G.7xx standards, GSM-FR, iLBC, iSAC, Monkey's Audio, MP2, MP3, Musepack, Nellymoser Asao, Opus, Shorten, SILK, Siren 7, Speex, SVOPC, TTA, TwinVQ, Vorbis, WavPack, or Windows Media Audio.
- FIG. 1 illustrates an audio processing system.
- FIG. 2 illustrates signals in the audio processing system of FIG. 1.
- FIG. 3 illustrates a system comprising a host device and an accessory.
- FIG. 4 illustrates an audio processing system.
- FIG. 5 illustrates a part of a microphone unit.
- FIG. 6 illustrates a part of a microphone unit.
- FIG. 7 illustrates a part of a microphone unit.
- FIG. 8 illustrates a compressive speech coder.
- FIG. 9 illustrates an audio processing system.
- FIG. 10 illustrates an audio processing system.
- FIG. 11 illustrates a part of a microphone unit in the audio processing system of FIG. 10.
- FIG. 3 shows an audio system, as just one example of a system using the methods described herein.
- Specifically, FIG. 3 shows a device 30, which in this example takes the form of a smartphone or tablet computer.
- The methods described herein may be used with any device, but are described with reference to a particular example in which the device is a portable communications device.
- Thus, in this example, the host device 30 has audio processing capability.
- FIG. 3 shows an audio input 32, near which there is located a microphone, within the body of the device 30 and therefore not visible in FIG. 3. In other devices, there may be multiple microphones.
- FIG. 3 also shows an accessory device 34, which in this example takes the form of a pair of earphones, but which may be any device, in particular any audio accessory device.
- In this example, the pair of earphones has two earpieces 36, 38, each of which includes a speaker for reproducing sound in response to audio signals transferred from the host device 30.
- Each of the earpieces 36, 38 also includes at least one microphone, for example for detecting ambient noise in the vicinity of the wearer.
- Signals representing the ambient sound are then transferred from the earphones to the host device 30.
- The host device may then perform various functions. For example, the host device may perform a noise cancellation function, using an algorithm to generate anti-noise signals that it transfers to the earphones for playback.
- The effect of playing back the anti-noise signals is that the level of ambient noise heard by the wearer is reduced, and the wanted sounds (music, speech, or the like) that are also being transferred from the host device 30 are therefore more audible.
- The accessory device 34 in this example also includes a microphone 40 that is positioned near to the user's mouth when the earphones are worn.
- The microphone 40 is thus suitable for detecting the user's speech.
- The accessory device 34 may be connected to the host device 30 by means of a cable 42.
- The cable 42 is detachable from at least one of the portable communications device and the audio accessory.
- In some embodiments, the cable 42 is permanently attached to the accessory device 34, and may be provided with a jack 44 to allow mechanical and electrical connection to, or disconnection from, the host device via a socket 46 provided on the host device.
- The cable may be in any suitable format.
- The host device 30 includes circuitry for receiving signals from the microphone or microphones within the body of the device 30, and/or from the microphones in the earpieces 36, 38, and/or the microphone 40.
- The circuitry may for example include a codec 52, audio DSP, or other processing circuitry, which in turn may be connected to circuitry within the host device 30 such as an applications processor, and/or may be connected to a remote processor.
- For example, the processing circuitry may be able to perform a speech processing function, such as recognising the presence of a trigger phrase in a speech input received by one or more of the microphones, identifying a speaker of the speech input, and/or recognising the content of a spoken command, in order to be able to control the host device or another connected device on the basis of the user's spoken command.
- FIG. 4 shows an embodiment with a microphone unit 50 having a digital transmission format and method for communication with a downstream smart codec 52, audio DSP, or other processing circuitry.
- The microphone unit 50 comprises a transducer 54, an analogue-to-information converter (AIC) 56 and a digital output driver 58.
- The analogue-to-information converter 56 may take several forms. It is well known that brute-force digitisation of an audio signal is grossly inefficient in terms of the useful information conveyed, or usually required, as interpreted by say the human ear and brain or some machine equivalent. The basic concept is to extract features of the audio signal that may be particularly useful when interpreted downstream, as illustrated by the data stream Fx in FIG. 4. A digital interface 58 then transmits a data stream FDAT carrying this coded speech signal to the codec 52.
- A clock recognition block 60 in the codec 52 recovers some clock from the incoming data, and then a feature processing block 62 operates on the received feature information to perform functions such as voice activity detection or speaker recognition, delivering appropriate flags VDet to downstream processing circuitry, or to control or configure some further or subsequent processing of its own.
- The codec 52 may comprise a clock generation circuit 66, or may receive a system clock from elsewhere in the host device.
- In some embodiments, the AIC 56 is asynchronous or self-timed in operation, so does not require a clock, and the data transmission may then also be asynchronous, as may at least the early stages of processing of the feature data received by the codec. The AIC may comprise an asynchronous ADC, for instance an Asynchronous Delta-Sigma Modulator (ADSM), followed by other analogue asynchronous circuitry or self-timed logic circuitry for digital signal processing.
- Alternatively, the microphone may generate its own clock if required by the chosen AIC circuit structure or FDAT data format.
- The microphone unit may receive at least a low-frequency clock from the codec, or from elsewhere such as the system real-time clock, for use in synchronising or tuning its internal clock generator using, say, locked-loop techniques.
- The feature data to be transmitted may typically be a frame produced at nominally say 30 Hz or 10 Hz, and the design of any speech processing function, say speech recognition, may have to accommodate a wide range of pitches and spoken word rates.
- The voice recognition mode does not need an accurate or low-jitter sampling clock, so an on-chip uncalibrated low-power clock 64 may be more than adequate.
- The data may be transmitted as a frame or vector of data at some relatively high bit rate, such that there is a transitionless interval before each next frame.
- In some embodiments, a microphone unit comprises a transducer and a feature extraction block.
- The transducer may comprise a MEMS microphone, with the MEMS microphone and the feature extraction block being provided in a single integrated circuit.
- The microphone unit may comprise a packaged microphone, for example a MEMS microphone, with on-chip or co-packaged integrated speech coder circuitry or feature extraction block.
- This speech coder circuitry may transmit data out of the package on a PCB trace, or possibly a cable such as a headset cable, to downstream circuitry that may perform more complex functions such as speech recognition, the transmitted data representing speech information coded in a speech-compressed format at a low bit rate to reduce the power consumed in physically transmitting the data.
- FIG. 5 illustrates one embodiment of an AIC 56, in which the analog input signal is presented to an ADC 70, for example a 1-bit delta-sigma ADC, clocked by a sample clock CKM of nominally 768 kHz.
- The delta-sigma data stream Dx is then passed to a Decimator, Window block, and Framer 72, which decimates the data to a sample rate of say 16 ks/s, and applies suitable windowing and framing for presentation to an FFT block 74, which derives a set of Fourier coefficients representing the power (or magnitude) of the signal in each one of a set of equally spaced frequency bins.
- This spectral information is then passed through a mel-frequency filter bank 76 to provide estimates of the signal energy in each of a set of non-equally-spaced frequency bands.
- This set of energy estimates may itself be used as the output.
- In some embodiments, each of these energy estimates is passed through a log block 78 to compand the estimate, and then through a Discrete Cosine Transform block 80 to provide cepstral coefficients, known as Mel-Frequency Cepstral Coefficients (MFCC).
- In one example, the output cepstral coefficients comprise 15 channels of 12-bit words at a frame period of 30 ms, thus reducing the data rate from the original 1-bit delta-sigma rate of 3 Mb/s or 768 kb/s to 6 kb/s.
- FIG. 6 illustrates another embodiment of an AIC 56, with some extra functional blocks in the signal path compared with FIG. 5. In some other embodiments, not all of these blocks may be present.
- The analog input signal from the transducer element 90 is presented to an ADC 92, for example a 1-bit delta-sigma ADC, clocked by a sample clock CKM of nominally 768 kHz generated by a local clock generator 94, which may be synchronised to a system 32 kHz Real-Time Clock for instance, or which may be independent.
- The delta-sigma data stream Dx is then decimated in a decimator 96 to a sample rate of, say, 16 ks/s.
- The decimated data may then pass through a pre-emphasis block 98, comprising a high-pass filter, to spectrally equalise the speech signal, which is most likely dominated by low-frequency components.
- This step may also be advantageous in reducing the effect of low-frequency background noise, for instance wind noise or mechanical acoustic background noise.
- There may also be a frequency-dependent noise reduction block at this point, as discussed below, to reduce noise in the frequency bands where it is most apparent.
- The signal may then be passed to a windowing block 100, which may apply say a Hamming window, or possibly some other windowing function, to extract short-duration frames, say of time duration 10 ms to 50 ms, over each of which the speech may be considered stationary.
- The windowing block extracts a stream of short-duration frames by sliding the Hamming window along the speech signal by say half the frame length, for example sliding a 25 ms window by 10 ms, thus providing a frame of windowed data at a frame rate of 100 frames per second.
- An FFT block 102 then performs a Fast Fourier Transform (FFT) on the set of windowed samples of each frame, providing a set of Fourier coefficients representing the power (or magnitude) of the signal in each one of a set of equally spaced frequency bins.
- Each of these frame-by-frame sets of signal spectral components is then processed by a Mel filter bank 104, which maps and combines these linearly spaced spectral components onto frequency bins distributed to correspond more closely to the non-linear frequency sensitivity of the human ear, with a greater density of bins at low frequencies than at high frequencies. For instance, there may be 23 such bins, each with a triangular band-pass response, with the lowest frequency channel centred at 125 Hz and spanning 125 Hz, while the highest frequency channel is centred at 3657 Hz and spans 656 Hz. In some embodiments, other numbers of channels, or other non-linear frequency scales such as the Bark scale, may be employed.
- A log block 106 then applies a log scaling to the energy reported from each mel-frequency bin. This helps reduce the sensitivity to very loud or very quiet sounds, in a similar way to the non-linear amplitude sensitivity of human hearing.
- The logarithmically compressed bin energies are then passed as a set of samples to a Discrete Cosine Transform (DCT) block 108, which applies a DCT to each set.
- This serves to separate the slowly varying spectral envelope (or vocal tract) information from the faster varying speech excitation.
- The former is more useful in speech recognition, so the higher coefficients may be discarded. However, in some embodiments these may be preserved, or possibly combined by weighted addition, to provide at least some measure of energy for higher frequencies, to aid in distinguishing sibilants or to provide more clues for speaker identification.
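- The feature-extraction path of FIGS. 5 and 6 can be summarised in a short sketch. The following is a minimal illustration, not the patent's implementation: it assumes the example parameters from the text (16 ks/s input, 25 ms Hamming window with a 10 ms hop, 23 mel bands, lower cepstral coefficients kept), and the triangular mel filter-bank construction is a generic recipe rather than the exact bin layout quoted above.

```python
import numpy as np

# Minimal sketch of the FIG. 6 feature path: pre-emphasis, framing/windowing,
# FFT, mel filter bank, log companding, DCT.

FS = 16000
FRAME = 400        # 25 ms at 16 ks/s
HOP = 160          # 10 ms -> 100 frames per second
N_FFT = 512
N_MEL = 23
N_CEPS = 13        # keep the lower (envelope) cepstral coefficients

def mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def imel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mel=N_MEL, n_fft=N_FFT, fs=FS, fmin=125.0, fmax=3985.0):
    # Triangular filters with centres equally spaced on the mel scale.
    pts = imel(np.linspace(mel(fmin), mel(fmax), n_mel + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_mel, n_fft // 2 + 1))
    for i in range(n_mel):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc(x):
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])          # pre-emphasis (high-pass)
    n = 1 + max(0, (len(x) - FRAME) // HOP)
    frames = np.stack([x[i * HOP: i * HOP + FRAME] for i in range(n)])
    frames *= np.hamming(FRAME)                          # windowing
    power = np.abs(np.fft.rfft(frames, N_FFT)) ** 2      # FFT power spectrum
    mel_e = power @ mel_filterbank().T                   # mel-band energies
    log_e = np.log(np.maximum(mel_e, 1e-10))             # log companding
    # DCT-II across the mel bands; keep the lower coefficients.
    k = np.arange(N_MEL)
    basis = np.cos(np.pi * np.outer(np.arange(N_CEPS), (2 * k + 1) / (2 * N_MEL)))
    return log_e @ basis.T                               # (n_frames, N_CEPS)

feats = mfcc(np.random.randn(FS))   # one second of audio -> ~100 frames
print(feats.shape)
```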
- The higher-order coefficients may be generated in parallel with the lower ones.
- The DCT block 108 may also provide further output data.
- One component output may be the sum of all the log energies from each channel, though this may also be derived by a parallel total energy estimator EST 110 fed from un-pre-emphasised data.
- There may also be a dynamic coefficient generator, which may generate further coefficients based on the first-order or second-order frame-to-frame differences of the coefficients.
- An equaliser (EQ) block 112 may adaptively equalise the various components relative to a flat spectrum, for instance using an LMS algorithm.
- The data rate may be further reduced by a Data Compressor (DC) block 114, possibly exploiting the redundancy or correlation between the coefficients that is expected due to the nature of speech signals.
- For example, the data compressor may use split vector quantisation to compress the MFCC vectors, as sketched below.
- Feature vectors of dimension say 14 may be split into pairs of sub-vectors, each quantised to say 5 or 6 bits with a respective codebook, at a frame period of 10 ms. This may reduce the data rate to 4.4 kb/s or lower, say 1.5 kb/s if a 30 ms frame period is used.
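- A hedged sketch of the split vector quantisation just described: random codebooks stand in for ones that would really be trained offline (for example by k-means on a corpus of feature vectors), and serve only to show the mechanics and the resulting bit rate.

```python
import numpy as np

# Split vector quantisation of a 14-dimensional feature vector: the vector is
# split into 7 pairs, and each pair is coded as a 6-bit index into its own
# 64-entry codebook of dimension-2 codewords.

rng = np.random.default_rng(0)
DIM, PAIRS, BITS = 14, 7, 6
codebooks = rng.standard_normal((PAIRS, 2 ** BITS, 2))   # per-pair codebooks

def svq_encode(vec):
    """Return one 6-bit index per sub-vector pair (nearest codeword)."""
    pairs = vec.reshape(PAIRS, 2)
    return [int(np.argmin(np.sum((cb - p) ** 2, axis=1)))
            for cb, p in zip(codebooks, pairs)]

def svq_decode(indices):
    return np.concatenate([codebooks[i][j] for i, j in enumerate(indices)])

frame_bits = PAIRS * BITS                  # 42 bits per frame
print(frame_bits / 10e-3 / 1e3, "kb/s")    # ~4.2 kb/s at a 10 ms frame period
print(frame_bits / 30e-3 / 1e3, "kb/s")    # ~1.4 kb/s at a 30 ms frame period
```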
- The data compressor may also employ other standard data compression techniques.
- The outgoing data stream may be considered compressed speech data in that the output data has been compressed from the input signal in a manner particularly suited to speech, communicating the information-bearing parameters of a speech waveform, rather than by general-purpose techniques of signal digitisation and of compressing arbitrary data streams.
- The output data may be transmitted simply using two wires, one carrying the data (for example 180 bits every 30 ms, in the example of FIG. 5), and the second carrying a sync pulse or edge every 30 ms.
- The extra power of this low-rate clock line is negligible compared with the already low power consumption of the data line.
- Such a two-wire link may be used between a microphone in the body of a device, such as a mobile phone, and a codec or similar on a circuit board inside the phone.
- Standard data formats such as Soundwire™ or Slimbus™ may be used, or standard three-wire interfaces such as I2S.
- Alternatively, a one-wire serial interface may be employed, transmitting the data in a recurring predefined sequence of frames, in which a unique sync pattern may be sent at the start of every frame of words and recovered by simple, low-power data and clock recovery circuitry in the destination device.
- The clock is preferably a low-power clock inside the microphone, whose precise frequency and jitter are unimportant, since the feature data is nowhere near as clock-critical as full-resolution PCM.
- Nibbles of data may be sent using a pulse-length-modulated (PLM) one-wire or two-wire format, such as that disclosed in published US Patent Application US 2013/0197920 A1.
- The data may be sent as a sequence of pulses with a fixed leading edge, with the length of each pulse denoting the binary value, as sketched below.
- The fixed leading edge makes clock recovery simple.
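- The following is an illustrative model of such pulse-length modulation, not the format of the referenced application: nibbles are sent one per fixed-length slot, and the assumed slot length and pulse-width mapping are arbitrary choices.

```python
# Sketch of pulse-length-modulated (PLM) nibble coding: each 4-bit value is
# sent as one pulse whose rising edge occurs on a fixed grid, and whose
# width (in timebase units) encodes the value.

SLOT = 20          # timebase units per nibble slot (assumed)

def plm_encode(nibbles):
    """Return a list of 0/1 line samples, one pulse per nibble."""
    line = []
    for n in nibbles:
        width = 2 + n                     # widths 2..17 encode values 0..15
        line += [1] * width + [0] * (SLOT - width)
    return line

def plm_decode(line):
    """Recover nibbles by measuring each pulse width from its fixed edge."""
    out = []
    for i in range(0, len(line), SLOT):
        out.append(sum(line[i:i + SLOT]) - 2)   # width minus the 2-unit offset
    return out

data = [0x3, 0xF, 0x0, 0xA]
assert plm_decode(plm_encode(data)) == data
```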
- Some slots in the outgoing data stream structure may be reserved for identification or control functions.
- Occasional data bit errors in the feature data may not have a serious impact.
- However, it may be desirable to protect at least the control data with some error-detection and/or correction scheme, e.g. based on Cyclic Redundancy Check bits embedded in the stream.
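- For example, control words could carry CRC bits computed as below; the CRC-8 polynomial 0x07 is an illustrative choice, not one specified in this disclosure.

```python
# Protecting embedded control data with Cyclic Redundancy Check bits.

def crc8(data: bytes, poly: int = 0x07) -> int:
    """MSB-first CRC-8 with zero initial value and no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

control_word = bytes([0x42, 0x01])             # hypothetical control payload
sent = control_word + bytes([crc8(control_word)])

# Receiver check: the CRC over payload-plus-checksum is zero when error-free.
assert crc8(sent) == 0
```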
- The speech coding, by reducing the data rate and thus the average transition rate on the physical bus, may greatly reduce the power consumption of the system.
- This power saving may be offset somewhat by the power consumed by the speech coding itself, but this processing would otherwise have had to be performed somewhere in the system in any case, in order to provide the keyword detection, speaker recognition, or more general speech recognition function.
- Also, with decreasing transistor sizes, the power required to perform a given digital computation task is falling rapidly with time.
- The MFCC generation method may be modified, for instance by raising the log-mel amplitudes (generated by the block 78 in the embodiment shown in FIG. 5, or the block 106 in the embodiment shown in FIG. 6) to a suitable power (around 2 or 3) before taking the DCT (in the block 80 in the embodiment shown in FIG. 5, or the block 108 in the embodiment shown in FIG. 6), which reduces the influence of low-energy components.
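- Relative to the mfcc() sketch given earlier, this modification is a one-line change applied to the log-mel energies before the DCT; the sign-preserving handling of negative log values below is an assumption of the sketch, since the text does not say how they are treated.

```python
import numpy as np

# Raise the log-mel energies to a power of around 2 or 3 before the DCT,
# de-emphasising low-energy components.

def emphasised(log_e: np.ndarray, p: float = 2.5) -> np.ndarray:
    # Log-mel values can be negative, so apply the power to the magnitude
    # and restore the sign (an assumption of this sketch).
    return np.sign(log_e) * np.abs(log_e) ** p
```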
- The parameters of the feature extraction may be modified according to a detected or estimated signal-to-noise ratio, or other signal- or noise-related parameter associated with the input signal. For instance, the number and centre frequency of the cepstral frequency bins over which the mel-frequency energy is extracted may be modified.
- The cepstral coding block may comprise, or be preceded by, a noise reduction block, for instance directly after a decimation block 72 or 96, or after a pre-emphasis block 98 that may already have removed some low-frequency noise, or operating on the windowed framed data produced by block 100.
- This noise reduction block may be enabled when necessary by a noise detection block.
- The noise detection block may be analog, and monitor the input signal Ax, or it may be digital, and operate on the ADC output Dx.
- The noise detection block may raise a flag when the level or spectrum or other characteristic of the received signal implies a high noise level, or when the ratio of peak or average signal to noise falls below a threshold.
- The noise reduction circuitry may act to filter the signal so as to suppress frequency bins where the noise (as monitored, using a Voice Activity Detector, during time periods when there appears to be no voice) is likely to exceed the signal at times when a signal is present.
- A Wiener filter set-up may be used to suppress noise on a frame-by-frame basis, as sketched below.
- The Wiener filter coefficients may be updated on a frame-by-frame basis, and the coefficients smoothed via a mel-frequency filter bank followed by an Inverse Discrete Cosine Transform, before application to the actual signal.
- The Wiener noise reduction may comprise two stages. Each stage may incorporate some dynamic noise enhancement feature, where the level of noise reduction performed is dependent on an estimated signal-to-noise ratio or other signal- or noise-related parameter or feature of the signal.
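- A single-stage, frame-by-frame Wiener suppressor can be sketched as follows. This simplified version omits the mel/IDCT coefficient smoothing and the two-stage structure described above, and the noise power estimate is assumed to come from VAD-flagged non-speech frames.

```python
import numpy as np

# Frame-by-frame Wiener noise suppression: a gain per frequency bin is
# computed from a noise power estimate and applied to the noisy spectrum.

def wiener_gain(noisy_power, noise_power, floor=0.1):
    """H = SNR / (SNR + 1), with noise power estimated during non-speech."""
    snr = np.maximum(noisy_power / np.maximum(noise_power, 1e-12) - 1.0, 0.0)
    return np.maximum(snr / (snr + 1.0), floor)

def denoise_frame(frame, noise_power, n_fft=512):
    spec = np.fft.rfft(frame, n_fft)
    gain = wiener_gain(np.abs(spec) ** 2, noise_power)
    return np.fft.irfft(spec * gain, n_fft)[: len(frame)]

rng = np.random.default_rng(3)
# Noise power spectrum, e.g. averaged over frames the VAD marks as speech-free.
noise_power = np.full(257, 4.0)
frame = (np.sin(2 * np.pi * 440 * np.arange(400) / 16000)
         + 0.1 * rng.standard_normal(400))
clean = denoise_frame(frame, noise_power)
```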
- MFCC data may actually be forwarded from the codec, for example in an ETSI-standard MFCC form, for further signal processing, either within the host device or transmitted to remote servers for processing "in the cloud". The latter may reduce the data bandwidth required for transmission, and may be used to preserve speech quality in poor transmission conditions.
- In some cases, the microphone may be required to deliver a more traditional output signal, digitising the instantaneous input audio signal in say a 16-bit format at say 16 ks/s or 48 ks/s.
- FIG. 7 illustrates a microphone unit 130, which may operate in a plurality of modes, with various degrees and methods of signal coding or compression. Thus, FIG. 7 shows several different functional blocks. In some other embodiments, only a subset of these blocks is present.
- The analog input signal from a transducer element 132 is presented to an ADC 134, for example a 1-bit delta-sigma ADC, and the resulting delta-sigma data stream Dx is then passed to one or more functional blocks, as described below.
- The ADC may be clocked by a sample clock CKM, which may be generated by a local clock generator 136, or which may be received on a clock input 138, according to the operating mode.
- For example, the microphone unit may operate in a first, low power mode, in which it uses an internally generated clock and provides compressed speech data, and a second, higher power mode, in which it receives an external clock and provides uncompressed data.
- The operating mode may be controlled by the downstream control processor via signals received on a control input terminal 140. These inputs may be separate, or may be provided by making the digital output line bi-directional. In some embodiments, the operating mode may be determined autonomously by circuitry in the microphone unit.
- A control block 142 receives the control inputs and determines which of the functional blocks are to be active, as sketched below.
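- In software terms, the control arrangement amounts to a router from mode to functional block. The mode names and the stub coders below are illustrative assumptions standing in for the hardware blocks 144, 150, 152 and 154 of FIG. 7.

```python
from enum import Enum

# Sketch of the FIG. 7 control arrangement: the control block routes the ADC
# stream Dx through one of several coding paths to the output driver.

class Mode(Enum):
    PDM = 0          # raw PDM pass-through; external clock; highest power
    FEATURES = 1     # feature extraction (e.g. MFCC); internal clock; lowest power
    COMPRESSIVE = 2  # compressive (sparse) sampling
    LOSSY = 3        # lossy coding such as ADPCM

def pdm_format(dx):          return dx                    # block 144 (stub)
def feature_extract(dx):     return ("mfcc", dx[:15])     # block 150 (stub)
def compressive_sample(dx):  return ("sparse", dx[::64])  # block 152 (stub)
def lossy_encode(dx):        return ("adpcm", dx[::4])    # block 154 (stub)

ROUTES = {Mode.PDM: pdm_format, Mode.FEATURES: feature_extract,
          Mode.COMPRESSIVE: compressive_sample, Mode.LOSSY: lossy_encode}

def control_block(mode: Mode, dx):
    """Select the active functional block, as the control block 142 does."""
    return ROUTES[mode](dx)

print(control_block(Mode.FEATURES, list(range(1024)))[0])   # -> 'mfcc'
```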
- FIG. 7 shows that the data stream Dx may be passed to a PDM formatting block 144, which allows the digitised time-domain output of the microphone to be output directly as a PDM stream.
- The output of the PDM formatting block 144 is passed to a multiplexer 146, operating under the control of the control block 142, and the multiplexer output is passed to a driver 148 for generating the digital output DAT.
- FIG. 7 also shows the data stream Dx being passed to a feature extraction block 150, for example for obtaining values based on non-linearly spaced frequency bins, for instance MFCC values.
- FIG. 7 also shows the data stream Dx being passed to a compressive sampling block 152, for example for deriving a sparse representation of the incoming signal.
- FIG. 7 also shows the data stream Dx being passed to a lossy compression block 154, for example for performing adaptive differential pulse-code modulation (ADPCM) or a similar form of coding.
- FIG. 7 also shows the data stream Dx being passed to a decimator 156.
- In some embodiments, the data stream Dx is also passed to a lossless coding block to provide a suitable output data stream.
- FIG. 7 shows the outputs of the compressive sampling block 152, lossy compression block 154 and decimator 156 being connected to respective data buffer memory blocks 158, 160, 162. These allow the higher quality data generated by these blocks to be stored. Then, if analysis of a lower power data stream suggests that there is a need, power can be expended in transmitting the stored higher quality data for further processing or examination that requires it.
- For example, analysis of a lower power data stream might suggest that the audio signal contains a trigger phrase being spoken by a recognised user of the device in a particular time period.
- The higher quality data for that period can then be used for downstream operations requiring better data, for example downstream speech recognition; a software model of this buffer-then-flush behaviour follows.
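- A minimal sketch, with assumed buffer sizes, of how high-quality data can be held locally while only the low-rate feature stream is sent, and flushed over the link only once a trigger is flagged:

```python
from collections import deque

# Rolling local store for high-quality samples (blocks 158-162): buffer while
# idle, and transmit the buffered history only when a trigger is detected.

class HistoryBuffer:
    def __init__(self, seconds=2.0, rate=16000):
        self.buf = deque(maxlen=int(seconds * rate))  # rolling window (assumed size)

    def push(self, samples):
        self.buf.extend(samples)

    def flush(self):
        """Called on a trigger: hand the buffered history to the link."""
        out = list(self.buf)
        self.buf.clear()
        return out

hq = HistoryBuffer()
hq.push([0] * 16000)                 # keep buffering while idle
trigger_detected = True              # e.g. flag from the low-power feature path
if trigger_detected:
    to_transmit = hq.flush()         # expend transmit power only now
```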
- FIG. 7 also shows the outputs of the feature extraction block 150, compressive sampling block 152, and lossy compression block 154 being output through respective pulse-length modulation (PLM) encoding blocks 164, 166, 168, and through the multiplexer 146, operating under the control of the control block 142, and the driver 148.
- FIG. 7 also shows the output of the decimator 156 being output through a pulse-code modulation (PCM) encoding block 170, and through the multiplexer 146, operating under the control of the control block 142, and the driver 148.
- The physical form of the transmitted output may differ according to which operating mode is selected. For instance, high-data-rate modes may be transmitted using low-voltage differential signalling for noise immunity, with the data scrambled to reduce emissions.
- In low-data-rate modes, the signal may be low bandwidth, and not so susceptible to noise, transmission line reflections, and suchlike, and is preferably unterminated, to save the power consumption associated with driving the termination resistance. In the lower power modes, the signal swing, i.e. the digital driver supply voltage, may be reduced.
- Circuit operation may also be altered according to the operating mode. For instance, in low-data-rate modes the speed requirements of the DSP operations may be modest, and the circuitry may thus be operated at a lower logic supply voltage, or at a divided master clock frequency, compared with when performing more complex operations in conjunction with higher rate coding.
- While the AIC or feature-extraction based schemes above may provide particularly efficient methods of encoding and transmitting the essential information in the audio signal, there may be a requirement for the microphone unit to be able to operate also so as to provide a more conventional data format, say for processing by local circuitry or for onward transmission for processing in the cloud, where such processing might not understand the more sophisticated signal representation, or where, for example, the current use case is recording music in high quality.
- Thus, in some embodiments, the microphone unit is operable in a first mode, in which feature extraction and/or data compression is performed, and in a second mode, in which (for example) a clock is supplied from the codec, and the unit operates in a manner similar to that shown in FIG. 1.
- The digital microphone unit may thus be capable of operating in at least two modes: ADC (analog-to-digital conversion) or AIC (analog-to-information conversion).
- In some embodiments, the microphone unit is operable in one mode to perform lossy low-bit-rate PCM coding.
- The unit may contain a lossy codec, such as an ADPCM coder, with a sample rate that in some embodiments may be selectable, for example between 8 ks/s and 24 ks/s.
- In some embodiments, the microphone unit has a coding block for performing μ-law and/or A-law coding, or coding to some other telephony standard.
- The microphone unit in some embodiments has coding blocks for MDCT, MDCT-Hybrid subband, CELP, ACELP, Two-Stage Noise Feedback Coding (TSNFC), VSELP, RPE-LTP, LPC, Transform coding, or MLT coding.
- In some embodiments, the microphone unit is operable in a mode in which it outputs compressively sampled PCM data, or data from any other scheme that exploits signal sparseness.
- FIG. 8 illustrates an embodiment of a compressive speech coder that may be used in any of the embodiments described or illustrated herein.
- The output of an ADC 190 is passed through a decimator 192 to provide (for example) 12-bit data at 16 ks/s or 48 ks/s. This data is then sampled at an average sample rate of only say 48 Hz or 1 kHz, but with the sampling times randomised by a suitable random number generator or random pulse generator 194.
- Thus the sampling circuit samples the input signal at a sample rate less than the input signal bandwidth, and the sampling instants are caused to be distributed randomly in time, as sketched below.
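- A short sketch of such randomised sub-Nyquist sampling; the Bernoulli selection of sampling instants used here is one simple assumption for the random pulse generator, which this disclosure does not specify.

```python
import numpy as np

# Compressive sampler of the FIG. 8 kind: decimated PCM at 16 ks/s is
# re-sampled at an average rate of only ~1 kHz, at randomised instants.

rng = np.random.default_rng(1)
FS_IN, FS_AVG = 16000, 1000

x = rng.standard_normal(FS_IN)                 # one second of 16 ks/s data
keep = rng.random(FS_IN) < (FS_AVG / FS_IN)    # ~1000 randomly placed instants
times, values = np.nonzero(keep)[0], x[keep]   # (instant, sample) pairs

print(len(values), "samples/s on average")     # far below the Nyquist rate
```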
- FIG. 9 shows a system using such a compressive speech coder.
- The microphone unit 200, including a compressive ADC 202, is connected to supply very low data rate data to a codec 204.
- Downstream circuitry 206 may either perform a partial reconstruction (computationally cheap) to do sparse feature extraction in a low power mode, or a full reconstruction (computationally more expensive) to obtain Nyquist-type voice data for onward transmission.
- There may also be post-processing algorithm blocks, such as the block 208, for performing "sparse recognition", that are compatible with such compressive sampling formats.
- In sparse representation, a signal is matched to a linear combination of a few atoms from a pre-defined dictionary; these atoms may be obtained a priori by using machine learning techniques to learn an over-complete dictionary of primary signals (atoms) directly from data, so that the most relevant properties of the signals may be efficiently captured.
- Sparse extraction may have some benefits in performance of feature extraction in the presence of noise.
- The noise is not recognised as comprising any atom component, so does not appear in the encoded data.
- Such insensitivity to input noise may thus avoid unnecessary activation of downstream circuitry, and avoid the power consumption increasing in noisy environments relative to quiet environments. A minimal sketch of the sparse matching idea follows.
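- The following matching-pursuit sketch illustrates the sparse coding idea, using a random dictionary as a stand-in for one that would be learned offline from speech data; it is an illustration of the general technique, not the patent's algorithm.

```python
import numpy as np

# Matching pursuit: represent the signal as a linear combination of a few
# atoms from an over-complete dictionary D. Components that match no atom
# (such as unstructured noise) simply do not appear in the code.

rng = np.random.default_rng(2)
N, ATOMS, K = 256, 512, 5
D = rng.standard_normal((N, ATOMS))
D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms

def matching_pursuit(x, n_atoms=K):
    residual, code = x.astype(float), {}
    for _ in range(n_atoms):
        corr = D.T @ residual
        j = int(np.argmax(np.abs(corr)))       # best-matching atom
        code[j] = code.get(j, 0.0) + corr[j]
        residual -= corr[j] * D[:, j]          # remove its contribution
    return code, residual

# A signal that truly is a few atoms is captured almost exactly.
x = 3.0 * D[:, 7] - 2.0 * D[:, 99]
code, r = matching_pursuit(x)
top = sorted(code, key=lambda j: -abs(code[j]))[:2]
print(top, round(float(np.linalg.norm(r)), 3))  # atoms 7 and 99 dominate; tiny residual
```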
- FIG. 10 shows an embodiment in which a microphone unit 210 is connected to supply very low data rate data to a codec 212, and in which, to further reduce the power consumption, some, if not all, of the feature extraction is performed using Analogue Signal Processing (ASP).
- Thus, a signal from a microphone transducer is passed to an analog signal processor 214, then to one or more analog-to-digital converters 216, and then to an optional digital signal processor 218 in the microphone unit 210.
- Feature recognition 220 is then performed in the codec 212.
- FIG. 11 shows in more detail the processing inside an embodiment of the microphone unit 210, in which a large part of the signal processing is performed by analogue rather than digital circuitry.
- An input signal is passed through a plurality of band-pass filters 240, 242, 246 (three being shown in FIG. 11 purely by way of illustration).
- The band-pass filters are constant-Q, and equally spaced in mel frequency.
- The outputs are passed to log function blocks 248, 250, 252, which may be realised using standard analogue design techniques, for instance by converting the input signal voltage to a current and applying it to a two-port circuit with a logarithmic current-to-voltage conversion, such as a semiconductor diode.
- The outputs are then passed to a plurality of parallel ADCs 252, 254, 256.
- The ADCs may comprise voltage-controlled oscillators, whose frequency is used as a representation of their respective input signals. These are simple and low power, and their linearity is not important in this application. These simple ADCs may have significantly reduced power and area, even in total, compared with the main ADC. The state of the art for similar circuit blocks, in say the field of artificial cochleas, is below 20 microwatts.
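- A behavioural sketch of such a VCO-based ADC: the input sets the oscillator frequency, and counting cycles per frame yields the digital word. All parameter values below are assumptions for illustration.

```python
import numpy as np

# Behavioural model of a VCO-based ADC: the input voltage sets the VCO
# frequency, and counting its cycles over a frame gives a digital word
# proportional to the averaged input. Linearity is poor but, as noted
# above, unimportant in this application.

F0, KVCO, T_FRAME = 100e3, 50e3, 1e-3    # rest frequency, gain (Hz/V), frame

def vco_adc(v_in: np.ndarray, fs: float) -> list:
    """Count VCO cycles per frame; each count is one conversion result."""
    phase, counts, acc = 0.0, [], 0.0
    samples_per_frame = int(T_FRAME * fs)
    for i, v in enumerate(v_in, 1):
        phase += (F0 + KVCO * v) / fs     # cycles advanced this sample
        if i % samples_per_frame == 0:
            counts.append(int(phase - acc))
            acc = phase
    return counts

fs = 1e6
t = np.arange(int(0.01 * fs)) / fs
print(vco_adc(0.5 * np.sin(2 * np.pi * 300 * t), fs)[:5])  # counts around 100
```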
- The microphone, ADC, and speech coding circuitry may advantageously be located close together, to shorten the high-data-rate signal paths carrying the digital data before data-rate reduction. All three components may be packaged together. At least two of these three components may be co-integrated on an integrated circuit.
- The microphone may be a MEMS transducer, which may be capacitive, piezoelectric, or piezoresistive, and which may be co-integrated with at least the ADC.
- Embodiments of the above-described apparatus and methods may be, at least partly, implemented using programmable components rather than dedicated hardwired components.
- Embodiments of the apparatus and methods may be, at least partly, embodied as processor control code, for example on a non-transitory carrier medium such as a disk, CD- or DVD-ROM, programmed memory such as read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier.
- Embodiments of the invention may be implemented, at least partly, by a DSP (Digital Signal Processor), ASIC (Application Specific Integrated Circuit), or FPGA (Field Programmable Gate Array).
- The code may comprise conventional program code or microcode or, for example, code for setting up or controlling an ASIC or FPGA.
- The code may also comprise code for dynamically configuring re-configurable apparatus such as re-programmable logic gate arrays.
- The code may comprise code for a hardware description language, such as Verilog™ or VHDL (Very High Speed Integrated Circuit Hardware Description Language).
- The code may be distributed between a plurality of coupled components in communication with one another.
- The embodiments may also be implemented using code running on a field-(re-)programmable analogue array or similar device, in order to configure analogue hardware.
Abstract
Description
- This disclosure relates to reducing the data bit rate on the interface of digital microphones, for example to minimise power consumption in Always-On Voice modes, while still passing enough information to allow downstream keyword detection or speech recognition functions.
- Audio functionality is becoming increasingly prevalent in portable devices. Such functionality is present not only in devices such as phones that are reliant audio technology, but also in other wearable equipment or devices that may be controlled by voice, for instance voice-responsive toys such as listening-talking teddy bears. Such devices, including phones, will spend little of their time actually transmitting speech, yet one or possibly more microphones may be permanently enabled listening out for same voice command. Even a wearable accessory may be continuously on, awaiting a voice command, and will have little space for a battery, or may rely on some solar or mechanical energy harvesting, and so has severe power consumption requirements in a continuous standby mode as well as in a low-duty-cycle operating mode.
- Microphone transducer and amplifier technology has improved, but generally a microphone package needs to drive its output signal some distance. Digital transmission offers advantages including noise immunity, but the usual formats for transmission of digital data from microphones are not particularly efficient in terms of signal line activity and the consequent power consumed in charging parasitic capacitances though a supply voltage at every logic level transition.
- In a portable device such as a phone or tablet, containing one or more digital microphone, the digital microphone signal may have some distance to go from the microphone to a centralised smart codec chip or such, along a ribbon cable, or flex, or even across a densely populated printed circuit board. Even worse are applications where the microphone may be in a headset or earbuds or some acoustically desirable position on the user's clothing, distant from the handset or the main module of a distributed device.
- However, even when largely otherwise inactive, there may be sophisticated signal processing to be performed, for example speaker recognition during voice-triggered wake-up, so solutions such as grossly degrading the resolution of the ADC therein may lead to unacceptable downstream processing results.
- There is thus a requirement to reduce the power consumed in sending digital microphone data across a wired digital transmission link, while still conveying enough useful information in the transmitted signal to allow downstream function such as speech recognition.
-
FIG. 1 illustrates a conventionaldigital microphone 10 communicating with asmart codec 22 in ahost device 20, for example a phone, andFIG. 2 illustrates the operating waveforms in a conventional digital microphone interface. Ahost device 20 transmits a clock CLK, typically at a frequency such as 3 MHz, to themicrophone 10, which uses this to clock anADC 12 and to clock out from digital buffer interface Dout 14 a 1-bit oversampled delta-sigma stream DAT representing the acoustic signal input Px to themicrophone transducer 16 providing the ADC input. Power is consumed in the system by thehost 20 transmitting this clock signal CLK, and in particular by the microphone in sending a data stream DAT with an average 1.5 MHz transition rate. - Power may be reduced by operating at a lower clock rate, say 768 kHz, but this greatly increase the in-band quantisation noise or conversely restricts the usable bandwidth for a particular noise level, Even this only reduces the power by a factor of 4, so the power consumption is still significant, particularly in larger form factor devices or long cable runs.
- Transmitting a delta-sigma stream is notably less efficient in terms of data bit rate and transition rate than transmitting a serial multi-bit pulse-code-modulated stream, but the latter generally requires an additional clock wire to transmit clocks to mark the start of each multi-bit word.
- Secondly we note that an unfortunate side effect of reducing the delta-sigma sample clock rate may be to limit the bandwidth usable in terms of background quantisation noise to say 8 kHz rather than say 20 kHz. This may increase the word error rate (WER) for Voice Key Word detection (VKD). This may in turn lead to a higher incidence of false positives and the system may spend more time in its awake mode thus significantly affecting the average complete system power consumption.
- Additionally there is also a prevalent requirement for functions requiring even more accurate input audio data streams, such as speaker identification, as part of a voice-triggered wake-up function. It is known that using a wider bandwidth for the speaker identification captures more speech signal components and thus relaxes the need for high signal-to-noise (SNR) (e.g. relaxes the need for low acoustic background noise, or carefully optimised microphone placement) to get high enough accuracy for biometric purposes. Even in a high SNR environment a relatively wide signal bandwidth may improve speaker verification accuracy. This is at odds with the concept of reducing the frequency of the digital microphone clock to reduce the power consumption.
- According to a first aspect of the invention, there is provided a microphone unit, comprising:
-
- a transducer, for generating an electrical audio signal from a received acoustic signal;
- a speech coder, for obtaining compressed speech data from the audio signal; and
- a digital output, for supplying digital signals representing said compressed speech data.
- In embodiments of the invention, the microphone unit comprises a packaged microphone, for example a MEMS microphone, with on-chip or co-packaged integrated speech coder circuitry. This circuitry transmits data out of this package on a PCB trace or possibly headset cable to downstream circuitry that may perform more complex functions such as speech recognition, the transmitted data representing speech information coded in a speech-compressed format at a low bit rate to reduce the power consumed in physically transmitting the data.
- In this disclosure, uncompressed data can be regarded as a numeric representation of samples in an uniformly sampled system, where the in-band signal is an approximation, in the audio band, of the audio input waveform, whereas compressed data is typically derived from uncompressed data in such a way that the digital stream no longer directly represents the above, and has a lower bit rate.
- Speech coding is an application of data compression of digital audio signals containing speech. Speech coding uses speech-specific parameter estimation using audio signal processing techniques to model the speech signal, and may be combined with generic data compression algorithms to represent the resulting modelled parameters in a compact bitstream.
- Thus, compressed speech data may be [usually digital] data representing an audio signal in terms of speech-specific parameters calculated from the signal. For example, this may be the signal energy in a set of non-uniformly spaced frequency bins, or may use sub-band coding via say ADPCM of each sub-band. Data compression techniques may then be applied to these time-varying parameters, for example recoding scalars or vectors according to some codebook.
- As examples, embodiments of the invention may use any speech compression standard, for example one using MDCT, MDCT-Hybrid subband, CELP, ACELP, Two-Stage Noise Feedback Coding (TSNFC), VSELP, RPE-LTP, LPC, Transform coding, or MLT, with suitable examples being AAC, AC-3, ALAC, ALS, AMBE, AMr, AMR-WB, AMR-WB-F, apt-X, ATRAC, BroadVoice, CELT, Codec2, Enhanced AC-3, FLAC, any of the group of G.7xx standards, GSM-FR, iLBC, iSAC, Monkey's Audio, MP2, MP3, Musepack, Nellymoser Asao, Opus, Shorten, SILK, Siren 7, Speex, SVOPC, TTA, TwinVO, Vorbis, WavPack, or Windows Media Audio.
-
FIG. 1 illustrates an audio processing system. -
FIG. 2 illustrates signals in the audio processing system of FIG. 1 . -
FIG. 3 illustrates a system, comprising a host device and an accessory. -
FIG. 4 illustrates an audio processing system. -
FIG. 5 illustrates a part of a microphone unit. -
FIG. 6 illustrates a part of a microphone unit. -
FIG. 7 illustrates a part of a microphone unit. -
FIG. 8 illustrates a compressive speech coder. -
FIG. 9 illustrates an audio processing system. -
FIG. 10 illustrates an audio processing system. -
FIG. 11 illustrates a part of a microphone unit in the audio processing system of FIG. 10 . -
FIG. 3 shows an audio system, as just one example of a system using the methods described herein. - Specifically,
FIG. 3 shows a device 30, which in this example takes the form of a smartphone or tablet computer. The methods described herein may be used with any device, but are described herein with reference to a particular example in which the device is a portable communications device. Thus, in this example, the host device 30 has audio processing capability. -
FIG. 3 shows an audio input 32, near which there is located a microphone, within the body of the device 30 and therefore not visible in FIG. 3 . In other devices, there may be multiple microphones. FIG. 3 also shows an accessory device 34, which in this example takes the form of a pair of earphones, but which may be any device, in particular any audio accessory device. In this example, the pair of earphones has two earpieces 36, 38, each of which includes a speaker for reproducing sound in response to audio signals transferred from the host device 30. Each of the earpieces 36, 38 also includes at least one microphone, for example for detecting ambient noise in the vicinity of the wearer. - Signals representing the ambient sound are then transferred from the earphones to the
host device 30. The host device may then perform various functions. For example, the host device may perform a noise cancellation function using an algorithm and generate anti-noise signals that it transfers to the earphones for playback. The effect of playing back the anti-noise signals is that the level of ambient noise heard by the wearer is reduced, and the wanted sounds (music, speech, or the like) that are also being transferred from the host device 30 are therefore more audible. - The
accessory device 34 in this example also includes a microphone 40 that is positioned near to the user's mouth when the earphones are being worn. The microphone 40 is suitable for detecting the user's speech. The accessory device 34 may be connected to the host device 30 by means of a cable 42. The cable 42 may be detachable from at least one of the portable communications device and the audio accessory. In some embodiments, the cable 42 is permanently attached to the accessory device 34, and may be provided with a jack 44, to allow mechanical and electrical connection to or disconnection from the host device via a socket 46 provided on the host device. The cable may be in any suitable format. - The
host device 30 includes circuitry for receiving signals from the microphone or microphones within the body of the device 30 and/or from the microphones in the earpieces 36, 38, and/or the microphone 40. The circuitry may for example include a codec 52, audio DSP, or other processing circuitry, which in turn may be connected to circuitry within the host device 30 such as an applications processor, and/or may be connected to a remote processor. - For example, the processing circuitry may be able to perform a speech processing function, such as recognising the presence of a trigger phrase in a speech input received by one or more of the microphones, identifying a speaker of the speech input, and/or recognising the content of a spoken command in order to be able to control the host device or another connected device on the basis of the user's spoken command.
-
FIG. 4 shows an embodiment with a microphone unit 50 having a digital transmission format and method, for communication to a downstream smart codec 52, audio DSP, or other processing circuitry. - The
microphone unit 50 comprises a transducer 54, an analogue-to-information converter (AIC) 56 and a digital output driver 58. - The analogue-to-
information converter 56, or speech coder, or feature extraction block, may take several forms. It is well known that brute-force digitisation of an audio signal is grossly inefficient in terms of the useful information conveyed, or usually required, as interpreted by say the human ear and brain or some machine equivalent. The basic concept is to extract features of the audio signal that may be particularly useful when interpreted downstream, as illustrated by the data stream Fx in FIG. 4 . A digital interface 58 then transmits a data stream FDAT carrying this coded speech signal to the codec 52. In one embodiment, a clock recognition block 60 in the codec 52 recovers some clock from the incoming data, and then a feature processing block 62 operates on the received feature information to perform functions such as voice activity detection or speaker recognition, delivering appropriate flags VDet to downstream processing circuitry or to control or configure some further or subsequent processing of its own. The codec 52 may comprise a clock generation circuit 66, or may receive a system clock from elsewhere in the host device. - Preferably the
AIC 56 is asynchronous or self-timed in operation, so does not require a clock, and the data transmission may then also be asynchronous, as may at least the early stages of processing of the feature data received by the codec. The AIC may comprise an asynchronous ADC, for instance an Asynchronous Delta-Sigma Modulator (ADSM) followed by other analogue asynchronous circuitry or self-timed logic circuitry for digital signal processing. - However the microphone may generate its own clock if required by the chosen AIC circuit structure or FDAT data format.
- In some embodiments the microphone unit may receive at least a low-frequency clock from the codec or from elsewhere, such as the system real-time clock, for use in synchronising or tuning its internal clock generator using, say, locked-loop techniques. However, as will be discussed below, the feature data to be transmitted may typically be a frame produced at nominally say 30 Hz or 10 Hz, and the design of any speech processing function, say speech recognition, may have to accommodate a wide range of pitches and spoken word rates. Thus, in contrast to the use case where music has to be recorded with accurate pitch and where any jitter may lead to unmusical intermodulation, the voice recognition mode does not need an accurate or low-jitter sampling clock, so an on-chip uncalibrated low-power clock 64 may be more than adequate.
- In some embodiments, the data may be transmitted as a frame or vector of data at some relatively high bit rate, such that there is a transitionless interval before each subsequent frame.
- In all of the embodiments described herein, in which a microphone unit comprises a transducer and a feature extraction block, the transducer may comprise a MEMS microphone, with the MEMS microphone and the feature extraction block being provided in a single integrated circuit.
- The microphone unit may comprise a packaged microphone, for example a MEMS microphone, with on-chip or co-packaged integrated speech coder circuitry or feature extraction block.
- This speech coder circuitry, or feature extraction block, may transmit data out of the package on a PCB trace or possibly a cable such as a headset cable to downstream circuitry that may perform more complex functions such as speech recognition, the transmitted data representing speech information coded in a speech-compressed format at a low bit rate to reduce the power consumed in physically transmitting the data.
-
FIG. 5 illustrates one embodiment of an AIC 56, in which the analog input signal is presented to an ADC 70, for example a 1-bit delta-sigma ADC, clocked by a sample clock CKM of nominally 768 kHz. The delta-sigma data stream Dx is then passed to a Decimator, Window and Framer block 72 for decimating the data to a sample rate of say 16 ks/s, suitable windowing, and then framing for presentation to an FFT block 74 for deriving a set of Fourier coefficients representing the power (or magnitude) of the signal in each one of a set of equally spaced frequency bins. This spectral information is then passed through a mel-frequency filter bank 76 to provide estimates of the signal energy in each of a set of non-equally-spaced frequency bands. - This set of energy estimates may itself be used as the output. Alternatively, each of these energy estimates is passed through a
log block 78 to compand the estimate, and then through a Discrete Cosine Transform block 80 to provide cepstral coefficients, known as Mel-Frequency Cepstral Components (MFCC).
-
FIG. 6 illustrates another embodiment of an AIC 56, with some extra functional blocks in the signal path compared with FIG. 5 . In some other embodiments not all of these blocks may be present. - The analog input signal from the
transducer element 90 is presented to an ADC 92, for example a 1-bit delta-sigma ADC, clocked by a sample clock CKM of nominally 768 kHz generated by a local clock generator 94, which may be synchronised to a system 32 kHz Real-Time Clock for instance, or which may be independent. - The delta-sigma data stream Dx is then decimated in a
decimator 96 to a sample rate of, say, 16 ks/s.
- It may then be passed to a pre-emphasis block 98 comprising a high-pass filter, to spectrally equalise the speech signal, which is most likely dominated by low-frequency components. This step may also be advantageous in reducing the effect of low-frequency background noise, for instance wind noise or mechanical acoustic background noise. There may also be a frequency-dependent noise reduction block at this point, as discussed below, to reduce noise in the frequency bands where it is most apparent.
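- As a minimal sketch of such a pre-emphasis stage (the 0.97 coefficient is a common textbook choice, assumed here rather than taken from this disclosure):

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """One-tap high-pass filter y[n] = x[n] - alpha * x[n-1].
    Flattens the speech spectrum (which is dominated by low
    frequencies) and attenuates low-frequency background noise."""
    return np.append(x[0], x[1:] - alpha * x[:-1])
```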
- The signal may then be passed to a windowing block 100, which may apply say a Hamming window or possibly some other windowing function to extract short-duration frames, say of 10 ms to 50 ms duration, over each of which the speech may be considered stationary. The windowing block extracts a stream of short-duration frames by sliding the Hamming window along the speech signal by say half the frame length, or say sliding a 25 ms window by 10 ms, thus providing frames of windowed data at a frame rate of 100 frames per second. - An
FFT block 102 then performs a Fast Fourier Transform (FFT) on the set of windowed samples of each frame, providing a set of Fourier coefficients representing the power (or magnitude) of the signal in each one of a set of equally spaced frequency bins. - Each of these frame-by-frame sets of signal spectral components is then processed by a Mel-filter-
bank 104, which maps and combines these linearly-spaced spectral components onto frequency bins distributed to correspond more closely to the non-linear frequency sensitivity of the human ear, with a greater density of bins at low frequencies than at high frequencies. For instance, there may be 23 such bins, each with a triangular band-pass response, with the lowest frequency channel centred at 125 Hz and spanning 125 Hz, while the highest frequency channel is centred at 3657 Hz and spans 656 Hz. In some embodiments, other numbers of channels or other non-linear frequency scales such as the Bark scale may be employed. - A
log block 106 then applies a log scaling to the energy reported from each mel-frequency bin. This helps reduce the sensitivity to very loud or very quiet sounds, in a similar way to the non-linear amplitude sensitivity of human hearing. - The logarithmically compressed bin energies are then passed as a set of samples to a Discrete Cosine
Transform block DCT 108, which applies a Discrete Cosine Transform to each set of logarithmically compressed bin energies. This serves to separate the slowly varying spectral envelope (or vocal tract) information from the faster varying speech excitation. The former is more useful in speech recognition, so the higher coefficients may be discarded. However, in some embodiments these may be preserved, or possibly combined by weighted addition to provide at least some measure of energy for higher frequencies, to aid in distinguishing sibilants or in providing more clues for speaker identification. In some embodiments the higher order (3) coefficients may be generated in parallel with the lower ones. - The
DCT block 108 may also provide further output data. For instance, one component output may be the sum of all the log energies from each channel, though this may also be derived by a parallel total energy estimator EST 110 fed from un-pre-emphasised data.
- An equaliser (EQ) block 112 may adaptively equalise the various components relative to a flat spectrum, for instance using an LMS algorithm.
- Before transmission, the data rate may be further reduced by a Data Compressor (DC) block 114, possibly exploiting redundancy or correlation between the coefficients expected due to the nature of speech signals. For example split vector quantisation to compress the MFCC vectors. In one example feature vectors of dimension 14 say may be split into pairs of sub vectors, each quantised to 5 or 6 bits say with a respective codebook at a frame period of 10 ms. This may reduce the data rate to 4.4 kb/s or lower, say 1.5 kb/s if a 30 ms frame period is used.
- Additionally or alternatively, the data compressor may employ other standard data compression techniques.
- Thus the data rate necessary to carry useful information concerning the speech content of the acoustic input signal has been reduced below that necessary for simple multi-bit or oversampled time-domain representations of the actual waveform by employing compression techniques at least in part reliant on known general characteristics of speech waveforms and of the human perception of speech, for instance in the use of non-linearly-spaced filter banks and logarithmic compression, or the separation of vocal tract information from excitation information referred to above. The outgoing data stream may be considered compressed speech data in that the output data has been compressed from the input signal in a manner particularly suitable for speech and for communication of the parameters of a speech waveform that convey information rather than general purpose techniques of signal digitisation and of compressing arbitrary data streams.
- Having generated the compressed speech data, this data now needs to be physically transmitted to the codec or other downstream circuitry. In the case of an accessory connected to a host device by means of a cable (such as the
headset 34 containing multiple microphones being connected to an audio device 30, as shown in FIG. 3 ), the output data may be transmitted simply using two wires, one carrying the data (for example 180 bits every 30 ms, in the example of FIG. 5 ), and the second carrying a sync pulse or edge every 30 ms. The extra power of this low-rate clock line is negligible compared to the already low power consumption of the data line. Similarly, a two-wire link may be used between a microphone in the body of a device such as a mobile phone and a codec or similar on a circuit board inside the phone.
- Alternatively, a one-wire serial interface may be employed, transmitting the data in a recurring predefined sequence of frames, in which a unique sync pattern may be sent at the start of every frame of words and recovered by simple and low power data and clock recovery circuitry in the destination device. The clock is preferably a low-power clock inside the microphone, whose precise frequency and jitter is unimportant since the feature data is nowhere near as clock critical as full-resolution PCM.
- Nibbles of data may be sent using a pulse-length modulated (PLM) one-wire or two-wire format such as disclosed in published US Patent Application (US2013/0197920(A1)). The data may be sent with a sequence of pulses with a fixed leading edge, with the length of each pulse denoting the binary number. The fixed leading edge makes clock recovery simple.
- Some slots in the outgoing data stream structure (PLM or non-PLM) may be reserved for identification or control functions. In this application, with continuous streams of data, occasional data bit errors may not have a serious impact. However, in some applications it may be desirable to protect at least the control data with some error-detection and/or correction scheme, e.g. based on Cyclic Redundancy Check bits embedded in the stream.
- The speech coding to reduce the data rate and thus the average transition rate on the physical bus may thus greatly reduce the power consumption of the system. This power saving may be offset somewhat by the power consumed by the speech coding itself, but this processing may have otherwise had to be performed somewhere in the system in order to provide the keyword detection or speaker recognition or more general speech recognition function in any case. Also with decreasing transistor size the power required to perform a given digital computation task is falling rapidly with time.
- It is known that Mel-frequency Cepstral Component (MFCC) values are not very robust in the presence of additive noise. This may lead to false positives from a downstream voice keyword detector, which may lead to this block frequently triggering futile power-up of following circuitry, with a significant effect on average system power consumption.
- In some embodiments the generation method may be modified, for instance by raising the log-mel-amplitudes (generated by the
block 78 in the embodiment shown in FIG. 5 or the block 106 in the embodiment shown in FIG. 6 ) to a suitable power (around 2 or 3) before taking the DCT (in the block 80 in the embodiment shown in FIG. 5 or the block 108 in the embodiment shown in FIG. 6 ), which reduces the influence of low-energy components.
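- This modification amounts to one extra operation in the cepstral chain; a minimal sketch follows (the exponent, around 2 or 3, is as described above; applying it to the log-mel values is shown purely to mirror the text):

```python
import numpy as np
from scipy.fftpack import dct

def robust_cepstrum(log_mel, power=2.0):
    """Raise the log-mel amplitudes to a power (around 2 or 3)
    before the DCT, so that low-energy, noise-dominated bins
    contribute less to the resulting coefficients."""
    return dct(np.asarray(log_mel) ** power, type=2, norm='ortho')
```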
- In some embodiments, the cepstral coding block may comprise or be preceded by a noise reduction block for instance directly after a
72 or 96 or after adecimation block pre-emphasis block 98 that may already have removed some low frequency noise, or operating on the windowed framed data produced byblock 100. This noise reduction block may be enabled when necessary by a noise detection block. The noise detection block may be analog and monitor the input signal Ax, or it may be digital and operate on the ADC output Dx. The noise detection block may flag when the level or spectrum or other characteristic of the received signal implies a high noise level or when the ratio of peak or average signal to noise falls below a threshold. - The noise reduction circuitry may act to filter the signal to suppress frequency bins where the noise, as monitored in time periods when there appears to be no voice as monitored by a Voice Activity Detector, is likely to exceed the signal at times where there is a signal. For instance a Wiener Filter set up may be used to suppress noise on a frame-by-frame basis. The Wiener filter coefficients may be updated on a frame-by-frame basis and coefficient smoothed via a Mel-frequency filter bank followed by an Inverse Discrete Cosine Transform before application to the actual signal. In some embodiments the Wiener noise reduction may comprise two stages. Each stage may incorporate some dynamic noise enhancement feature where the level of noise reduction performed is dependent on an estimated signal-to-noise ratio or other signal- or noise-related parameter or feature of the signal.
- Various signal coding techniques where the output data transmitted is derived from the signal energies associated with each of a filter bank with non-uniformly spaced centre frequencies as described above, particularly cepstral feature extraction using MFCC coding, are compatible with many known downstream voice recognition or speaker recognition algorithms. In some cases the MFCC data may actually be forwarded from the codec, for example in an ETSI-standard MFCC form, for further signal processing either within the host device or transmitted to remote servers for processing “in the cloud”. This latter may reduce the data bandwidth required for transmission, and may be used to preserve speech quality in poor transmission conditions.
- However in some embodiments the microphone may be required to deliver a more traditional output signal digitising the instantaneous input audio signal in say a 16-bit format at say 16 ks/s or 48 ks/s.
- There may also be other applications in which some other format of signal is required. Traditionally this processing and re-formatting of the signal might take place within a phone applications processor or a smart codec with DSP capability. However given the presence of DSP circuitry in the microphone unit, necessary to reduce digital transmission power in stand-by or “Always-On” modes, this DSP circuitry may be usable to perform other speech coding methods in other use cases. As semiconductor manufacturing processes evolve with ever-decreasing feature sizes, and as the cost of each of these processes decreases over time with maturity, it becomes more feasible to actually integrate this functionality in the microphone unit itself, leaving any more powerful processing power elsewhere in the system freer to perform higher-level tasks. Or indeed in some end applications, the requirement for other signal-processing DSP may be removed, allowing perhaps some simpler non-DSP controller processor to be used.
-
FIG. 7 illustrates a microphone unit 130 which may operate in a plurality of modes, with various degrees and methods of signal coding or compression. Thus, FIG. 7 shows several different functional blocks. In some other embodiments, only a subset of these blocks is present. - The analog input signal from a
transducer element 132 is presented to an ADC 134, for example a 1-bit delta-sigma ADC, and the resulting delta-sigma data stream Dx is then passed to one or more functional blocks, as described below. - The ADC may be clocked by a sample clock CKM, which may be generated by a
local clock generator 136, or which may be received on a clock input 138 according to the operating mode.
- The operating mode may be controlled by the downstream control processor via signals received on a control input terminal 140. These inputs may be separate or may be provided by making the digital output line bi-directional. In some embodiments the operating mode may be determined autonomously by circuitry in the microphone unit. A
control block 142 receives the control inputs and determines which of the functional blocks are to be active. - Thus,
FIG. 7 shows that the data stream Dx may be passed to a PDM formatting block 144, which allows the digitised time-domain output of the microphone to be output directly as a PDM stream. The output of the PDM formatting block 144 is passed to a multiplexer 146, operating under the control of the control block 142, and the multiplexer output is passed to a driver 148 for generating the digital output DAT. -
FIG. 7 also shows the data stream Dx being passed to a feature extraction block 150, for example for obtaining values based on using non-linearly-spaced frequency bins, for instance MFCC values. -
FIG. 7 also shows the data stream Dx being passed to a compressive sampling block 152, for example for deriving a sparse representation of the incoming signal. -
FIG. 7 also shows the data stream Dx being passed to a lossy compression block 154, for example for performing Adaptive Differential Pulse-Code Modulation (ADPCM) or a similar form of coding. -
FIG. 7 also shows the data stream Dx being passed to a decimator 156.
-
FIG. 7 shows the outputs of the compressive sampling block 152, lossy compression block 154 and decimator 156 being connected to respective data buffer memory blocks 158, 160, 162. These allow the higher quality data generated by these blocks to be stored. Then, if analysis of a lower power data stream suggests that there is a need, power can be expended in transmitting the higher quality data for some further processing or examination that requires such higher quality data.
-
FIG. 7 also shows the outputs of the feature extraction block 150, compressive sampling block 152, and lossy compression block 154 being output through respective pulse-length modulation (PLM) encoding blocks 164, 166, 168 and through the multiplexer 146, operating under the control of the control block 142, and the driver 148. FIG. 7 also shows the output of the decimator 156 being output through a pulse-code modulation (PCM) encoding block 170 and through the multiplexer 146, operating under the control of the control block 142, and the driver 148.
- Other operating parameters of the circuit may also be altered according to signal mode. For instance in low data rate modes the speed requirements of the DSP operations may be modest, and the circuitry may thus be operated at a lower logic supply voltage or divided master clock frequency than when performing more complex operations in conjunction with higher rate coding.
- Although the AIC or feature-extraction based schemes above may provide particularly efficient methods of encoding and transmitting the essential information in the audio signal, there may be a requirement for the microphone unit to be able to operate also so as to provide a more conventional data format, say for processing by local circuitry or onward transmission for processing in the cloud, where such processing might not understand the more sophisticated signal representation, or where, for example, the current use case is for recording music in high quality.
- In this case, it is advantageous for the initial conversion in ADC to be high quality, requiring a high quality low-jitter clock, and preferably synchronous with codec DSP main clock to avoid issues with sample-rate conversion to be synchronous to the codec master clock and/or the reference sample rate of the standard output digital PCM format, Thus, the microphone unit is operable in a first mode in which feature extraction and/or data compression is performed, and in a second mode, in which (for example) a clock is supplied from the codec, and the unit operates in a manner similar to that shown in
FIG. 1 . - The digital microphone unit may thus be capable of operating in at least two modes—ADC (analog-digital conversion) or AIC (analog-information conversion). In ADC mode the PCM data from the ADC is transmitted, in AIC mode data extracted from the ADC output is coded, particularly for speech.
- In further embodiments, the microphone unit is operable in one mode to perform lossy low-bit-rate PCM coding. For example, the unit may contain a lossy codec such as an ADPCM coder, with a sample rate that in some embodiments may be selectable, for example between 8 ks/s 24 ks/s.
- In some embodiments, the microphone unit has a coding block for performing μ-law and/or A-law coding, or coding to some other telephony standard. For example, the microphone unit in some embodiments has coding blocks for MDCT, MDCT-Hybrid subband, CELP, ACELP, Two-Stage Noise Feedback Coding (TSNFC), VSELP, RPE-LTP, LPC, Transform coding, or MLT coding.
- In other embodiments, the microphone unit is operable in a mode in which it outputs compressive sampled PCM data, or any scheme that exploits signal sparseness.
-
FIG. 8 illustrates an embodiment of a compressive speech coder that may be used in any of the embodiments described or illustrated herein. The output of an ADC 190 is passed through a decimator 192 to provide (for example) 12-bit data at 16 ks/s or 48 ks/s. This data is sampled at an average sample rate of only say 48 Hz or 1 kHz, but with a sampling time randomised by a suitable random number generator or random pulse generator 194.
-
FIG. 9 shows a system using such a compressive speech coder. Thus, the microphone unit 200, including a compressive ADC 202, is connected to supply very low data rate data to a codec 204. With the aid of prior knowledge of the signal statistics, downstream circuitry 206 may either perform a partial reconstruction (computationally cheap) to do sparse feature extraction in a low-power mode, or a full reconstruction (computationally more expensive) to get Nyquist-type voice for onward transmission. Note that there are known post-processing algorithm blocks, such as the block 208, for performing “sparse recognition” that are compatible with such compressive sampling formats. In such algorithms the sparse representation of a signal is matched to a linear combination of a few atoms from a pre-defined dictionary, which atoms may be obtained a priori by using machine learning techniques to learn an over-complete dictionary of primary signals (atoms) directly from data, so that the most relevant properties of the signals may be efficiently captured.
-
FIG. 10 shows an embodiment in which a microphone unit 210 is connected to supply very low data rate data to a codec 212, and in which, to further reduce the power consumption, some, if not all, of the feature extraction is performed using Analogue Signal Processing (ASP). Thus a signal from a microphone transducer is passed to an analog signal processor 214, and then to one or more analog-to-digital converters 216, and then to an optional digital signal processor 218 in the microphone unit 210. Feature recognition 220 is then performed in the codec 212.
FIG. 11 shows in more detail the processing inside an embodiment of the microphone unit 210, in which a large part of the signal processing is performed by analogue rather than digital circuitry. Thus, an input signal is passed through a plurality of band-pass filters 240, 242, 246 (three being shown in FIG. 11 purely by way of illustration). The band-pass filters are constant-Q and equally spaced in mel frequency. The outputs are passed to log function blocks 248, 250, 252, which may be implemented using standard analogue design techniques, for instance by applying the input signal, via a voltage-to-current conversion, into a two-port circuit with a logarithmic current-to-voltage characteristic, such as a semiconductor diode. The outputs are then passed to a plurality of parallel ADCs 252, 254, 256. The ADCs may comprise voltage-controlled oscillators, whose frequency is used as a representation of their respective input signals. These are simple and low power, and their linearity is not important in this application. These simple ADCs may have significantly reduced power and area, even in total, compared to the main ADC. The state of the art for similar circuit blocks in, say, the field of artificial cochleas is below 20 microwatts.
- The microphone may be a MEMS transducer, which may be capacitive, piezo-electric or piezo resistive, and co-integrated with at least the ADC.
- The skilled person will recognise that various embodiments of the above-described apparatus and methods may be, at least partly, implemented using programmable components rather than dedicated hardwired components. Thus embodiments of the apparatus and methods may be, at least partly, embodied as processor control code, for example on a non-transitory carrier medium such as a disk, CD- or DVD-ROM, programmed memory such as read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. In some applications, embodiments of the invention may be implemented, at least partly, by a DSP (Digital Signal Processor), ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array). Thus the code may comprise conventional program code or microcode or, for example, code for setting up or controlling an ASIC or FPGA. The code may also comprise code for dynamically configuring re-configurable apparatus such as re-programmable logic gate arrays. Similarly the code may comprise code for a hardware description language such as Verilog™ or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, the code may be distributed between a plurality of coupled components in communication with one another. Where appropriate, the embodiments may also be implemented using code running on a field-(re-)programmable analogue array or similar device in order to configure analogue hardware.
- It should be understood—especially by those having ordinary skill in the art with the benefit of this disclosure—that the various operations described herein, particularly in connection with the figures, may be implemented by other circuitry or other hardware components. The order in which each operation of a given method is performed may be changed, and various elements of the systems illustrated herein may be added, reordered, combined, omitted, modified, etc. It is intended that this disclosure embrace all such modifications and changes and, accordingly, the above description should be regarded in an illustrative rather than a restrictive sense.
- Similarly, although this disclosure makes reference to specific embodiments, certain modifications and changes can be made to those embodiments without departing from the scope and coverage of this disclosure. Moreover, any benefits, advantages, or solutions to problems that are described herein with regard to specific embodiments are not intended to be construed as a critical, required, or essential feature or element.
- Further embodiments likewise, with the benefit of this disclosure, will be apparent to those having ordinary skill in the art, and such embodiments should be deemed as being encompassed herein.
Claims (22)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US16/380,106 US20190259400A1 (en) | 2014-12-23 | 2019-04-10 | Microphone unit comprising integrated speech analysis |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201462096424P | 2014-12-23 | 2014-12-23 | |
| PCT/GB2015/054122 WO2016102954A1 (en) | 2014-12-23 | 2015-12-22 | Microphone unit comprising integrated speech analysis |
| US201715538619A | 2017-06-21 | 2017-06-21 | |
| US16/380,106 US20190259400A1 (en) | 2014-12-23 | 2019-04-10 | Microphone unit comprising integrated speech analysis |
Related Parent Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US15/538,619 Continuation US10297258B2 (en) | 2014-12-23 | 2015-12-22 | Microphone unit comprising integrated speech analysis |
| PCT/GB2015/054122 Continuation WO2016102954A1 (en) | 2014-12-23 | 2015-12-22 | Microphone unit comprising integrated speech analysis |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| US20190259400A1 true US20190259400A1 (en) | 2019-08-22 |
Family
ID=53677602
Family Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US15/538,619 Active US10297258B2 (en) | 2014-12-23 | 2015-12-22 | Microphone unit comprising integrated speech analysis |
| US16/380,106 Abandoned US20190259400A1 (en) | 2014-12-23 | 2019-04-10 | Microphone unit comprising integrated speech analysis |
Family Applications Before (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US15/538,619 Active US10297258B2 (en) | 2014-12-23 | 2015-12-22 | Microphone unit comprising integrated speech analysis |
Country Status (4)
| Country | Link |
|---|---|
| US (2) | US10297258B2 (en) |
| CN (2) | CN111933158B (en) |
| GB (3) | GB201509483D0 (en) |
| WO (1) | WO2016102954A1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20250310699A1 (en) * | 2024-04-02 | 2025-10-02 | Infineon Technologies Ag | Idle tone mitigation using clock jitter |
Families Citing this family (26)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| GB201509483D0 (en) * | 2014-12-23 | 2015-07-15 | Cirrus Logic Internat Uk Ltd | Feature extraction |
| GB2578386B (en) | 2017-06-27 | 2021-12-01 | Cirrus Logic Int Semiconductor Ltd | Detection of replay attack |
| GB201713697D0 (en) | 2017-06-28 | 2017-10-11 | Cirrus Logic Int Semiconductor Ltd | Magnetic detection of replay attack |
| GB2563953A (en) | 2017-06-28 | 2019-01-02 | Cirrus Logic Int Semiconductor Ltd | Detection of replay attack |
| GB201801527D0 (en) | 2017-07-07 | 2018-03-14 | Cirrus Logic Int Semiconductor Ltd | Method, apparatus and systems for biometric processes |
| GB201801532D0 (en) | 2017-07-07 | 2018-03-14 | Cirrus Logic Int Semiconductor Ltd | Methods, apparatus and systems for audio playback |
| GB2567503A (en) | 2017-10-13 | 2019-04-17 | Cirrus Logic Int Semiconductor Ltd | Analysing speech signals |
| GB201804843D0 (en) | 2017-11-14 | 2018-05-09 | Cirrus Logic Int Semiconductor Ltd | Detection of replay attack |
| GB201801664D0 (en) | 2017-10-13 | 2018-03-21 | Cirrus Logic Int Semiconductor Ltd | Detection of liveness |
| GB201801659D0 (en) * | 2017-11-14 | 2018-03-21 | Cirrus Logic Int Semiconductor Ltd | Detection of loudspeaker playback |
| US11557306B2 (en) | 2017-11-23 | 2023-01-17 | Harman International Industries, Incorporated | Method and system for speech enhancement |
| US11264037B2 (en) | 2018-01-23 | 2022-03-01 | Cirrus Logic, Inc. | Speaker identification |
| US11475899B2 (en) | 2018-01-23 | 2022-10-18 | Cirrus Logic, Inc. | Speaker identification |
| US11735189B2 (en) | 2018-01-23 | 2023-08-22 | Cirrus Logic, Inc. | Speaker identification |
| DE102018204687B3 (en) | 2018-03-27 | 2019-06-13 | Infineon Technologies Ag | MEMS microphone module |
| US10692490B2 (en) | 2018-07-31 | 2020-06-23 | Cirrus Logic, Inc. | Detection of replay attack |
| US10915614B2 (en) | 2018-08-31 | 2021-02-09 | Cirrus Logic, Inc. | Biometric authentication |
| US11637546B2 (en) * | 2018-12-14 | 2023-04-25 | Synaptics Incorporated | Pulse density modulation systems and methods |
| CN110191397B (en) * | 2019-06-28 | 2021-10-15 | 歌尔科技有限公司 | Noise reduction method and Bluetooth headset |
| KR102740717B1 (en) * | 2019-08-30 | 2024-12-11 | 엘지전자 주식회사 | Artificial sound source separation method and device of thereof |
| US12302064B2 (en) * | 2020-01-27 | 2025-05-13 | Infineon Technologies Ag | Configurable microphone using internal clock changing |
| US12325627B2 (en) | 2020-01-27 | 2025-06-10 | Infineon Technologies Ag | Configurable microphone using internal clock changing |
| US11582560B2 (en) * | 2020-11-30 | 2023-02-14 | Infineon Technologies Ag | Digital microphone with low data rate interface |
| CN114187914A (en) * | 2021-12-17 | 2022-03-15 | 广东电网有限责任公司 | Voice recognition method and system |
| CN116645972B (en) * | 2023-02-03 | 2025-11-28 | 电子科技大学 | Tooth pitch suppression method based on sparse decomposition |
| CN118155608B (en) * | 2024-05-11 | 2024-07-19 | 米烁网络科技(广州)有限公司 | Miniature microphone voice recognition system for multi-noise environment |
Family Cites Families (31)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5612974A (en) * | 1994-11-01 | 1997-03-18 | Motorola Inc. | Convolutional encoder for use on an integrated circuit that performs multiple communication tasks |
| US5737485A (en) | 1995-03-07 | 1998-04-07 | Rutgers The State University Of New Jersey | Method and apparatus including microphone arrays and neural networks for speech/speaker recognition systems |
| US5966688A (en) * | 1997-10-28 | 1999-10-12 | Hughes Electronics Corporation | Speech mode based multi-stage vector quantizer |
| US20020193989A1 (en) * | 1999-05-21 | 2002-12-19 | Michael Geilhufe | Method and apparatus for identifying voice controlled devices |
| US6397186B1 (en) * | 1999-12-22 | 2002-05-28 | Ambush Interactive, Inc. | Hands-free, voice-operated remote control transmitter |
| US7076260B1 (en) * | 2000-03-21 | 2006-07-11 | Agere Systems Inc. | Unbalanced coding for cordless telephony |
| JP2001318694A (en) | 2000-05-10 | 2001-11-16 | Toshiba Corp | Signal processing device, signal processing method and recording medium |
| US20040156520A1 (en) * | 2002-04-10 | 2004-08-12 | Poulsen Jens Kristian | Miniature digital transducer with reduced number of external terminals |
| US7099821B2 (en) * | 2003-09-12 | 2006-08-29 | Softmax, Inc. | Separation of target acoustic signals in a multi-transducer arrangement |
| US7929714B2 (en) * | 2004-08-11 | 2011-04-19 | Qualcomm Incorporated | Integrated audio codec with silicon audio transducer |
| US20060206320A1 (en) * | 2005-03-14 | 2006-09-14 | Li Qi P | Apparatus and method for noise reduction and speech enhancement with microphones and loudspeakers |
| US20080013747A1 (en) * | 2006-06-30 | 2008-01-17 | Bao Tran | Digital stethoscope and monitoring instrument |
| US8379485B2 (en) * | 2007-11-01 | 2013-02-19 | University Of Maryland | Compressive sensing system and method for bearing estimation of sparse sources in the angle domain |
| US8099289B2 (en) * | 2008-02-13 | 2012-01-17 | Sensory, Inc. | Voice interface and search for electronic devices including bluetooth headsets and remote systems |
| US8085941B2 (en) * | 2008-05-02 | 2011-12-27 | Dolby Laboratories Licensing Corporation | System and method for dynamic sound delivery |
| US8171322B2 (en) * | 2008-06-06 | 2012-05-01 | Apple Inc. | Portable electronic devices with power management capabilities |
| US8488799B2 (en) | 2008-09-11 | 2013-07-16 | Personics Holdings Inc. | Method and system for sound monitoring over a network |
| KR20110134127A (en) * | 2010-06-08 | 2011-12-14 | 삼성전자주식회사 | Audio data decoding apparatus and method |
| CN102074245B (en) * | 2011-01-05 | 2012-10-10 | 瑞声声学科技(深圳)有限公司 | Dual-microphone-based speech enhancement device and speech enhancement method |
| CN102074246B (en) * | 2011-01-05 | 2012-12-19 | 瑞声声学科技(深圳)有限公司 | Dual-microphone based speech enhancement device and method |
| US9424849B2 (en) * | 2011-12-14 | 2016-08-23 | Cirrus Logic, Inc. | Data transfer |
| JP6136218B2 (en) * | 2012-12-03 | 2017-05-31 | 富士通株式会社 | Sound processing apparatus, method, and program |
| US9436633B2 (en) * | 2013-02-22 | 2016-09-06 | Marvell World Trade Ltd. | Multi-slot multi-point audio interface |
| US20140244273A1 (en) * | 2013-02-27 | 2014-08-28 | Jean Laroche | Voice-controlled communication connections |
| US9542933B2 (en) * | 2013-03-08 | 2017-01-10 | Analog Devices Global | Microphone circuit assembly and system with speech recognition |
| US10020008B2 (en) * | 2013-05-23 | 2018-07-10 | Knowles Electronics, Llc | Microphone and corresponding digital interface |
| US9111548B2 (en) * | 2013-05-23 | 2015-08-18 | Knowles Electronics, Llc | Synchronization of buffered data in multiple microphones |
| KR20160010606A (en) | 2013-05-23 | 2016-01-27 | 노우레스 일렉트로닉스, 엘엘시 | Vad detection microphone and method of operating the same |
| US20150350772A1 (en) * | 2014-06-02 | 2015-12-03 | Invensense, Inc. | Smart sensor for always-on operation |
| US9549273B2 (en) * | 2014-08-28 | 2017-01-17 | Qualcomm Incorporated | Selective enabling of a component by a microphone circuit |
| GB201509483D0 (en) * | 2014-12-23 | 2015-07-15 | Cirrus Logic Internat Uk Ltd | Feature extraction |
-
2015
- 2015-06-02 GB GBGB1509483.2A patent/GB201509483D0/en not_active Ceased
- 2015-12-22 WO PCT/GB2015/054122 patent/WO2016102954A1/en not_active Ceased
- 2015-12-22 GB GB1522666.5A patent/GB2535002A/en not_active Withdrawn
- 2015-12-22 CN CN202010877951.2A patent/CN111933158B/en active Active
- 2015-12-22 US US15/538,619 patent/US10297258B2/en active Active
- 2015-12-22 GB GB1711576.7A patent/GB2551916B/en active Active
- 2015-12-22 CN CN201580076624.4A patent/CN107251573B/en active Active
-
2019
- 2019-04-10 US US16/380,106 patent/US20190259400A1/en not_active Abandoned
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20250310699A1 (en) * | 2024-04-02 | 2025-10-02 | Infineon Technologies Ag | Idle tone mitigation using clock jitter |
Also Published As
| Publication number | Publication date |
|---|---|
| GB201522666D0 (en) | 2016-02-03 |
| US20180005636A1 (en) | 2018-01-04 |
| US10297258B2 (en) | 2019-05-21 |
| CN107251573A (en) | 2017-10-13 |
| GB201509483D0 (en) | 2015-07-15 |
| GB2551916B (en) | 2021-07-07 |
| GB2535002A (en) | 2016-08-10 |
| WO2016102954A1 (en) | 2016-06-30 |
| CN107251573B (en) | 2020-09-25 |
| CN111933158A (en) | 2020-11-13 |
| GB201711576D0 (en) | 2017-08-30 |
| GB2551916A (en) | 2018-01-03 |
| CN111933158B (en) | 2024-09-10 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10297258B2 (en) | Microphone unit comprising integrated speech analysis | |
| US10824391B2 (en) | Audio user interface apparatus and method | |
| US9111548B2 (en) | Synchronization of buffered data in multiple microphones | |
| US10381021B2 (en) | Robust feature extraction using differential zero-crossing counts | |
| US9542933B2 (en) | Microphone circuit assembly and system with speech recognition | |
| CN105379308B (en) | Microphone, microphone system, and method of operating a microphone | |
| US9460720B2 (en) | Powering-up AFE and microcontroller after comparing analog and truncated sounds | |
| US9721560B2 (en) | Cloud based adaptive learning for distributed sensors | |
| CN110164451B (en) | Speech Recognition | |
| US9785706B2 (en) | Acoustic sound signature detection based on sparse features | |
| CN102027536B (en) | Adaptively filtering a microphone signal responsive to vibration sensed in a user's face while speaking | |
| US20150371638A1 (en) | Context Aware Sound Signature Detection | |
| CN114333781A (en) | System and method for energy efficient and low power distributed automatic speech recognition on wearable devices | |
| US20170154620A1 (en) | Microphone assembly comprising a phoneme recognizer | |
| CN109346075A (en) | Method and system for recognizing user's voice through human body vibration to control electronic equipment | |
| US12277950B2 (en) | Methods for clear call under noisy conditions | |
| CN106104686B (en) | Methods in Microphone, Microphone-Component, Microphone-Device | |
| CN118354237A (en) | Awakening method, device and equipment of MEMS earphone and storage medium | |
| CN103295571A (en) | Control of audio commands using temporal and/or spectral compression | |
| CN110310635B (en) | Voice processing circuit and electronic equipment | |
| US9978394B1 (en) | Noise suppressor |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| AS | Assignment |
Owner name: CIRRUS LOGIC INTERNATIONAL SEMICONDUCTOR LTD., UNI Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LESSO, JOHN PAUL;MELANSON, JOHN LAURENCE;SIGNING DATES FROM 20170817 TO 20170821;REEL/FRAME:048846/0150 |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |
|
| STCV | Information on status: appeal procedure |
Free format text: NOTICE OF APPEAL FILED |
|
| STCV | Information on status: appeal procedure |
Free format text: APPEAL BRIEF (OR SUPPLEMENTAL BRIEF) ENTERED AND FORWARDED TO EXAMINER |
|
| STCV | Information on status: appeal procedure |
Free format text: EXAMINER'S ANSWER TO APPEAL BRIEF MAILED |
|
| STCV | Information on status: appeal procedure |
Free format text: ON APPEAL -- AWAITING DECISION BY THE BOARD OF APPEALS |
|
| STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION |