WO2014194273A2 - Systems and methods for enhancing targeted audibility - Google Patents
Systems and methods for enhancing targeted audibility
- Publication number
- WO2014194273A2 (PCT/US2014/040359; US2014040359W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- processing
- sound input
- noise
- profile
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R25/00—Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception
- H04R25/50—Customised settings for obtaining desired overall acoustical characteristics
- H04R25/505—Customised settings for obtaining desired overall acoustical characteristics using digital signal processing
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0316—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
- G10L21/0364—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R25/00—Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception
- H04R25/55—Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception using an external connection, either wireless or wired
- H04R25/554—Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception using an external connection, either wireless or wired using a wireless connection, e.g. between microphone and amplifier or using Tcoils
Definitions
- One goal of the systems and methods disclosed herein is to make hearing assistance algorithms easily accessible and available by implementing them on mobile devices such as smartphones, PDAs, and the like.
- the systems and methods of the present disclosure integrate hearing assistance algorithms with multi-media algorithms in an API stack (similar to the implementation of audio effects such as stereo widening and psychoacoustic bass enhancement).
- Current hearing assistance applications for consumer devices, such as smartphones, provide basic functionality that falls short because of operating system limitations. These operating systems are not "real-time" environments, so a user-level hearing assistance application cannot guarantee an acceptable processing time. If there is significant delay (for example, greater than 40ms) between the arrival at the ear of the ambient sound and the processed sound, the user will hear a confusing echo. This constraint causes even basic signal processing algorithms to run with too much delay, and certainly puts more sophisticated noise suppression algorithms out of reach.
- Hearing loss is a global problem, with nearly 700 million people suffering from hearing problems, and the rate of hearing loss accelerating around the world. As many as 47 million
- Systems and methods disclosed herein provide for low cost hearing assistance to improve intelligible hearing for those with normal hearing and to greatly improve hearing intelligibility for those with hearing problems. More particularly, the systems and methods of the present disclosure utilize commercially available general purpose central processing units (CPUs) and standard operating systems such as those inherently available in many mobile devices to improve targeted audibility, for example of speech.
- ADC/ADCs — analog-to-digital converter(s)
- DAC/DACs — digital-to-analog converter(s)
- DSP — digital signal processing
- the systems and methods of the present disclosure result in a total latency that reduces/eliminates the perceptibility of a time delay echo, for example, less than 40ms and in some embodiments less than 20ms.
- the systems and methods of the present disclosure may use a novel Bluetooth communication protocol, e.g., with reduced error checking, to reduce latency.
- the systems and methods of the present disclosure may utilize sampling rates and buffers in ADC and DAC which reduce latency. Additional features of the systems and methods disclosed are described in the detailed description section which follows.
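For concreteness, a minimal sketch (not from the patent) of the arithmetic behind this budget: each ADC/DAC buffer contributes buffer_samples / sample_rate of delay, so the buffer sizes largely determine whether the aggregate stays under the roughly 40ms echo threshold. The transport figure and budget split below are illustrative assumptions.

```c
/* Illustrative latency arithmetic (not from the patent): each ADC/DAC buffer
 * adds buffer_samples / sample_rate of delay, so the buffer sizes largely
 * determine whether the aggregate stays under the ~40 ms echo threshold. */
#include <stdio.h>

static double buffer_latency_ms(int buffer_samples, double sample_rate_hz)
{
    return 1000.0 * (double)buffer_samples / sample_rate_hz;
}

int main(void)
{
    const double fs = 44100.0;                 /* typical mobile sampling rate */
    double adc  = buffer_latency_ms(256, fs);  /* input buffering, ~5.8 ms */
    double dsp  = 10.0;                        /* DSP budget stated in the text */
    double dac  = buffer_latency_ms(256, fs);  /* output buffering, ~5.8 ms */
    double link = 8.0;                         /* assumed transport delay (illustrative) */

    printf("aggregate latency ~ %.1f ms\n", adc + dsp + dac + link);
    return 0;
}
```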
- Speech: This term as used herein may refer to speech that is spoken by a source to whom the User is listening and the representation of that speech by an audio signal. This is also referred to as "targeted speech" when it is advantageous to differentiate the speech that is listened to from other speech that is considered a component of noise.
- Ambient Sound: Sound traveling through the air in the proximity of the Mobile Computing Device, as picked up by a microphone.
- Audio: Either an analog or digital electronic transmission that represents sound.
- Noise: The component of sound or audio that is not the targeted sound or audio.
- Mobile Computing Device: A computing device that has a general purpose CPU, runs a standard operating system, and is designed to be portable.
- Speaker: A device that translates analog audio into sound.
- systems and methods may utilize a Mobile Computing Device to process ambient sound and enhance the audibility of a targeted portion of the ambient sound, for example, to enhance speech.
- a Mobile Computing Device may refer to any portable or semi-portable device that includes a general purpose CPU and a standard operating system capable of executing an application— for example, an application stored in memory.
- Examples of Mobile Computing Devices may include smartphones, tablets, laptops, PDAs, media players, such as mp3 players, and the like.
- Examples of general purpose CPUs are the ARM Cortex-A9, Samsung S5PC100, TI OMAP4 platform, and Apple A4.
- the key definition of a general CPU is that it has a general instruction set that may be augmented by specific instructions or co-processors to do DSP, but a general CPU is designed to execute a wide range of applications.
- General purpose CPUs are thus distinguished from proprietary processing chipsets found in dedicated hearing aid devices, in that they are designed to handle a wide variety of applications.
- Examples of standard operating systems are any of the versions of the Microsoft Windows mobile operating systems (Windows RT), different releases of the UNIX and Linux operating systems (e.g., Android), any version of the iOS, or any other non-proprietary operating system capable of running on the Mobile Computing Device and performing the operations described herein.
- a Mobile Computing Device may include or otherwise be operatively associated with one or more sensors for detecting ambient sound, such as a microphone.
- a microphone may either be an integral component to the Mobile Computing Device or an external component operatively associated with the Mobile Computing Device via a wired or wireless connection.
- the Mobile Computing Device may include or otherwise be operatively associated with one or more speakers for outputting sound processed by the mobile device.
- a speaker may be embedded in an integral component to the Mobile Computing Device or embedded in an external component operatively associated with the Mobile Computing Device via a wired or wireless connection, for example a headset, earphones and the like.
- the same external component, for example a headset, may include both a speaker and a microphone.
- Fig. 1a and Fig. 1b describe the components of latency from input Sound reaching the microphone to processed Sound output from the Speakers.
- a key feature of the systems and methods of the present disclosure is the construction and configuration of the digital signal processing (DSP), ADC, DAC and communications (for example, between a microphone and a mobile device and/or between a mobile device and an earpiece), such as wireless (e.g., Bluetooth) communications, so as to minimize aggregate latency, i.e., the latency is sufficiently low that a user will not perceive an echo-like delay between the processed sound exiting from the speaker and the raw ambient sound entering the User's ear.
- the systems and methods of the present disclosure result in aggregate latency, for example, of less than 40ms in wireless embodiments and less than 25ms in wired embodiments.
- Fig. 2 depicts an exemplary embodiment of a general method 9000, for processing ambient sound, including improving the audibility of targeted speech.
- Method 9000 includes steps of receiving ambient sound via an internal or external microphone; converting the ambient sound to a digital audio signal using an ADC [9001]; performing DSP to process the digital audio signal and enhance the audibility of the ambient sound [9002], where the DSP outputs a digital audio signal that is either further processed as a digital audio signal or converted, using a DAC [9003], to an analog audio signal, which is then transmitted to a Speaker.
- Fig. 3 depicts a block diagram of the processes utilized in an exemplary embodiment to increase the audibility of targeted speech, which utilizes Apple's iOS family of operating systems.
- Audio processing in the iOS operating system is based on event-oriented processing.
- RCEngineMgrDelegate [8040] is the primary event handler, processing events and setting state variables. It is the primary mediator between the general application processes, e.g., the User Interface, and the active audio processing modules.
- the ViewControllers [8010] manage the display and interaction of the User Interface (UI), e.g., signaling to the RCEngineMgrDelegate that audio processing should be initiated, or passing a parameter to the RCEngineMgrDelegate to change the volume setting.
- the ViewControllers also communicate with RCPreferences to display and update User-entered profile information.
- RCPreferences manages the User-settable preferences and profiles, such as instantiating a stored Hearing Profile or Equalizer Profile or retrieving a saved Sample Noise Profile.
- RCPreferences interfaces with the RCPermStoreDelegate to either retrieve or update storable User preferences and profiles.
- RCPermStoreDelegate mediates between RCPreferences and the various mechanisms for permanently storing data, e.g., Hearing Profiles, Sample Noise Profiles, etc., delegating to the appropriate process and indicating the CRUD operation that is required.
- RCProfileFiles [8031] stores and retrieves User profiles, such as Hearing Profiles, Equalizer Profiles and Sample Noise Profiles, in the iOS file system.
- OS X User Defaults [8032] retrieves and updates, in permanent storage, User preferences and other parameters that are used when the exemplary embodiment is initiated or that are changed during its execution.
- RealClarityAudio is the audio engine that manages the processing of digital audio.
- An instance of RealClarityAudio is instantiated when the exemplary embodiment is started and initiates the processing of a digital audio signal by iOS.
- RealClarityAudio then provides the overall management of the processing, specifically by instantiating the Audio Processing Graph unit.
- Audio Processing Graph [8060] is an object that contains an event-oriented flow describing processes to be executed based on call-backs from iOS. These flows provide the key set of functions that need to be executed by the exemplary embodiment to increase the audibility of speech that is being delivered within the digital audio input.
- the major call-back executes the exemplary embodiment's DSP. Additional call-backs include a Speaker Response Estimator and a Noise Estimator.
- RealClarity DSP [8061] contains the exemplary embodiment's algorithm that performs the core DSP to increase audibility. That algorithm is described in "Digital Signal Processing", below.
- Speaker Response Estimator [8062] is a unique process, triggered from a UI screen, that generates white noise that is broadcast through the Speakers of an earpiece and input through the Mobile Computing Device's microphone.
- the Estimator creates an adjustment profile calibrated to correct gain anomalies in the Speaker of the wired earpiece, based on the difference between the expected noise profile of white noise and the actual noise profile output from the Speaker.
- the adjustment profile, which is stored, enables the RealClarity DSP to adjust for the anomalies.
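A hypothetical sketch of the adjustment-profile idea: compare the nominally flat magnitude spectrum of the played white noise with the spectrum measured through the earpiece and microphone path, and store a per-bin correction gain. The function name, bin count and clamping limits are assumptions, not the patent's implementation.

```c
/* Hypothetical sketch of the Speaker Response Estimator idea: compare the
 * (nominally flat) magnitude spectrum of the played white noise with the
 * spectrum measured through the earpiece/microphone path, and store a
 * per-bin correction gain.  Names, bin count and limits are illustrative. */
#include <math.h>

#define NUM_BINS 256

void build_adjustment_profile(const float expected_mag[NUM_BINS],
                              const float measured_mag[NUM_BINS],
                              float adjustment_gain[NUM_BINS])
{
    for (int k = 0; k < NUM_BINS; k++) {
        float meas = fmaxf(measured_mag[k], 1e-6f);   /* avoid divide-by-zero */
        float gain = expected_mag[k] / meas;          /* boost dips, cut bumps */
        /* Clamp the correction so a deep null in the speaker response cannot
         * demand an unbounded boost (which would invite feedback). */
        adjustment_gain[k] = fminf(fmaxf(gain, 0.25f), 4.0f);
    }
}
```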
- Noise Estimator [8063] is a process, triggered from a UI screen, that creates a Noise Profile based on, for example, a 5 sec audio stream of the ambient noise in an environment. This Noise Profile is stored and is then available to be utilized by the RealClarity DSP.
- Digital Signal Processing (DSP) - Overview
- DSP is performed within a time constraint acceptable for aggregate latency, i.e., a time period such that the brain can integrate the processed sound with the ambient sound directly coming to the ear.
- This aggregate latency, from microphone to Speaker, is best if it is around 20ms or 25ms. However, an aggregate latency up to around 40ms may be tolerable by most people.
- the exemplary embodiment discloses an effective set of DSP processes and a system design that fit within the aggregate time constraint, i.e., a target delay of 10ms for all signal processing (ADC, DSP and DAC).
- DSP is typically implemented by way of an application which is stored and executed, without requiring changing the underlying firmware of the device.
- the application is upgradable and may be utilized with any number of different mobile devices that support the same operating environment.
- the core algorithms are coded in the "C" computer language.
- C-based source can be ported to all current mobile operating systems.
- Alternate embodiments of the core DSP processing can be implemented with low cost. For example, by modifying the core DSP to interface with standardized API stacks, such as the open source OpenMAX AL (OpenMAX AL defines an interface between multimedia applications, such as a media player, and the platform media framework), the DSP can easily migrate to different platforms including Android.
- OpenMAX AL defines an interface between multimedia applications, such as a media player, and the platform media framework
- the exemplary embodiment utilizes the standard chipsets in a Mobile Computing Device, requiring no proprietary hardware. Since the power of these chipsets (given Moore's law that is certainly being followed in the mobile-device space) is dramatically increasing, progressively more sophisticated algorithms, particularly for noise reduction and speech clarity, will be enabled.
- the DSP for an exemplary embodiment has the primary goal of improving the speech-to-noise ratio.
- These signal processing algorithms can be applied to electronic audio as well as ambient sound.
- the time constraint on DSP relates to ambient sound where it is important to avoid an echo effect, i.e., to deliver sound to the speaker with an aggregate latency of less than 40ms.
- DSP is performed on a real-time or high priority thread utilizing Call-backs from the operating environment.
- the DSP component of aggregate latency is reduced by executing an effective set of algorithms for the DSP, where these algorithms are driven by parametric input.
- the values of the parameters are derived by background processing or from User input such that computation of these parameters does not add processing latency to the real-time DSP.
- the parametric input is either supplied as arguments to the DSP process or indirectly via profiles, sound samples, and state variables stored in a shared common memory space.
- An effective set of DSP processes could include, but is not limited to, gain control and gain shaping, frequency gain adjustment, frequency mapping, dynamic range compression, noise suppression, noise removal, speech detection, speech enhancement, and detection and suppression of non-speech impulse sound.
- the exemplary embodiment (as shown in Fig. 4) contains an effective set of DSP algorithms that have been implemented for mobile devices running Apple's iOS operating environment.
- the DSP processing takes a frame of digital audio input in the time domain, transforms it into a Frequency Spectrum using a Fast Fourier Transform, processes that Frequency Spectrum, and reconstructs a frame of digital audio output. There may be averaging or smoothing done between sequential Frequency Spectrums and between time domain audio frames.
- the DSP process, including buffering, is designed to take less than 10 ms.
- the DSP is based on a filter bank architecture with the following components:
- Audio Input [1]
- the audio input [1] is a digital stream that can come from a number of sources, such as the electronic sound from applications running on the mobile device or telephone conversations.
- the primary audio input comes from an analog-to-digital converter which receives its analog signal from an internal microphone or from an external microphone.
- Fast Fourier Transform (FFT) [2]
- the time domain signal is then converted to the frequency domain using a Fast Fourier Transform [2] by transforming a time frame with 256 samples to a Frequency Spectrum of 256 bins where the frequency is represented by a complex number indicating the amplitude and phase of the bin. All FFT-based measurements assume that the signal is periodic in the time frame. When the measured signal is not periodic then leakage occurs. Leakage results in misleading information about the spectral amplitude and frequency.
- the exemplary embodiment applies a Hann Window transformation to reduce the effect of leakage.
- a drawback of windowing functions like Hann is that the beginning and end of the signal are attenuated in the calculation of the spectrum. This means that more averages may be taken to get a good statistical representation of the spectrum, which would increase the latency of the FFT algorithm.
- a 75% overlap process is implemented in the exemplary embodiment, where only 64 samples are added and the remaining 192 come from the previous window. This moving average approach minimizes latency while compensating for the attenuated signal.
- the expected latency including the buffering of the time frame and the delay because of the averaging is estimated to be 5.8ms.
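The framing just described (256-sample frames, Hann window, 64-sample hop) could look like the following sketch in C, the language the disclosure says the core algorithms use. fft_256() is a placeholder for whatever FFT routine the platform supplies; all other names are illustrative.

```c
/* Sketch of the framing described above: 256-sample frames, Hann window,
 * 64-sample hop (75% overlap).  fft_256() is an assumed helper standing in
 * for the platform's FFT routine. */
#include <complex.h>
#include <math.h>
#include <string.h>

#define FRAME 256
#define HOP    64   /* 75% overlap: 64 new samples per frame */

extern void fft_256(const float in[FRAME], float complex out[FRAME]); /* assumed helper */

static float window[FRAME];
static float history[FRAME];    /* the most recent FRAME input samples */

void init_hann(void)
{
    const float pi = 3.14159265f;
    for (int n = 0; n < FRAME; n++)
        window[n] = 0.5f * (1.0f - cosf(2.0f * pi * (float)n / (FRAME - 1)));
}

/* Called once per HOP new samples; produces one Frequency Spectrum. */
void analyze_hop(const float new_samples[HOP], float complex spectrum[FRAME])
{
    float frame[FRAME];

    /* Shift in the 64 new samples; the remaining 192 come from the previous window. */
    memmove(history, history + HOP, (FRAME - HOP) * sizeof(float));
    memcpy(history + FRAME - HOP, new_samples, HOP * sizeof(float));

    /* The Hann window reduces spectral leakage at the cost of attenuating the frame edges. */
    for (int n = 0; n < FRAME; n++)
        frame[n] = history[n] * window[n];

    fft_256(frame, spectrum);
}
```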
- the speech enhancement process [4] has a capability to perform continuous noise estimation, i.e., estimate what is noise. However, if a User is in a stable noise environment the speech enhancement algorithms work better with a fixed measurement of the noise profile.
- the Manual Noise Estimator process gets input from the FFT process and creates a stable Noise Profile.
- the Noise Profile is output as a Frequency Spectrum, which then can be input to the Speech Enhancement process.
- the creation of a noise sample by the Manual Noise Estimator process is initiated by the User pressing the "sample noise" control in the Filter screen [Fig. 6.16]. Sound is then gathered for a period of five seconds, the sound is transformed by the FFT process and input to the Manual Noise Estimator process that will create a Noise Profile.
- the created Noise Profile is then stored in the Noise Profile buffer where it will then be accessed by the Speech Enhancement process.
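A minimal sketch of how such a stable Noise Profile could be formed, assuming it is simply the per-bin average magnitude over the frames collected during the roughly five-second sample; the function name and frame layout are assumptions.

```c
/* Minimal sketch of a Manual Noise Estimator: average the per-bin magnitude
 * of the FFT frames gathered during the ~5 second sample to form a stable
 * Noise Profile.  Frame layout and names are illustrative. */
#include <complex.h>

#define NUM_BINS 256

void estimate_noise_profile(const float complex *frames, /* num_frames * NUM_BINS */
                            int num_frames,
                            float noise_profile[NUM_BINS])
{
    for (int k = 0; k < NUM_BINS; k++) {
        float sum = 0.0f;
        for (int f = 0; f < num_frames; f++)
            sum += cabsf(frames[f * NUM_BINS + k]);   /* per-bin magnitude */
        noise_profile[k] = sum / (float)num_frames;    /* mean noise magnitude */
    }
}
```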
- Given the creation of the Noise Profile [Fig. 6.17], the User has the option of naming and saving the created Noise Profile for later use [Fig. 6.18]. Rather than creating a current noise sample, the User can select a stored Noise Profile [Fig. 19]. The selected Noise Profile will be stored in the Noise Profile buffer where it can be accessed by the Speech Enhancement module. There are three parametric arguments to the Manual Noise Sample process that are used to inform the process controller of the status of the Noise Profile creation:
- the Speech Enhancement process is the core process for improving the speech-to-noise ratio by removing noise from the audio input.
- the process implements an algorithm described by Diethorn ("Subband Noise Reduction Methods for Speech Enhancement", Eric J. Diethorn, Microelectronics and Communications Technologies, Lucent Technologies) that is "less complex" so that it does not significantly add to the aggregate latency.
- the algorithm consists of four key processes: sub-band analysis, envelope estimation, gain computation, and sub-band synthesis [Fig. 5].
- the Speech Enhancement algorithm is designed to continually estimate the noise component of the audio input (10 in the diagram).
- the Speech Enhancement process also estimates when speech is present through a soft Voice Activity Detection algorithm (VAD).
- the VAD limits the possible gain reduction for noise.
- it may be possible to substitute the background-computed time-domain estimate of when speech is present.
- the time domain estimate may be more accurate and would allow more flexibility in terms of gain reduction.
- the output of the Speech Enhancement process is a Frequency Spectrum with an increase in Speech-to-Noise ratio.
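The following sketch illustrates the general shape of such a per-bin gain stage (envelope estimation, gain computation limited by a floor that the VAD raises when speech is present). It is not the Diethorn algorithm itself; the smoothing factor, gain floors and names are assumptions.

```c
/* Generic per-bin noise-reduction gain in the spirit described above
 * (envelope estimation, gain computation, gain floor limited when the VAD
 * reports speech).  This is NOT the Diethorn algorithm itself, just an
 * illustrative sketch; all names and constants are assumptions. */
#include <complex.h>
#include <math.h>

#define NUM_BINS 256

void enhance_speech(float complex spectrum[NUM_BINS],
                    const float noise_profile[NUM_BINS],
                    float signal_envelope[NUM_BINS],  /* smoothed per-bin magnitude, updated here */
                    int vad_speech_present)
{
    /* Less gain reduction is allowed while speech is present. */
    const float gain_floor = vad_speech_present ? 0.4f : 0.1f;
    const float alpha = 0.8f;                         /* envelope smoothing factor */

    for (int k = 0; k < NUM_BINS; k++) {
        float mag = cabsf(spectrum[k]);
        signal_envelope[k] = alpha * signal_envelope[k] + (1.0f - alpha) * mag;

        /* Wiener-like gain: how much of the envelope is not explained by noise. */
        float snr  = signal_envelope[k] / fmaxf(noise_profile[k], 1e-6f);
        float gain = 1.0f - 1.0f / fmaxf(snr, 1.0f);
        if (gain < gain_floor) gain = gain_floor;

        spectrum[k] *= gain;   /* attenuate noise-dominated bins, keep speech */
    }
}
```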
- the Broadband Squelch process removes low frequencies (i.e., those below the speech range) from the Frequency Spectrum. While that low frequency noise will still be heard by Users as ambient sound reaching their ears, it will not be presented in the audio output for the Speaker.
- the level of low frequency sound to be removed is chosen by the User by setting the lower slider control on the slider bar on the Filter screen [Fig. 6.16].
- the Broadband Squelch has three controlling arguments:
- the output of the Broadband Squelch process is a Frequency Spectrum with the low frequencies appropriately removed.
- the most important feature of hearing assistance is to be able to adjust the gain of different frequencies to match the User's hearing ability and hearing preference.
- the User Profile process accesses a Profile buffer, which is constructed by combining a Hearing Profile and an Equalizer Profile, to adjust the gain for frequencies in the Frequency Spectrum that is output from the Broadband Squelch process.
- the pre-stored Hearing Profiles can represent average hearing loss profiles by age.
- the stored Hearing Profiles cover the normal frequency range and decibel deficit that are used in standard hearing tests.
- the User can make additional adjustments to fit particular sound situations or their own hearing preferences by adjusting the Equalizer Profile, e.g., emphasizing the frequencies most used for speech, or increasing the higher frequencies to get a better experience listening to music.
- the Equalizer Profile defines a set of additive gain amounts that modify the Hearing Profile.
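A minimal sketch of that combination, assuming both profiles are expressed as per-band dB gains that are summed, converted to linear gain, and applied to the Frequency Spectrum; the band count and band-to-bin mapping are illustrative.

```c
/* Sketch of the User Profile step: the Hearing Profile and Equalizer Profile
 * are treated as additive dB gains per frequency band, combined and converted
 * to linear gain before being applied to the Frequency Spectrum.  Band count
 * and the band-to-bin mapping are illustrative assumptions. */
#include <complex.h>
#include <math.h>

#define NUM_BINS  256
#define NUM_BANDS 8

void apply_user_profile(float complex spectrum[NUM_BINS],
                        const float hearing_db[NUM_BANDS],
                        const float equalizer_db[NUM_BANDS],
                        const int band_of_bin[NUM_BINS])   /* maps each bin to a band */
{
    float linear_gain[NUM_BANDS];

    for (int b = 0; b < NUM_BANDS; b++) {
        float total_db = hearing_db[b] + equalizer_db[b];   /* Equalizer is additive */
        linear_gain[b] = powf(10.0f, total_db / 20.0f);     /* dB -> amplitude gain */
    }

    for (int k = 0; k < NUM_BINS; k++)
        spectrum[k] *= linear_gain[band_of_bin[k]];
}
```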
- the left-side wheel allows a User to select one of a number of pre-stored Equalizer Profiles.
- the pre-set Equalizer Profiles have frequency gain settings for common sound situations.
- Entered Equalizer Profiles can also be named and saved so that Users can define their own set of profiles for different sound situations and environments.
- a User can select a named Equalizer Profile on the Equalizer Select screen [Fig. 6.7]. To use an entered Equalizer Profile, the left-side wheel on the Clarify screen is set to the array icon.
- Broadband AGC (automatic gain control) process
- the Broadband AGC process adjusts the overall gain of the Frequency Spectrum to compensate for volume changes in the sound environment, e.g., going from a quiet environment to a loud environment. This is to make sure that a User does not hear any abrupt changes in the sound from the Speaker.
- the Broadband AGC process is important as it removes the threat that delivered sound from the Speaker would be loud enough to damage a User's hearing ability.
- the Broadband AGC process measures a moving average of the audio energy represented in the Frequency Spectrum to ascertain significant changes, and will limit the absolute gain and smooth the gain during an environmental transition. The Broadband AGC process cannot operate at low levels of sound energy, as the results would be too volatile, so an energy threshold may be set that indicates the energy level at which the automatic gain control is activated.
- the Broadband AGC process has two controlling parameters:
- the output of the Broadband AGC is an adjusted Frequency Spectrum.
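A rough sketch of such an AGC, under the assumptions that a moving average of frame energy drives a smoothed broadband gain, that the gain is capped for hearing safety, and that nothing happens below an activation threshold; all constants and names are illustrative.

```c
/* Sketch of a Broadband AGC: track a moving average of frame energy and
 * smoothly adjust an overall gain toward a target level, but only when the
 * energy is above an activation threshold.  Constants are illustrative. */
#include <complex.h>
#include <math.h>

#define NUM_BINS 256

typedef struct {
    float avg_energy;     /* moving average of frame energy */
    float gain;           /* current broadband gain */
    float target_energy;  /* desired output energy level */
    float threshold;      /* below this, AGC is not applied */
    float max_gain;       /* absolute gain limit (hearing safety) */
} agc_state;

void broadband_agc(agc_state *s, float complex spectrum[NUM_BINS])
{
    float energy = 0.0f;
    for (int k = 0; k < NUM_BINS; k++) {
        float m = cabsf(spectrum[k]);
        energy += m * m;
    }

    s->avg_energy = 0.95f * s->avg_energy + 0.05f * energy;  /* moving average */

    if (s->avg_energy > s->threshold) {
        float desired = sqrtf(s->target_energy / s->avg_energy);
        if (desired > s->max_gain) desired = s->max_gain;     /* limit absolute gain */
        s->gain = 0.9f * s->gain + 0.1f * desired;            /* smooth transitions */
    }

    for (int k = 0; k < NUM_BINS; k++)
        spectrum[k] *= s->gain;
}
```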
- the software volume is set on the main RealClarity screen using the "boost" control.
- the Volume Control process adjusts the gain in the Frequency Spectrum for each time frame to reflect the volume control setting specified by a User.
- Offering the software volume control is important as it means that the volume level, and specifically changes in the set volume level, are known to the DSP.
- the performance of the DSP is affected by the volume setting; in particular, if the volume is too high, feedback can be introduced.
- the best practice for a User would be to set the hardware volume at one level near its maximum and only modify the software volume control.
- the Volume Control process has one control parameter:
- the output of the Volume Control is an adjusted Frequency Spectrum.
- the Broadband Limiter process recognizes a potential loud noise interruption through a sudden increase to a high level in the energy of the audio signal. On recognizing the appearance of a sudden noise, the Broadband Limiter will reduce the overall gain in the Frequency Spectrum.
- the level of volume that is to be considered a sudden loud noise is chosen by the User by setting the upper slider control on the slider bar on the Filter screen [Fig. 6.16].
- the Broadband Limiter has two controlling parameters:
- the output of the Broadband Limiter process is a modified Frequency Spectrum.
- Multiband Limiter process [10]
- the Frequency Spectrum contains non-zero amplitude for frequencies outside of the range for which the Speaker can produce sound. In that case the Speaker will produce sound at its maximum for all frequencies above that physical limit. This will produce distortion.
- the Multiband Limiter cuts off these high energy peaks preventing that distortion.
- the Multiband Limiter process has one controlling parameter:
- the output of the Multiband Limiter process is a modified Frequency Spectrum.
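A minimal sketch of that per-bin limiting, assuming the limiter simply scales down any bin whose magnitude exceeds a ceiling while preserving its phase; the ceiling parameter is an assumption.

```c
/* Sketch of the Multiband Limiter: clip per-bin magnitudes that exceed a
 * ceiling (e.g., what the Speaker can reproduce), preserving phase, so the
 * DAC/Speaker is not driven into distortion.  The ceiling is illustrative. */
#include <complex.h>

#define NUM_BINS 256

void multiband_limiter(float complex spectrum[NUM_BINS], float ceiling)
{
    for (int k = 0; k < NUM_BINS; k++) {
        float mag = cabsf(spectrum[k]);
        if (mag > ceiling)
            spectrum[k] *= ceiling / mag;   /* scale the bin down, keep its phase */
    }
}
```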
- the Inverse Fast Fourier Transform converts the Frequency Spectrum produced by the DSP back to a time domain audio signal.
- the process is based on the Diethorn algorithm that accurately reconstructs the audio stream.
- the Diethorn algorithm is designed so that if the audio input signal [1] is transformed by the Fast Fourier Transform [2] and the resulting Frequency Spectrum is then inverted by the Inverse Fast Fourier Transform [11], with no intervening processing, the original audio signal will be near perfectly reproduced.
- the reconstructed Audio output of the DSP is fed to a digital-to-analog converter.
- the analog signal is sent to the Speaker, which produces the processed sound for the User.
- Transmission to the Speaker can be through wired connections, for example, utilizing a standard audio jack or USB connector that is part of the Mobile Computing Device. Transmission can also be through a radio component utilizing standard transmission protocols such as analog FM, digital FM, as long as the latency of that transmission maintains an aggregate latency of under 40ms.
- the exemplary embodiment includes a proprietary Bluetooth protocol. Use of the proprietary Bluetooth protocol requires a modification of the DSP algorithm.
- the exemplary embodiment takes advantage of multiple microphones to receive input if available on the Mobile Computing Device.
- the iPhone 5 has three built-in microphones and can also receive input from an external microphone, such as a microphone that is associated with a wired or wireless earpiece.
- User Interface
- Mobile Computing Devices display digital content and controls to Users through an intuitive User Interface (UI) displayed as discrete screens.
- the UI may include, for example, various windows, tabs, icons, menus, sub-menus, and touch screen controls such as radio buttons, check boxes, slider bars, etc.
- Describing a set of primary screens of an exemplary embodiment can instruct and describe the underlying processes that they control.
- the described screens are for an exemplary embodiment implemented on an Apple device supporting iOS 7, which utilizes that device's touch screen interface.
- Main Screen (“RealClarity")
- the 'share' button activates a screen that allows a User to communicate with others about the app.
- the 'share' button can be used to send and share audio profiles, noise profiles, customized equalizer settings, etc.
- the 'info' button produces a text screen that provides information about the source screen.
- This screen has a vertical display slider which visually shows the presence of audio input through a colored column. If there is no column displayed, then no source input is being received, most often because the On/Off button is set to Off.
- the lighted On/Off button indicates the application is active.
- volume corresponds to overall device volume, which may also be adjusted using hardware buttons on the device, in some embodiments. It is best that this control be close to the maximum, as the gain reflected in this setting is outside the purview of the exemplary embodiment's processing.
- the 'boost' stepper allows the User to change the internal volume (or gain) of the processed audio.
- the best sound quality is achieved by first maximizing the hardware volume, and then increasing the internal volume.
- the audio feedback may be decreased by reducing either the 'boost' stepper or the 'volume' slider.
- This main "RealClarity" screen includes two large buttons, the 'filter' button which deals with noise control and the 'clarify' button which deals with gain adjustments.
- the 'Clarify' screen has two wheel controls and a 'Customize' button.
- the two wheels allow a User to adjust the clarity of the processed sound by modifying a Hearing Profile or an Equalizer Profile.
- the 'Customize' button allows the User to create or modify a Profile or activate a stored Profile. Custom settings may be set by spinning the wheel to the setting where you will find the slider icon.
- the symbols on the left wheel allow the User to select a pre-set Equalizer Profile setting, for example, Profiles may be selected for Speech, TV, Outdoors, Music, Movie and Live Event. There is an option to select "Off", which means to not use an Equalizer Profile and a setting, with a tuner icon, which means to use the selected customized Equalizer Profile.
- the symbols on the right wheel allow the User to select a pre-set Hearing Profile.
- the preset Hearing Profiles reflect average hearing loss by age from 40 to 85 in increments of 2 or 3 years.
- the age chosen is shown in the small wheel. There is a flat-line setting for ages below 40. In general, the higher the age, the more amplification there is for medium and high frequency sounds. Users can start with a setting close to their age, and then experiment up or down to find the setting that works best for them in different environments.
- the "Clarify” screen has a 'return' control (left carrot) in the upper left corner, as do many other screens, that, if selected, returns to the calling screen.
- the "ClarifyCustom” screen displays three buttons that allow the User to customize input and has a “Cancel” button that returns to the "Clarify” screen.
- the User can enable and/or customize a number of features with respect to the clarity of the desired sound.
- the "Enter your audiogram” allows the User to enter of modify a Hearing Profile by bringing up the "Hearing Profile” screen.
- the "Optimize headphone sound” allows the User to create a base profile that corrects for frequency anomalies in a wired earpiece by bringing up the "Headphone” screen.
- the "Equalizer screen allows the User to modify the current active Equalizer Profile, which is displayed.
- the Equalizer Profile shapes sound, much like the treble and bass controls on a stereo, but with more fine-grained frequency tuning.
- the horizontal axis displays the frequencies that can be set. The key voice frequencies are 500Hz to 4 KHz.
- the vertical axis displays the decibels that will be added to the gain of a frequency.
- the display bar at the bottom of the screen identifies the current active Equalizer profile.
- the User moves the frequency sliders to the shape desired. The User can then select the 'return' control to return to the calling screen.
- the User can save or activate a stored Equalizer Profile by selecting the 'next' control (right caret) at the bottom right corner of the screen.
- the "Equalizer Select” screen is displayed, which allows the user to activate a saved Equalizer profile.
- the "Equalizer Name” screen is displayed, which requires the User to name the modified Profile and stores it.
- the "Equalizer Select” screen is displayed with the newly saved Equalizer activated.
- the User can save or activate a stored Equalizer Profile by selecting the "Save” button at the top right corner of the screen.
- the Equalizer name screen allows the user to name and save the modified displayed Equalizer Profile.
- the Equalizer Select screen is displayed. The newly stored and named Equalizer Profile will be listed and checked as active.
- the "Equalizer Select" screen displays the set of saved Equalizer Profiles.
- the currently active Equalizer Profile is indicated by a check on the list of saved profiles.
- the User can activate another Equalizer Profile by selecting a name on the list.
- the check mark will move to that entry indicating that that profile is now the active Equalizer Profile.
- the currently displayed Equalizer Profile becomes the active Equalizer Profile and the calling screen is displayed.
- the Hearing Profile provides input to the DSP to add frequency-based gain to improve the audibility of Sound.
- the Hearing Profile contains separate profile components for the right and left ear.
- the vertical axis displays the decibels that will be added to the gain of a frequency.
- the vertical axis is inverted so that the display mimics a typical audiogram, which shows hearing loss in decibels increasing toward the lower settings.
- the display bar at the bottom of the screen identifies the displayed active Hearing Profile.
- the User moves the frequency sliders to the shape desired.
- the horizontal button bar selects the Hearing Profile component to display.
- the "Left” button displays the left-ear Hearing Profile component and the “Right” button displays the right-ear Hearing Profile component. If the Hearing Profile left and right components profiles are the same then the User can select the "Both” button.
- the supplied pre-set Hearing Profiles have the same profile for both the left and right ears. Modifications on the "Both” displayed screen will be recorded in both the right-ear and left-ear Hearing Profile components. If there is a difference between the right and left components, then modifying the displayed Hearing Profile will only modify the right-ear Hearing Profile component. The User can then select the 'return' control to return to the calling screen.
- the User can save or activate a stored Hearing Profile by selecting the 'next' control (right caret) at the bottom right corner of the screen.
- the "Profile Select” screen is displayed, which allows the user to activate a saved Hearing Profile.
- the "myProfile Name” screen is displayed, which requires the User to name the modified Hearing Profile and then stores it.
- the "Profile Select” screen is displayed with the newly saved Equalizer activated. The User can save or activate a stored Equalizer Profile by selecting the "Save” button at the top right corner of the screen.
- the "myProfile Name” screen is displayed, which requires the User to name the modified Hearing Profile and stores it.
- the "Profile Select” screen is displayed with the newly saved Hearing Profile activated.
- the "Profile Select” screen displays the set of saved Hearing Profiles.
- the currently active Hearing Profile is indicated by a check on the list of saved profiles.
- the User can activate another Hearing Profile by selecting a name on the list.
- the check mark will move to that entry indicating that that profile is now the active Hearing Profile.
- These screens initiate a test of Speakers in a wired earpiece to identify any anomalies in the frequency gain. This is done by executing the Speaker Response Estimator process that results in an active Headphone Profile. The resulting Headphone Profile can be stored for later sessions that use the same earpiece.
- Each model of earpiece has its own frequency characteristic or profile. This screen allows the exemplary embodiment to measure that characteristic. Once measured the sample is used to create a profile that is used by the DSP to produce the best sound possible and to minimize the likelihood of audio feedback. The smaller earbuds often have a frequency bump that can cause feedback.
- the User sets or holds the earpiece as pictured. The best results are obtained 1) by doing it in a relatively quiet place, and 2) by setting the hardware volume control about two-thirds of the way to the right. Then the User selects the "Start" button.
- the bar at the bottom of the screen displays the name of the active Headphone Profile.
- "Optimizing Headphone" screen, Fig. 13
- the optimization process takes about 15 seconds. This screen displays the duration of that optimization process. When the process is complete the "Save optimization?" pop-up screen is displayed.
- the computed profile is the active optimization profile for the current session. If the User selects the "Save and use” button, then a "Headphone Name” pop-up screen will display. Once the optimization profile is named, it will be stored and displayed as the active optimization profile in the "Headphone Select" screen.
- "Headphone Select" screen, Fig. 15
- the "Headphone Select" screen displays the set of saved Headphone Profiles.
- the currently active Headphone Profile is indicated by a check on the list of saved profiles.
- the User can activate another Headphone Profile by selecting a name on the list.
- the check mark will move to that entry indicating that that profile is now the active Headphone Profile.
- Filter Screens
- These screens provide parameters and Noise Profiles that are utilized by the DSP for noise control.
- the "Filter" screen has a vertical display slider which visually shows the presence of audio input through a colored column. Users can use the sliders on the vertical bar to reduce noise.
- the upper slider indicates a gain level that is used by the DSP to recognize sharp, sudden sounds that should not be amplified.
- the lower slider represents the gain level for low frequency audio, i.e., out of the speech range, that should not be damped. The reason these controls are on the slider is to give a User a visual clue about the appropriate settings by seeing a visualization of the audio being processed.
- the DSP processor has a capability to continuously estimate what is noise in the audio input.
- the algorithm works better with a static Noise Profile as long as that profile reflects noise in a stable environment, e.g., the air conditioner noise in an otherwise quiet room in which the User is participating in a meeting, the fairly constant noise produced in a traveling car, and, somewhat ironically, a very quiet environment, so the DSP algorithm does not guess wrong about what is noise and what is speech. If the user selects the "Sample Noise” button, the "Sampling" pop-up screen is displayed and the Noise estimating process is initiated.
- the sampling process takes about 5 seconds. It's best to sample when people are not speaking (since you will probably NOT want to filter or eliminate speech), so the User may often ask for a moment of silence.
- This screen displays the duration of the sampling process.
- the "Save noise sample?” pop-up screen is displayed. "Save noise sample?" screen, Fig. 18
- the computed Sample Noise Profile is the active profile for the current session.
- the slider on the horizontal bar of the "Advanced Filter” screen allows the user to fine tune the DSP process, primarily by affecting the timing of the transition once the DSP process decides that targeted speech has begun or that it has ended. In noisy environments the slider should be moved to the right, towards the "Reduce Noise” label. With this setting the DSP will quickly reduce noise, but in the process may clip the beginning of speech. In a quiet environment the slider should be moved to the left, so that the beginning of speech is not clipped.
- the "Select a Noise Filter to Use” section of the screen lists the stored Sample Noise Profiles, with the active Sample Noise Profile indicated by a check-mark. The User can select a different Sample Noise Profile from the list, which is then activated. The "Continuous adaption" profile is always available and is the default, if the user has not created or activated a stored Sample Noise Profile.
- the aggregate latency may be reduced by forgoing the need to analyze the audio input in the frequency domain (e.g., by performing a Fast Fourier Transform).
- both time and frequency domain Voice Activity Detection may be utilized by processing the audio input in a separate thread that identifies Speech in the time-domain. Processing in this thread may include dividing the audio signal, at regular intervals reflecting the acceptable latency, into two frames: a small frame and a large frame.
- the energy parameter (E) is calculated from the small frame and is used to detect the start-point and end-point of audio that is identified as speech. A pitch period (P) is detected and measured from the large frame, and the pitch detection is used to determine whether there is voiced speech, validating that the audio is speech and may be identified as speech mode.
- the start and end of speech, as detected in this thread would be sent as an argument to the DSP process.
- This disclosed embodiment may utilize a unique two-step method to detect speech sounds.
- the embodiment works in the constrained environment of the operating system of a commercial mobile device. Given a sound input source, the embodiment detects speech in that sound in real time.
- the embodiment, used in conjunction with a mobile device, can improve speech intelligibility for a listener. Once the speech is detected, the speech can be amplified and non-speech sound can be reduced or suppressed.
- An aspect of the embodiment is designed to detect speech in a short time. This is required because if the latency between the processed speech and the speech sound arriving directly at the listener's ear is too long, the listener's brain will not integrate the two sounds and speech clarity may be lost in the confusion of sound echoing.
- speech endpoints are detected in real time utilizing the computing power of an appropriate mobile device.
- the technique addresses a major constraint for detecting speech on such devices.
- One of the important requirements for hearing enhancement is that the time delay caused by processing the speech must be very short. The short period is defined as the time within which the majority of listeners would not hear the delay between the processed speech and the unprocessed ambient speech directly reaching a listener's ear. Listeners would not hear the delayed sound because, as long as the latency is very short, the listener's brain will integrate the two sounds. If the latency caused by the processing is longer, the latency would be noticeable and listening to the processed speech sound would be annoying or confusing.
- the embodiment describes a method that detects speech so that the latency of processing speech on a mobile device, including the built-in latency for the required processing of the device's operating system to input and output the sound, is very short.
- speech detection or voice activity detection (VAD)
- Speech detection identifies the starting and ending points of speech versus the ambient noise. Speech detection is typically based on changes in short time sound energy; some algorithms use additional parameters such as zero-crossing rate (the number of times the signal crosses the zero value) for assistance. This mechanism works because when someone talks, they talk louder than the background noise in order to be heard. This increase in sound energy can then be interpreted as speech.
- the embodiment first assumes a certain ambient noise level, derived either from the beginning of the input signal or from manual training, and establishes a speech threshold a few dB above the noise level. It then continuously measures the input short time (10-20ms frames of data) energy. When the input short time energy exceeds the speech threshold for a period of time (N), it decides that the speech has started. When the input signal is in speech and the short time energy drops below a threshold set close to the background noise level for a period of time (M), it decides that the speech has ended. To avoid false triggering of speech by short duration loud noise, the time period (N) for the speech trigger may range from 50ms to 200ms. Once the speech start is detected, the system back-tracks the input signal by the time period (N) to mark it as the real starting point of speech. The time period (N), therefore, is the delay of speech detection.
- Speech recognition systems utilize methods with delays of up to about 200ms, as these systems are not providing real-time hearing assistance.
- (N) can be as short as 50ms.
- with such a short (N), many short duration loud noises, such as a tap on the table, would trigger speech detection.
- Another problem with current speech detection systems when related to hearing assistance is the recognition that the ambient noise level has increased, such as when a person has just walked into a noisy restaurant. The increased sound energy would cause the higher level of ambient noise to be detected as speech.
- Current speech detection systems utilize some mechanism, such as automatic reset after a long period of continuous speech (i.e., tens of seconds) or by a manual user reset, to readjust the ambient noise level.
- This embodiment proposes a two-step method of speech detection to overcome the weakness as mentioned above. Once speech is detected, that speech can be amplified and background noise suppressed or reduced.
- Input signal is divided into two sequences of frames with frame size of 20ms and 40ms, respectively. Both sequences have the same frame interval of 10ms, that is, for every 10ms of input signal, a pair of frames, one with frame size 20ms and one with frame size 40ms are obtained. Therefore, the decision made based on a pair of frames (small and large) has an inherent delay of 10ms.
- a total energy (E) is calculated from the small frame
- a pitch period (P) is detected and measured from the large frame.
- Energy calculation and pitch measurement are well known prior art that can be found in many digital signal processing textbooks and publications.
- the energy (E) value would increase.
- vowels or voiced speech contain pitches that are caused by vibration of the vocal cords and display a periodic pattern.
- Human voice pitch frequencies range from 100Hz to 400Hz, which translate to pitch period of 10ms to 2.5ms. Since background noise rarely presents such periodic pitch pattern, detection of voiced speech or pitches is a reliable indication of speech, even in a noisy environment. However, not all speech is voiced.
- consonants such as "f" and "s" are unvoiced: they do not have pitches and are difficult to distinguish from noise. Fortunately, almost every word contains voiced speech, the beginning consonant is short, typically 20-100ms long, and the transitional period from consonant to vowel typically shows some pitch pattern as well. A large frame of 40ms contains multiple pitch cycles and can result in more reliable pitch detection and measurement.
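One common way to obtain such a pitch measurement on the 40ms frame is normalized autocorrelation restricted to lags corresponding to 100-400Hz; the sketch below assumes that approach, and its threshold and normalization are illustrative rather than the embodiment's exact method.

```c
/* Illustrative autocorrelation-based pitch check on the 40 ms "large" frame:
 * look for a periodicity whose lag corresponds to 100-400 Hz (pitch period
 * 2.5-10 ms).  The normalization and threshold are assumptions. */

/* Returns the detected pitch period in samples, or 0 if no pitch is found. */
int detect_pitch(const float *frame, int frame_len, float sample_rate)
{
    int min_lag = (int)(sample_rate / 400.0f);   /* 400 Hz -> 2.5 ms */
    int max_lag = (int)(sample_rate / 100.0f);   /* 100 Hz -> 10 ms  */

    float energy = 0.0f;
    for (int n = 0; n < frame_len; n++)
        energy += frame[n] * frame[n];
    if (energy <= 0.0f)
        return 0;

    int   best_lag  = 0;
    float best_corr = 0.0f;
    for (int lag = min_lag; lag <= max_lag && lag < frame_len; lag++) {
        float corr = 0.0f;
        for (int n = 0; n + lag < frame_len; n++)
            corr += frame[n] * frame[n + lag];
        corr /= energy;                          /* normalized autocorrelation */
        if (corr > best_corr) { best_corr = corr; best_lag = lag; }
    }

    /* Voiced speech shows a strong periodic peak; noise rarely does. */
    return (best_corr > 0.4f) ? best_lag : 0;
}
```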
- the two-step method uses the energy to detect endpoints of speech, and the pitch detection to determine whether there is voiced speech.
- the energy-based speech detection responds quickly to speech, in 10ms as determined by the frame interval. Such a short delay is critical for hearing enhancement applications. However, it can easily be triggered by increased noise as well.
- the pitch-based voiced speech detection distinguishes real speech from increased noise, but it takes a longer duration (tens of milliseconds to a few seconds) to make a decision. If no voiced speech is detected after the speech trigger, the detected speech is cut short and the speech detection threshold is updated to the increased noise level.
- the speech detection algorithm has two modes, noise mode where input signal is assumed as noise, and speech mode where input signal is assumed as speech.
- An input frame is labeled as "noise” in noise mode, and "speech” in speech mode, until the detection mode switches from one to another.
- when speech is detected, the algorithm switches from noise mode to speech mode; when speech ends or is cut short, it switches from speech mode to noise mode.
- the algorithm starts with noise mode. The following outlines the speech detection algorithm:
- an energy (E) is calculated from the small frame.
- in noise mode, if (E) is above a speech detection threshold (T), detection enters speech mode and the current frame is labeled as speech; otherwise, the overall noise level is updated over the sequence of previous "noise" frames and the speech detection threshold (T) is adapted to the new noise level.
- a pitch measurement is calculated from the large frame. If pitch is detected and the pitch period is between 2.5ms and 10ms, the frame is labeled as 'voiced'. Over a predetermined duration (M), typically between 100ms and 5 seconds, if the number of "voiced" frames exceeds a threshold (L), it is determined that there is real voiced speech in the current speech mode; otherwise, there is no voiced speech and the speech mode is invalid, and: a. the current frame is labeled as noise and the detection mode switches to noise, b. if no voiced speech has ever been detected in the current speech mode, the overall noise level is updated over the sequence of previous frames, including those labeled as "speech" in the same speech mode, and the speech detection threshold (T) is adapted to the new noise level.
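The outline above can be read as a small two-mode state machine. The following sketch assumes the pitch measurement comes from a routine like detect_pitch() above; the thresholds, the adaptation rates, and the specific end-of-speech test are illustrative placeholders for the (N), (M) and (L) parameters described in the text.

```c
/* Minimal sketch of the two-step detector outlined above: per 10 ms interval,
 * an energy measurement from the small frame drives the noise/speech mode
 * switch, and pitch detection on the large frame validates (or invalidates)
 * the speech mode.  Thresholds and adaptation rates are illustrative. */

enum mode { NOISE_MODE, SPEECH_MODE };

typedef struct {
    enum mode mode;
    float noise_level;     /* running estimate of noise energy */
    float threshold;       /* speech detection threshold (T) */
    int   voiced_frames;   /* "voiced" frames seen in the current speech mode */
    int   frames_in_mode;  /* frames since the last mode switch */
} vad_state;

/* energy E comes from the 20 ms small frame; pitch_period (in samples) comes
 * from detect_pitch() on the 40 ms large frame, 0 if no pitch was found. */
int vad_step(vad_state *s, float energy, int pitch_period)
{
    const int check_after   = 50;   /* stand-in for (M): ~500 ms at a 10 ms interval */
    const int voiced_needed = 5;    /* stand-in for the voiced-frame threshold (L) */

    s->frames_in_mode++;

    if (s->mode == NOISE_MODE) {
        if (energy > s->threshold) {
            s->mode = SPEECH_MODE;              /* energy trigger: assume speech */
            s->voiced_frames = 0;
            s->frames_in_mode = 0;
        } else {
            /* update the noise level and adapt the threshold (T) */
            s->noise_level = 0.95f * s->noise_level + 0.05f * energy;
            s->threshold   = 4.0f * s->noise_level;
        }
    } else { /* SPEECH_MODE */
        if (pitch_period > 0)
            s->voiced_frames++;

        if (s->frames_in_mode >= check_after && s->voiced_frames < voiced_needed) {
            /* no voiced speech: the trigger was really louder noise */
            s->mode = NOISE_MODE;
            s->noise_level = energy;            /* adapt to the new noise level */
            s->threshold   = 4.0f * s->noise_level;
        } else if (energy < 1.5f * s->noise_level) {
            s->mode = NOISE_MODE;               /* energy dropped: speech ended */
            s->frames_in_mode = 0;
        }
    }
    return s->mode == SPEECH_MODE;              /* 1 while frames are labeled speech */
}
```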
- the voiced speech detection based on pitch measurement can be running continuously also in noise mode to reliably obtain a noise reference model.
- if voiced speech is detected while in noise mode, the frame is labeled as speech, the detection enters speech mode, and the speech detection threshold (T) is further lowered to reflect a low signal-to-noise ratio.
- the energy-based speech detection depends on a threshold (T), which is set based on the noise energy. Therefore, the robustness of the detection depends on the reliability of obtaining a noise reference.
- Pitch detection can be used to reliably obtain a noise reference in the noise mode by detecting a period of sound at least one or a few seconds long where no pitch is detected, denoting this period of sound as unvoiced sound. By discarding the beginning and ending parts (e.g., a few hundred milliseconds each) of this unvoiced sound, the center part of the unvoiced sound can reliably serve as a noise reference.
- a filter bank can be used to obtain a set of energy values across a frequency spectrum for speech detection instead of the total energy.
- a filter bank is an array of band pass filters covering the voice spectrum, such as from 100 Hz to 5000Hz, with each band pass filter covering a different frequency sub band.
- soft speech, typically an unvoiced consonant, has higher energy in one or more sub-bands even when its total energy may be very close to the background noise. For example, the consonant "f" or "s" has higher energy in the frequency sub-band of 2000Hz and above.
- a filter bank output therefore can be used to detect speech in each frequency sub band, which is more sensitive than the total energy.
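A simple way to realize such a filter bank, assuming FFT bins are grouped into sub-bands spanning roughly 100Hz to 5kHz and an energy is computed per band; the band edges and names are illustrative.

```c
/* Sketch of using FFT bins as a simple filter bank over the voice range
 * (~100 Hz to 5 kHz): group bins into sub-bands and compute an energy per
 * sub-band, so that soft consonants ("f", "s") that only raise energy in the
 * high bands can still trigger detection.  Band edges are illustrative. */
#include <complex.h>

#define NUM_BINS  256
#define NUM_BANDS 6

void subband_energies(const float complex spectrum[NUM_BINS],
                      float sample_rate,
                      float band_energy[NUM_BANDS])
{
    /* Sub-band edges in Hz across the voice spectrum. */
    static const float edges[NUM_BANDS + 1] =
        { 100.0f, 300.0f, 700.0f, 1200.0f, 2000.0f, 3400.0f, 5000.0f };

    float hz_per_bin = sample_rate / NUM_BINS;

    for (int b = 0; b < NUM_BANDS; b++) {
        int k_lo = (int)(edges[b]     / hz_per_bin);
        int k_hi = (int)(edges[b + 1] / hz_per_bin);
        float e = 0.0f;
        for (int k = k_lo; k < k_hi && k < NUM_BINS / 2; k++) {
            float m = cabsf(spectrum[k]);
            e += m * m;
        }
        band_energy[b] = e;   /* compare each band against its own noise floor */
    }
}
```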
- the adaptation may use different speeds depending on whether the noise level is increasing or decreasing and on the distance of the energy level of previously detected voiced speech from the noise level. Faster adaptation to a lower noise level makes it more likely to detect soft speech in rapidly changing ambient noise. And if the distance of the energy level of detected voiced speech from the noise level is small (an indication of a low signal-to-noise level), the speech detection threshold (T) may be set lower to more easily detect soft speech.
- Additional Hearing Profile and Equalizer Settings
- a facility would be offered that allows a User to take a "standard" hearing test and create and store a resulting audiogram.
- the hearing test would be implemented by having the User indicate whether they can hear a sound of a certain frequency, decreasing the gain at that frequency until it cannot be heard. Given that the hearing test would utilize the same earpiece and speaker system that the User will use for hearing assistance, the resulting audiogram can be more usable than an audiogram resulting from an externally administered hearing test. Also, the hearing test could be performed in controlled but different auditory settings, potentially providing more accurate audiogram variants.
- a speech intelligibility test would be offered to more precisely deal with a particular User's audibility. The intelligibility test would be accomplished by playing words at various levels of sound and noise. The result of the intelligibility test, for example, an inability to distinguish certain consonants, would be provided to an enhanced DSP that would be able to process the information and moderate that User's intelligibility issues.
- a facility would be offered for a User to create paired equalizer settings for the left and right ears. This would be especially useful for Users with a marked difference in audibility between the left and right ear.
- Equalizer Profiles would be selected based on an analysis of the audio input being processed. For example, different Equalizer Profiles would be selected as a User went from a quiet to a noisy environment or switched from listening to music to listening to targeted Speech. A UI would be provided to the User to associate Equalizer Profiles with an audio environment.
- the speech intelligibility aspect of a hearing test could be accomplished by playing words at various levels of sound and noise.
- the processor could take information from the speech test to enhance and/or modify the basic hearing profile.
- the alternate embodiment would have controls to record and store any input audio or processed audio on the Mobile Computing Device's local storage or in the Cloud.
- the alternate embodiment would have controls to access the stored audio, so a User could rehear the stored audio; controls to reprocess the stored audio, for example, to create or refine Profiles and re-sample noise; and controls to utilize stored audio in a hearing test.
- the alternate embodiment would have controls to set a preferred volume level. This would be implemented by allowing Users to select a volume level utilizing prerecorded sound. The embodiment would use the selected sound level to adjust for gain changes in the real-time audio input.
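One way such a preferred-level adjustment could be realized is sketched below; RMS matching and the smoothing constant are assumptions rather than the embodiment's actual method.

```python
# Illustrative sketch: nudge the output gain so that real-time audio matches the
# preferred level a User chose with prerecorded sound.
import numpy as np

def gain_to_preferred(block, preferred_rms, current_gain=1.0, smooth=0.1):
    """block: numpy array of samples. Returns an updated gain that moves the
    block's RMS toward preferred_rms without abrupt jumps."""
    rms = np.sqrt(np.mean(block ** 2)) + 1e-12
    target_gain = preferred_rms / rms
    return (1 - smooth) * current_gain + smooth * target_gain
```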
- the alternate embodiment would have a facility to be trained to recognize a keyword such that when a User utters that keyword the embodiment expects a following command phrase.
- the embodiment would provide a set of audio command phrases as an alternate to the UI controls.
- the alternative embodiment would monitor ambient sound and when Speech is recognized, would reduce or mute the gain from the electronic audio that is produced by another application, allowing processed ambient sound to be heard by a User.
- the alternative embodiment would also have a UI control that explicitly switches between electronic audio and ambient sound processing.
- the alternative embodiment would process the electronic audio in the same manner that ambient sound is processed, so that Users would get the benefits of hearing assistance for electronic audio.
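The speech-triggered switching between electronic audio and processed ambient sound described above might be realized along the lines of the following sketch; the duck depth and the hard ambient on/off mixing are illustrative assumptions.

```python
# Illustrative sketch: duck or mute another application's audio while Speech is
# detected in the ambient input, and let the processed ambient sound through.
def mix_with_ducking(app_audio, ambient_processed, speech_detected, duck_gain=0.1):
    """Both inputs are same-length sample blocks; returns the block sent to the ear."""
    app_gain = duck_gain if speech_detected else 1.0
    ambient_gain = 1.0 if speech_detected else 0.0
    return [app_gain * a + ambient_gain * b for a, b in zip(app_audio, ambient_processed)]
```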
- the alternate embodiment would have explicit mechanisms for other applications to provide audio input, allowing the other applications to take advantage of the "always-on" audio connection with a User. For example, Users could get appointment reminders whispered in their ear, and be connected to body-area health monitors where they could, for example, receive an audio warning of unusually high blood pressure.
Utilizing the Internet and its Cloud
- a history of a designated set of parameter settings, state variables, and Profiles is saved to storage on the Cloud, such that the history of use can be examined and previous settings can be restored.
- the embodiment could interface with other internet services.
- because the User would be using the embodiment continuously for hearing assistance, there would be an "always on" audio connection to the Internet. This would be useful to push audio information to a User, specifically advertisements and marketing messages.
- the Bluetooth radio link consists of various packets sent in "time slots", where a time slot is 625 microseconds.
- the synchronous packets are designed to carry voice signals, whereas the asynchronous packets are designed to carry data.
- the SCO packets are real-time and provide no recovery for lost packets.
- eSCO has a modest retransmit capability and ACL has a full ack-type protocol to ensure data reliability at the expense of uncertain delivery time.
- Bluetooth profiles determine the type of packets that are used for each case. For wireless headsets two profiles are almost universally supported. One is called HFS (Hands Free) and the other is called A2DP. HFS uses SCO packets and sends data via the RFCOMM API, a serial port emulation that uses AT commands to control call setup, select modes, etc. HFS supports bi-directional calls but only 64 kbps data rates (mono and low fidelity). A2DP uses ACL packets and sends data via the GAVDP interface. It is uni-directional and can support data rates up to 721 kbps.
- the innovation presented here may mimic the input for an A2DP profile, in which case a receiver that handles the A2DP profile may be useable. However, it may be that a proprietary profile needs to be defined; this would be a third profile beyond the standard HFS and A2DP profiles and may require specialized receivers.
- the audio coder: this includes coding for error recovery at both the bit and packet level.
- the coder also handles bit rate synchronization due to the difference in clock signals of the Bluetooth link and the sampling rate of the signal chain.
- the novelty of the presented invention lies in the audio coding and how it interfaces to the signal processing chain.
- One aspect of the invention is the efficiencies that are possible by more closely integrating the output of the signal processing chain and the sub band filters that are used in many audio coders.
- SBC, the Bluetooth default coder for music, is of this type, for example.
- the delay is determined by the input and output buffers, which, in turn, depend upon the number of sub-bands. Half of the delay comes on the input and the other half comes on the final output, when the data is sent one sample at a time to the DAC.
- One key aspect of this approach is to make sure that additional delay is introduced only by the radio link and to minimize the delay due to serialization for the radio link.
- the SBC codec is a subband-based algorithm with block-based ADPCM coding of the subband outputs. By making an entire buffer available (the output of the IFFT), the SBC has enough data to begin processing. The normal delay of waiting for a sufficient number of samples is bypassed. Processing efficiency is possible by converting from the oversampled complex frequency domain of the FFT to the subband filters of many coder algorithms.
- the A2DP SBC codec is one option
- Adaptive Differential Pulse Code Modulation, when implemented with backward prediction, is a 0 ms delay codec.
- ADPCM Adaptive Differential Pulse Code Modulation
- Early versions were implemented for compressing telephone calls from 64 kbps to 32 kbps. To achieve greater bandwidth than the 3.2 kHz bandwidth of the phone network, filter banks were developed to break the desired frequency range into smaller bands, with ADPCM then used to code the output of each of the bands. Note that there is no requirement that the same number of bits be used to code each sub band.
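For illustration only, a toy backward-adaptive ADPCM-style coder is sketched below. It is not the G.726 or IMA algorithm; it simply shows why backward prediction and backward step-size adaptation need no lookahead buffer and therefore add no algorithmic delay.

```python
# Illustrative sketch: a tiny backward-adaptive ADPCM-style coder. Prediction
# uses only the previous reconstructed sample, and the step size adapts only
# from transmitted codes, so encoder and decoder stay in sync sample by sample.

def adpcm_encode(samples, step=0.01):
    codes, pred = [], 0.0
    for x in samples:
        code = max(-3, min(3, int(round((x - pred) / step))))  # 3-bit code
        codes.append(code)
        pred += code * step                                    # reconstructed sample
        step = min(max(step * (1.4 if abs(code) >= 2 else 0.9), 1e-4), 1.0)
    return codes

def adpcm_decode(codes, step=0.01):
    out, pred = [], 0.0
    for code in codes:
        pred += code * step
        out.append(pred)
        step = min(max(step * (1.4 if abs(code) >= 2 else 0.9), 1e-4), 1.0)
    return out
```

Because both ends update the predictor and step size from the same transmitted codes, each sample can be coded and decoded as soon as it arrives.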
- the signal processing chain does analysis using the overlap-add method and processes 256 samples into 256 frequency bins. With 75% overlap, 64 of the output samples are valid after every overlap-add execution (i.e., one cycle through the signal chain).
- the delay from the input of 256 samples to the 64 output samples, at 44.1 kHz, is 5.8 ms plus the processing time.
- the processing time is under 0.1ms, so the total processing delay is less than 5.9ms.
- the total delay includes the input and output delay of the device.
- An iPod Touch Gen 4 has 5.2 ms of delay for 256 samples at 44.1 kHz. This is in addition to the 256-sample delay for the processing.
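The delay figures quoted above can be checked with simple arithmetic; the sketch below uses only the frame size, hop size, processing bound and device delay stated in the text.

```python
# Worked check of the quoted delay figures (plain arithmetic, no assumptions
# beyond the values already given in the text).
fs = 44100.0
frame = 256                        # analysis frame (samples)
hop = 64                           # valid output samples per overlap-add cycle (75% overlap)
processing_ms = 0.1                # stated upper bound on processing time
device_io_ms = 5.2                 # device input/output delay quoted for an iPod Touch Gen 4

frame_delay_ms = 1000 * frame / fs                 # ~5.8 ms to gather one 256-sample frame
hop_period_ms = 1000 * hop / fs                    # ~1.45 ms between output updates
chain_delay_ms = frame_delay_ms + processing_ms    # < 5.9 ms through the signal chain
total_ms = chain_delay_ms + device_io_ms           # ~11.1 ms before the wireless transport delay

print(round(frame_delay_ms, 1), round(hop_period_ms, 2), round(chain_delay_ms, 1), round(total_ms, 1))
# 5.8 1.45 5.9 11.1
```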
- MDCT Modified Discrete Cosine Transform
- This delay is less than the estimated system output delay. If we assume that the earpiece delay plus the MDCT delay is equal to half the system delay, the total wireless delay will be 11.1 ms plus the wireless transport delay.
- Coders based on subband filtering followed by quantization and coding. The delay and most of the calculations are due to the subband filtering.
- Coders based on linear prediction followed by quantization and coding. The delay for these coders can be very small.
- Some prediction coders such as ADPCM can have zero delay. Pairing these coders with the signal processing chain yields a total delay of 5.9 ms plus the delay of the coder. If ADPCM is the coder, for example, the total delay is under 6 ms.
- the radio link audio processing includes coding and error prevention and recovery.
- the table below shows the types of packets and their corresponding delays and bit rates that are illustrative of the Bluetooth packet types that could be used for this application.
- HV3 is an SCO packet. HV3 packets are sent without options for re-transmit. EV3 is an eSCO packet. eSCO packets have a re-transmission request if the CRC indicates a problem. EV3 may be a good choice because 1) it has a re-transmit capability, 2) if we put in redundancy for packet loss, the delay would be 5 ms if we repeated each packet (note this would require compression to 48kbps).
- ACL provides several options, some of which include FEC and CRC at the expense of bandwidth. Error rates may indicate another choice.
- the data will be sent in DM1 ACL packets.
- the data bits will be packed with the voice bits, so the effective data rate will be somewhat less than the ideal rate.
- the data rate is expected to be low enough that the audio bit rate for EV3 will be over 90 kbps.
- the audio that will be sent over the wireless link from the signal processing chain is bandwidth limited to 8 kHz.
- the dynamic range is reduced by the broadband AGC. (Note: we may want to move the broadband AGC.)
- Delay can be added into the system when buffers are serialized. This implies that fitting a frame of data into one or, at most, two packets is advantageous. Based on the data above, a bit rate of about 95,500 bps would yield the lowest delay (including overhead for error recovery/mitigation).
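The packet-fitting argument can be illustrated with a small calculation; the 30-byte payload used in the example is the EV3 maximum and the frame duration follows from the 64-sample hop at 44.1 kHz, but the numbers are illustrative rather than a full packet table.

```python
# Illustrative arithmetic: how many Bluetooth packets one coded frame needs at a
# given bit rate, given one packet per payload. Payload sizes are examples only.
import math

def packets_per_frame(bit_rate_bps, frame_ms, payload_bytes):
    bits_per_frame = bit_rate_bps * frame_ms / 1000.0
    return math.ceil(bits_per_frame / (payload_bytes * 8))

# A 1.45 ms frame (64 samples at 44.1 kHz) coded at ~96 kbps is ~139 bits, which
# fits in a single 30-byte EV3 payload; higher rates or smaller payloads force a
# second packet and therefore extra serialization delay.
print(packets_per_frame(96000, 1.45, 30))   # 1
```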
- Wireless communications links have particular levels of susceptibility with respect to increasing range, interference from other devices and the effects of multipath propagation.
- the audio codec has a role to play in terms of tolerance to bit errors and recovery from longer-term data loss.
- the maximum allowable time for the audio decoder to re-synchronize to the data stream after longer-term data loss is of the order of 3ms.
- AMR uses information about the channel to determine the bit rate. See also the literature on using jitter buffer information to adjust the encoding and decoding, loss concealment, etc.
- This innovation leverages the sub-band structure of the signal processing chain to produce a sub-band coder for the wireless link.
- the complex frequency representation is converted into a real modified cosine transform sub-band representation. Then the sub-band outputs are quantized and coded.
- SBC combines several output vectors, which adds latency, so the key is not to combine successive frames.
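A much-simplified stand-in for such a sub-band coder is sketched below: it groups the frequency bins already produced by the signal chain into sub-bands and quantizes each band against its own scale factor, one frame at a time so that successive frames are never combined. The MDCT conversion, phase handling and real bit allocation of an actual coder are deliberately omitted.

```python
# Much-simplified stand-in for the sub-band coder described above. Phase is
# dropped and bit allocation is fixed, purely for brevity of illustration.
import numpy as np

def code_frame(freq_bins, n_subbands=8, bits=4):
    """freq_bins: complex FFT output of one frame. Returns (scales, codes)."""
    mags = np.abs(freq_bins)
    bands = np.array_split(mags, n_subbands)
    levels = 2 ** bits - 1
    scales, codes = [], []
    for band in bands:
        scale = band.max() + 1e-12                       # per-band scale factor
        scales.append(scale)
        codes.append(np.round(band / scale * levels).astype(np.uint8))
    return scales, codes

def decode_frame(scales, codes, bits=4):
    levels = 2 ** bits - 1
    return np.concatenate([np.asarray(c, dtype=float) / levels * s
                           for s, c in zip(scales, codes)])
```

Coding each frame independently keeps the coder from adding latency beyond the frame already produced by the signal chain, which is the point made above about not combining successive frames.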
Landscapes
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Neurosurgery (AREA)
- Otolaryngology (AREA)
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Circuit For Audible Band Transducer (AREA)
- Telephone Function (AREA)
- Headphones And Earphones (AREA)
Abstract
The present invention relates to a method and apparatus adapted to provide a feature-rich hearing assistance device using software that runs in the standard operating environment of commercially available mobile platforms. Examples of these operating systems include: iOS for Apple's iPhone, iTouch and iPad product line; Google Android, used by several smartphones, tablets and other mobile platforms; and Microsoft's Windows Mobile. Furthermore, the operating environment could include lower-level subroutines called by the operating environment, such as the device drivers and firmware that may be used to support chipsets, for example, and that are included in commercially available mobile platforms. In addition to using these mobile platforms, the described embodiment uses wired earpieces and wireless earpieces. The mobile platform transmits stereo or mono sound to the wireless earpieces by means of communication protocols such as, for example, the Bluetooth A2DP protocol for high-audio-quality stereo transmission, very low-latency digital transmission, or inexpensive special-purpose RF or Bluetooth transmission. [Refer to section 4 for a description of the required earpiece.] The exemplary embodiment of the invention also uses a very easy-to-use microphone extension that plugs into the microphone jack of the mobile platform. The microphones used may have special characteristics in order to ensure effective processing of ambient sound, particularly speech.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201361829242P | 2013-05-30 | 2013-05-30 | |
| US61/829,242 | 2013-05-30 | | |
Publications (3)
| Publication Number | Publication Date |
|---|---|
| WO2014194273A2 true WO2014194273A2 (fr) | 2014-12-04 |
| WO2014194273A8 WO2014194273A8 (fr) | 2015-01-08 |
| WO2014194273A3 WO2014194273A3 (fr) | 2015-11-26 |
Family
ID=51989545
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2014/040359 Ceased WO2014194273A2 (fr) | 2013-05-30 | 2014-05-30 | Systèmes et procédés d'amélioration d'une audibilité ciblée |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2014194273A2 (fr) |
Family Cites Families (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5566237A (en) * | 1994-02-03 | 1996-10-15 | Dobbs-Stanford Corporation | Time zone equalizer |
| US7120258B1 (en) * | 1999-10-05 | 2006-10-10 | Able Planet, Inc. | Apparatus and methods for mitigating impairments due to central auditory nervous system binaural phase-time asynchrony |
| US7054453B2 (en) * | 2002-03-29 | 2006-05-30 | Everest Biomedical Instruments Co. | Fast estimation of weak bio-signals using novel algorithms for generating multiple additional data frames |
| US7257372B2 (en) * | 2003-09-30 | 2007-08-14 | Sony Ericsson Mobile Communications Ab | Bluetooth enabled hearing aid |
| US8886545B2 (en) * | 2007-03-07 | 2014-11-11 | Vlingo Corporation | Dealing with switch latency in speech recognition |
| US7929722B2 (en) * | 2008-08-13 | 2011-04-19 | Intelligent Systems Incorporated | Hearing assistance using an external coprocessor |
| DK2164066T3 (da) * | 2008-09-15 | 2016-06-13 | Oticon As | Støjspektrumsporing i støjende akustiske signaler |
- 2014-05-30 WO PCT/US2014/040359 patent/WO2014194273A2/fr not_active Ceased
Cited By (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP3528251A4 (fr) * | 2016-10-12 | 2019-08-21 | Alibaba Group Holding Limited | Procédé et dispositif destinés à détecter un signal audio |
| US10706874B2 (en) | 2016-10-12 | 2020-07-07 | Alibaba Group Holding Limited | Voice signal detection method and apparatus |
| WO2019035835A1 (fr) * | 2017-08-17 | 2019-02-21 | Nuance Communications, Inc. | Détection à faible complexité de parole énoncée et estimation de hauteur |
| US11176957B2 (en) | 2017-08-17 | 2021-11-16 | Cerence Operating Company | Low complexity detection of voiced speech and pitch estimation |
| US11363147B2 (en) | 2018-09-25 | 2022-06-14 | Sorenson Ip Holdings, Llc | Receive-path signal gain operations |
| CN113129917A (zh) * | 2020-01-15 | 2021-07-16 | 荣耀终端有限公司 | 基于场景识别的语音处理方法及其装置、介质和系统 |
| CN111933168A (zh) * | 2020-08-17 | 2020-11-13 | 齐鲁工业大学 | 基于binder的软回路动态消回声方法及移动终端 |
| CN111933168B (zh) * | 2020-08-17 | 2023-10-27 | 齐鲁工业大学 | 基于binder的软回路动态消回声方法及移动终端 |
| WO2022079476A1 (fr) * | 2020-10-15 | 2022-04-21 | Palti Yoram Prof | Dispositif de télécommunication assurant une compréhension améliorée de la parole dans des environnements bruyants |
| US20240420729A1 (en) * | 2022-08-31 | 2024-12-19 | Elisa Oyj | Computer-implemented method for detecting activity in an audio stream |
| EP4581619A1 (fr) * | 2022-08-31 | 2025-07-09 | Elisa Oyj | Procédé mis en oeuvre par ordinateur pour détecter une activité dans un flux audio |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2014194273A8 (fr) | 2015-01-08 |
| WO2014194273A3 (fr) | 2015-11-26 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20150281853A1 (en) | Systems and methods for enhancing targeted audibility | |
| US12279092B2 (en) | Interactive system for hearing devices | |
| CN111447539B (zh) | 一种用于听力耳机的验配方法和装置 | |
| US10186276B2 (en) | Adaptive noise suppression for super wideband music | |
| WO2014194273A2 (fr) | Systèmes et procédés d'amélioration d'une audibilité ciblée | |
| JP6374529B2 (ja) | ヘッドセットと音源との間のオーディオの協調的処理 | |
| CN105960794B (zh) | 用于语音命令的智能蓝牙耳机 | |
| JP6325686B2 (ja) | ヘッドセットと音源との間のオーディオの協調的処理 | |
| US20120263317A1 (en) | Systems, methods, apparatus, and computer readable media for equalization | |
| JP2014524593A (ja) | 適応音声了解度プロセッサ | |
| JP2011511571A (ja) | 複数のマイクからの信号間で知的に選択することによって音質を改善すること | |
| US20080228473A1 (en) | Method and apparatus for adjusting hearing intelligibility in mobile phones | |
| US20140365212A1 (en) | Receiver Intelligibility Enhancement System | |
| CN109416914A (zh) | 适于噪声环境的信号处理方法和装置及使用其的终端装置 | |
| US6999922B2 (en) | Synchronization and overlap method and system for single buffer speech compression and expansion | |
| TWI503814B (zh) | 使用時間上及/或頻譜上緊密音訊命令之控制 | |
| EP2743923B1 (fr) | Dispositif et procédé de traitement vocal | |
| JP2025527151A (ja) | インテリジェントな発話又は対話の強化 | |
| CN113571072B (zh) | 一种语音编码方法、装置、设备、存储介质及产品 | |
| US8340972B2 (en) | Psychoacoustic method and system to impose a preferred talking rate through auditory feedback rate adjustment | |
| CN115713942A (zh) | 音频处理方法、装置、计算设备及介质 | |
| JP5027127B2 (ja) | 背景雑音に応じてバイブレータの動作を制御することによる移動通信装置の音声了解度の向上 | |
| EP4258689A1 (fr) | Prothèse auditive comprenant une unité de notification adaptative | |
| US11302342B1 (en) | Inter-channel level difference based acoustic tap detection | |
| JPH11331328A (ja) | ハンズフリー電話装置 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 14804347; Country of ref document: EP; Kind code of ref document: A2 |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 14804347; Country of ref document: EP; Kind code of ref document: A2 |