
WO2025229374A1 - A method for speech recognition using tactile stimulation - Google Patents

A method for speech recognition using tactile stimulation

Info

Publication number
WO2025229374A1
Authority
WO
WIPO (PCT)
Prior art keywords
frequencies
tactile
frequency
audio input
key features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/IB2024/054167
Other languages
French (fr)
Inventor
Evagoras XYDAS
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Irerobot Ltd
Original Assignee
Irerobot Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Irerobot Ltd filed Critical Irerobot Ltd
Priority to PCT/IB2024/054167 priority Critical patent/WO2025229374A1/en
Publication of WO2025229374A1 publication Critical patent/WO2025229374A1/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/016 Input arrangements with force or tactile feedback as computer generated output to the user
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B21/00 Teaching, or communicating with, the blind, deaf or mute
    • G09B21/009 Teaching or communicating with deaf persons
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/16 Transforming into a non-visible representation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/15 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L2021/065 Aids for the handicapped in understanding

Definitions

  • the present invention relates to speech recognition, in particular to devices and methods for converting speech to tactile stimulation for recognition.
  • Hearing loss is a problem that affects millions of people. There are different degrees of hearing loss ranging from complete deafness from birth to diminished hearing from aging. Hearing aids are sufficient for many people having diminished hearing but in more severe cases such as for people who are fully deaf, hearing aids are usually not able to help them perceive speech or music. People who are both deaf and blind have even greater difficulty in communication.
  • Tactile stimuli may be conveniently provided on a particular skin area of the hand for example.
  • the use of tactile devices as a substitute for audition is dependent on the efficacy of phonemic recognition via tactile signals alone.
  • vibrotactile aids can enable deaf people to detect segmental and suprasegmental features of speech, and the discrimination of common environmental sounds.
  • Existing vibrotactile aids provide very useful information as to speech and non-speech stimuli. However, this is not enough. It is necessary to provide the means for accurate speech recognition so that deaf people are able to understand the meaning of what is being said. There is a need for a means of sound-to-tactile stimulation that enables the distinction of vowels.
  • the provided methods detect speech sounds and convert those sounds into haptic stimuli, in real time.
  • the spectrum of sound frequencies is represented through a corresponding spectrum of tactile stimulation by using a vibrating surface that has a plurality of modal nodes.
  • the sound frequencies corresponding to vowel sounds of the detected sound are extracted and utilized to provide tactile stimulus.
  • this invention provides tactile stimuli that correspond to both spatial and frequency domains through vibration nodes on the surface of a membrane that is in contact with the skin of the user.
  • a method for enabling comprehension of speech by tactile stimulation includes the following steps: receiving an audio input signal; performing an analog-to-digital conversion on the audio input signal; performing a Discrete Fourier Transformation on the digital audio input signal to generate a frequency domain representation of the digital audio input signal; extracting one or more key features of the audio input signal based on a frequency band in the frequency domain representation, wherein the key features contain frequencies that are representative of at least one vowel; and converting the key feature frequencies to one or more tactile frequencies, which comprises performing a mapping process on the key feature frequencies to shift or map them into one or more tactile frequencies, such that the mapping process transfers the frequencies from a high-frequency interval to a low-frequency interval, wherein the low-frequency interval is within a perceivable range of tactile frequencies.
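The claimed steps can be sketched end to end in a short Python program. Everything below is an illustrative toy, not the patented implementation: the naive DFT, the non-overlapping search bands, and the compressive `map_to_tactile` function are all assumptions.

```python
import math

def dft_magnitude(samples, sample_rate):
    """Naive DFT magnitude spectrum (O(N^2)); adequate for short frames."""
    n = len(samples)
    freqs, mags = [], []
    for k in range(n // 2):
        re = sum(samples[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(samples[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        freqs.append(k * sample_rate / n)
        mags.append(math.hypot(re, im))
    return freqs, mags

def extract_formants(freqs, mags, bands):
    """Pick the strongest frequency inside each formant search band."""
    formants = []
    for lo, hi in bands:
        idx = [i for i, f in enumerate(freqs) if lo <= f <= hi]
        formants.append(freqs[max(idx, key=lambda i: mags[i])])
    return formants

def map_to_tactile(f_aud, f_max=400.0):
    """Illustrative compressive map of audio Hz into the 0-400 Hz tactile range."""
    return f_max * (1.0 - math.exp(-f_aud / 1000.0))

# Crude /a:/-like frame synthesized from its three formant frequencies (FIG. 1).
SR = 8000
frame = [sum(math.sin(2 * math.pi * f * t / SR) for f in (615, 1230, 2470))
         for t in range(512)]
freqs, mags = dft_magnitude(frame, SR)
formants = extract_formants(freqs, mags, [(200, 800), (800, 2000), (2000, 3600)])
tactile = [map_to_tactile(f) for f in formants]
```

With a 512-sample frame at 8 kHz the bin spacing is 15.625 Hz, so the detected formants land within one bin of 615, 1230, and 2470 Hz, and the mapped tactile frequencies all fall inside the perceivable low-frequency interval.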
  • a tactile applicator is provided.
  • the tactile applicator includes an elastomer body stimulated
  • an apparatus for enabling comprehension of speech by tactile stimulation by executing the afore-described method includes a sound receiver, a filter, and a converter.
  • the sound receiver is for receiving an audio input.
  • the filter is for noise reduction for the audio input.
  • the converter is configured to: receive the audio input signal from the filter; perform an analog-to-digital conversion on the audio input signal; perform a Discrete Fourier Transformation on the audio input signal to generate a frequency domain representation of the digital audio input signal; extract one or more key features of the audio input signal based on a frequency band in the frequency domain representation, wherein the key features contain frequencies that are representative of at least one vowel; and convert the key feature frequencies to one or more tactile frequencies, comprising performing a mapping process on the key feature frequencies to shift or map them into one or more tactile frequencies, such that the mapping process transfers the frequencies from a high-frequency interval to a low-frequency interval, wherein the low-frequency interval is within a perceivable range of tactile frequencies.
  • the extracted features comprise at least two frequencies; and, for each of the extracted features, two corresponding tactile frequencies are generated.
  • frequencies below a first frequency point (low frequency threshold) and beyond a second frequency point higher than the first frequency point (high frequency threshold) are filtered out.
  • this is performed by analogue means (i.e., an analogue band-pass filter).
  • this is performed by digital means by setting the signal value to zero through a digital bandpass filter.
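The digital variant (setting out-of-band signal values to zero) can be illustrated with a naive DFT in pure Python. The sample rate, band edges, and test tones below are arbitrary choices for the sketch:

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def digital_bandpass(x, sample_rate, f_lo, f_hi):
    """Zero every DFT bin outside [f_lo, f_hi] (mirrored for the
    negative-frequency half) and transform back to the time domain."""
    n = len(x)
    X = dft(x)
    for k in range(n):
        f = k * sample_rate / n
        if f > sample_rate / 2:
            f = sample_rate - f            # negative-frequency mirror
        if not (f_lo <= f <= f_hi):
            X[k] = 0
    return idft(X)

# Demo: a 100 Hz + 1000 Hz mixture; a 500-1500 Hz passband keeps only 1000 Hz.
SR = 4000
x = [math.sin(2 * math.pi * 100 * t / SR) + math.sin(2 * math.pi * 1000 * t / SR)
     for t in range(200)]
y = digital_bandpass(x, SR, 500.0, 1500.0)
ref = [math.sin(2 * math.pi * 1000 * t / SR) for t in range(200)]
```

Because both tones sit exactly on DFT bins here, the filtered output matches the pure 1000 Hz reference to within floating-point error; a practical implementation would use an FFT and smoother band edges.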
  • FIG. 1 illustrates the principles applied by embodiments of the present invention in which peak frequencies in the audio domain that represent a vowel are converted to tactile stimulation;
  • FIG. 2 is an illustration of an example of the power spectrum of a vowel recorded by a male voice
  • FIG. 3 is an illustration of a typical spectrogram and power spectrum of a speech segment
  • FIG. 4 depicts a logical block diagram of a device for converting speech to tactile stimulation for recognition according to one embodiment of the present invention
  • FIG. 5 depicts a logical block diagram of a signal processing architecture of a method for converting speech to tactile stimulation for recognition according to one embodiment of the present invention
  • FIG. 6 is an illustration of the band-pass filter used in one embodiment of the present invention.
  • FIG. 7 illustrates the principle of frequency shifting
  • FIG. 8 is an example of frequency shifting according to one embodiment of the present invention.
  • FIG. 9A and FIG. 9B illustrate the sound-to-tactile frequency mapping curves according to one embodiment of the present invention.
  • FIG. 10 is a schematic diagram of a Tactile Stimulation Body (TSB) according to one embodiment of the present invention.
  • the present invention refers to apparatuses and methods that can convert sound to tactile stimulation, in real time, to enable deaf people to perceive and distinguish different sounds so that, with training they can understand the meaning of speech by utilizing tactile sensory input in lieu of auditory sensory input.
  • FIG. 1 illustrates the principles applied by embodiments of the present invention in which peak frequencies in the audio domain that represent a vowel are converted to tactile stimulation.
  • Each vowel is characterized by certain peak frequencies.
  • the vowel “/a:/” as in [FATHER] has three peak frequencies 101, 102, 103 at approximately 615Hz, 1230Hz and 2470Hz.
  • these peak frequencies are transformed into tactile stimulation that corresponds to equivalent frequencies and resonance nodes that can be perceived by touch and notably the Pacinian corpuscle, which serves as the mechanoreceptor that is primarily responsible for the ability of humans to feel vibrations.
  • vowel phonemes can be well distinguished from consonant phonemes.
  • Vowel phonemes can be represented by a superposition of sinusoidal functions for the formant frequencies. Generally, there are 2, 3, or 4 formant frequencies.
  • a static acoustic description can be given in the two-dimensional formant plane (F1 × F2), three-dimensional formant space (F1 × F2 × F3), or potentially the four-dimensional formant hypercube (F1 × F2 × F3 × F4), where F1, …, Fn represent the formant frequencies of a vowel phoneme.
  • Formants, or formant frequencies, are essentially peak frequencies 101, 102, 103 that are generated during speech, specifically when speaking vowel phonemes. Although the formant frequencies do not always have the highest energy in the vowel frequency spectrum, they are essential for the perception of vowel sounds. Each of the formants corresponds to a resonant frequency within the vocal tract. Any dimension of the resonating chamber (width, length) has an impact on the frequencies we can produce, and these resonances may change depending on movements of the tongue, lips, etc. Harmonics are visible on the spectrum: each of the peaks represents a harmonic. Harmonics are equally spaced frequency components that are integer multiples of a fundamental frequency, which is the repetition rate of the glottal source.
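The relationship between harmonics and formants can be made concrete with a few lines of arithmetic. The 120 Hz fundamental below is an assumed, illustrative male-voice value; the formants are the /a:/ values from the FIG. 1 example:

```python
# Harmonics are equally spaced integer multiples of the fundamental frequency.
f0 = 120.0                                 # assumed glottal fundamental (Hz)
formants = [615.0, 1230.0, 2470.0]         # /a:/ formants from the example above
harmonics = [k * f0 for k in range(1, 31)]

# A formant peak need not coincide exactly with a harmonic; it sits near one.
nearest = [min(harmonics, key=lambda h: abs(h - f)) for f in formants]
```

Here the formants land near, but not exactly on, the 5th, 10th, and 21st harmonics, which is why formant peaks and harmonic peaks are distinct features in the power spectrum.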
  • the power spectrum of the vowel “/ɜ:/” has three formant frequencies 101e, 102e, 103e. Around each formant frequency there are harmonics.
  • the first formant frequency of “/ɜ:/” 101e has three harmonics 104e, 105e, 106e.
  • FIG. 3 is an illustration of a typical spectrogram and power spectrum of a speech segment.
  • the spectrogram shows the frequency energy through time.
  • the denser areas 107 are where the most energy dense frequencies are located.
  • the denser areas 107 represent the formant frequencies of vowels.
  • the less dense areas 108, where there is very little energy, represent consonants. Consonants are much more complex.
  • the apparatuses and methods in accordance with the various embodiments of the present invention detect and convert vowels while ignoring consonants. This allows for a simplified algorithm and simpler implementation in devices or methods of the present invention.
  • Audio sample segmentation, in some embodiments, is done using the overlap-add method. It segments audio signals into M-length windows, then implements a Fast Fourier Transform (FFT) with a size of N > M samples, followed by the required processing, i.e., sound-to-tactile. The windows are stitched back together in a buffer to produce the output. This process is done irrespective of the sound sample (i.e., it can be speech, music, a car roar, bird song, etc.).
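The overlap-add segmentation can be sketched as follows. The window length, 50% hop, and the identity `process` stand-in (where an FFT, sound-to-tactile conversion, and inverse FFT would go) are illustrative choices:

```python
import math

def hann(M):
    # Periodic Hann window: with 50% overlap, overlapping windows sum to 1.
    return [0.5 - 0.5 * math.cos(2 * math.pi * t / M) for t in range(M)]

def overlap_add(x, M, process):
    """Slice x into M-sample Hann-windowed frames (hop M//2), apply `process`
    to each frame (e.g. FFT -> sound-to-tactile -> inverse FFT), and stitch
    the frames back together in an output buffer."""
    hop = M // 2
    w = hann(M)
    out = [0.0] * (len(x) + M)
    for start in range(0, len(x), hop):
        frame = [x[start + t] * w[t] if start + t < len(x) else 0.0
                 for t in range(M)]
        frame = process(frame)
        for t in range(M):
            out[start + t] += frame[t]
    return out[:len(x)]

# With an identity `process`, interior samples are reconstructed exactly.
x = [math.sin(0.05 * t) + 0.3 * math.cos(0.11 * t) for t in range(256)]
y = overlap_add(x, 64, lambda f: f)
```

The periodic Hann window satisfies the constant-overlap-add property at 50% hop, so away from the signal edges the stitched output reproduces the input sample for sample.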
  • the segmentation of phonemes during speech is done by an algorithm that detects vowels’ formants.
  • the algorithm detects the first three formant frequencies, by using a formant detection method following the Discrete Fourier Transform.
  • formant detection can be done using analysis of linear predictive coding (LPC) coefficients, Kalman filter or other methods/manners.
  • a simple peak detection algorithm detects frequencies that have the highest power in dB from the power spectrum. More specifically, it searches for and detects the formants in three specific ranges; for example, the first formant between 200Hz and 800Hz, the second between 800Hz and 2500Hz, and the third between 2000Hz and 3600Hz. Most of the other frequencies are filtered out by using bandpass filters.
  • the algorithm searches the formants only in the dense frequency areas 107; and within these areas, it tries to find the three formants (i.e., the peak frequencies 101, 102, 103) in different ranges. If there is no dense frequency area, it then assumes that there is either a consonant or silence and filters out these low-power frequencies. This can be achieved by a Fast-Fourier-Transform (FFT) denoise filter, which filters out low power frequencies.
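The peak search with a low-power (denoise) floor might look like the sketch below. The dB values, the -40 dB floor, and the synthetic spectrum are invented for illustration; the search ranges follow the example in the text:

```python
def detect_formants(freqs, power_db, ranges, floor_db=-40.0):
    """Highest-peak search in each formant range; a frame with no peak above
    the power floor in some range is treated as consonant/silence (None)."""
    peaks = []
    for lo, hi in ranges:
        band = [(p, f) for f, p in zip(freqs, power_db) if lo <= f <= hi]
        if band:
            p, f = max(band)               # strongest bin in this range
            if p >= floor_db:              # denoise: ignore low-power bins
                peaks.append(f)
    return peaks if len(peaks) == len(ranges) else None

# Synthetic power spectrum (10 Hz bins): peaks near 610, 1230, 2470 Hz
# above a -60 dB floor, mimicking the dense areas 107 of a vowel.
freqs = [10.0 * i for i in range(400)]
power = [-60.0] * 400
for i, p in ((61, -10.0), (123, -15.0), (247, -20.0)):
    power[i] = p
bands = [(200, 800), (800, 2500), (2000, 3600)]
peaks = detect_formants(freqs, power, bands)
silence = detect_formants(freqs, [-60.0] * 400, bands)
```

A flat low-power spectrum returns `None`, which corresponds to the consonant-or-silence case that the algorithm filters out.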
  • process steps are designed to simplify the complexity of the conversion or procedure, thereby reducing power consumption and further enhancing the computational efficiency for the apparatuses or methods (i.e., reducing the power consumption of executing steps or processes on the computer or efficiently improving the performance of executing steps or processes on the computer).
  • FIG. 4 depicts a schematic drawing of a device 200 for converting speech to tactile stimulation for recognition according to one embodiment of the present invention.
  • the device 200 includes sound receiver 210, a filter 300, a converter 400, and a tactile applicator 500.
  • the sound receiver 210 can receive sound input and transmit it through the filter 300 for noise reduction. Then, after the noise reduction, the sound input is converted via the converter 400 to a signal suitable for the tactile applicator 500.
  • the sound receiver 210 includes a microphone.
  • the sound input received by the sound receiver 210 is human speech which can be obtained via the microphone.
  • input is sensed through Electret Condenser Microphones (ECMs) and/or Micro-Electro-Mechanical Systems (MEMS) microphones (or other types of microphones).
  • the key element about speech recognition is the detection of vowels, and the conversion of vowel formant frequencies into appropriate tactile stimulation frequencies.
  • filtering is first used to remove some noise and improve the input signal quality.
  • a variety of filtering methods can be used by the filter 300 for noise reduction, such as: active noise reduction (ANR)/active noise cancelling (ANC) techniques, finite impulse response (FIR), low-pass, high-pass or band-pass filters, averaging filters or others.
  • pre-noise reduction can also be implemented in the hardware (i.e., analog filters set in the sound receiver 210) before the signal is converted to digital.
  • An analog low-pass filter can be used.
  • the filter 300 is an analogue low-pass filter that removes the general random noise caused by a variety of sources.
  • the converter 400 is an electronic device including a processor, electronic peripherals such as amplifiers and audio codecs, memory, and a battery.
  • the electronic device implements a signal-processing algorithm that accomplishes the detection of formant frequencies, their conversion into the tactile spectrum, and subsequently outputs electrical signals representing the tactile frequencies that stimulate or trigger the tactile applicator 500.
  • the tactile applicator 500 is a device configured to generate tactile stimulation characterized by a plurality of vibration nodes spatially distributed on a surface, aiming to provide physical feedback information to users.
  • the tactile applicator 500 can include one or more actuators such as: an Eccentric Rotating Mass (ERM) motor, a Linear Resonant Actuator (LRA), Electro- Active Polymer (EAP) actuator, Voice Coil Actuator, audio exciter, a speaker, a piezoelectric device, and/or any other form of vibratory element.
  • the vibratory elements can be coupled to the skin of a user, by an interface which is part of the tactile applicator 500.
  • the conversion of audio frequencies to frequencies for stimulating the tactile applicator 500 is based on the correspondence of audio and Pacinian corpuscle sensitivities.
  • FIG. 5 depicts a block diagram of a signal processing architecture of a method for converting speech to tactile stimulation for speech recognition according to one embodiment of the present invention.
  • Such a method, implemented by electronics that convert audio frequencies to tactile vibrations, is essentially a “vocoder” that is specifically tailored for the considered application.
  • a vocoder is a category of speech coding that analyses and synthesizes the human voice signal for audio data compression.
  • the process steps involve performing interleaving computations in the time domain and the frequency domain.
  • the sound receiver 210 generates an input sound signal 201. While operating in the time domain 300, the input sound signal 201 is converted from analog to digital via the A/D converter 301. Analog filters may be applied to the input sound signal 201 before the analog-to-digital conversion.
  • the signal is passed to a noise reduction filter 302 where the filtering takes place.
  • digital filtering is used (i.e., in addition to the analog filters) to further remove noise and improve the input signal quality.
  • the filtering takes place via a discrete Kalman Filter.
  • filtering methods can be used for noise reduction such as active noise reduction/cancelation techniques and low-pass, high-pass, or band-pass filters.
  • Pre-noise reduction can be implemented in the hardware with standard analogue filters before the analog signal gets converted to digital.
  • the digitized and filtered input sound signal 201 is divided into several parts of similar size that are called “windows 303”. Windowing involves the slicing of the audio waveform into sliding frames. Two alternatives are the Hamming window and the Hanning window. A sinusoidal waveform will be chopped off using these windows. For the Hamming and Hanning windows, the amplitude drops off near the edge. In one embodiment, the Hanning window is used during the conversion of speech to tactile stimulation.
  • Computational DFT is applied to extract information from the windows 303 in frequency domain.
  • DFT 420 is applied for each window 303 and the estimated tone with its individual frequency is approximated.
  • the DFT 420 used is an FFT during the conversion of speech to tactile stimulation.
  • features can be extracted by the feature extraction processor 430.
  • the signal contains the formant information contaminated by environment noise, where the low pass filter removes most of the noise.
  • the feature extraction processor 430 is based on the identification of a few high-power frequencies in phonemes, as shown in FIG. 2 which shows an example of the peaks detected for the vowel “/ɜ:/.” These frequencies are called formants (i.e., formant frequencies 101e, 102e, and 103e). Starting from the lower frequency these are F1, F2, and F3 respectively. With these three formants, it is possible to identify different vowels.
  • the feature extraction processor 430 works in the following way: (1) detecting the three peaks that represent the formant frequencies by using a highest-peak detection algorithm, which is focused on specific ranges where the vowel phoneme formant frequencies lie;
  • the band-pass filters 440 are applied for further processing.
  • three band-pass filters 440 are provided or used, one for each formant, to reduce the band of the output frequencies.
  • the filter is described by the center frequency, the passband, and the quality factor (Q) which determines how narrow the curve is.
  • the center frequency is set to correspond to each formant, and the bandwidth of each band-pass filter 440 is determined by the formant and is quite narrow, to effectively isolate the formants without picking-up non-formant harmonics.
  • FIG. 6 is an illustration of output of a band-pass filter 440 according to one embodiment of the present invention. According to the example illustrated by FIG. 6, if the formant F is 500Hz, the bandwidth provided by the band-pass filter 440 can range from 490Hz (f1) to 510Hz (f2), so that it only allows these frequencies to pass.
  • the frequency-shifting processor 450 is applied.
  • the frequency-shifting processor 450 is configured to transfer the high- frequency samples to low-frequency samples while keeping all useful information.
  • the frequency shifting is implemented in real-time (RT), while the latency is not perceivable by human senses (<5ms).
  • frequency shifting is an essential process wherein a processing function (i.e., which is provided by the frequency-shifting processor 450) shifts or “maps” the audio frequencies into a subset of the range of frequencies that can be detected by touch.
  • FIG. 7 is an example that illustrates the principle of frequency shifting.
  • a wider frequency range (acoustic) can be mapped into a narrower one (tactile) that enables perception of tactile stimuli.
  • Many alternatives are available for the shifting of frequencies.
  • the example illustrated by FIG. 7 uses a sigmoid exponential function of the audio frequency, where f_tct is the tactile frequency, f_aud is the audio frequency, and A, C1, C2 are constants.
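The exact sigmoid expression and constants are not reproduced in this text, so the form below is an assumption: a standard logistic curve with illustrative constants chosen so that the /a:/ formants land near the FIG. 8 values (615 Hz maps to roughly 252 Hz, 1230 Hz to roughly 342 Hz, 2470 Hz to roughly 394 Hz):

```python
import math

# Assumed sigmoid form; A is the tactile-range asymptote, C1 and C2 are
# illustrative shape constants, not the patent's actual values.
A, C1, C2 = 400.0, 2.0, 0.002

def audio_to_tactile(f_aud):
    """Compress an audio frequency (Hz) into the tactile range below A Hz."""
    return A / (1.0 + C1 * math.exp(-C2 * f_aud))
```

The map is monotonic and bounded above by A, so a wide acoustic interval is squeezed into the narrow tactile interval while preserving the ordering of the formants.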
  • the data is generated based on audio frequencies and appropriate tactile representation frequencies for five long vowels: /i:/ /u:/ /a:/ /o:/ /ɜ:/.
  • a variety of shifting/compression functions can be used such as linear, sinusoidal, logarithmic or a polynomial.
  • another low pass filter can be applied (with a cutoff tactile frequency at 400Hz) to limit the output frequency to the tactile sensitive range (i.e., 50-400Hz).
  • the signal is processed via the Inverse Discrete Fourier Transformation (IDFT) processor 460.
  • the output signal may need to be reconstructed, with the corresponding modified (compressed/ shifted) frequencies and modified amplitudes that are illustrated by means of an example as per FIG. 8.
  • the IDFT processor 460 can reconstruct the signal back to the time domain 300 from the frequency domain 410.
  • FIG. 8 is an example of frequency shifting according to one embodiment of the present invention, showing the final compressed signal spectrum and the original input signal spectrum of the phoneme /a:/. As illustrated by FIG. 8, in the output signal, it can be observed that the frequencies are compressed to within the tactile frequency range of 0 to 400Hz. The first formant is shifted at 260Hz, while the second at 350Hz and the third formant at 400Hz.
  • the frequency compression or frequency shifting can be done according to a sound-to-tactile frequency mapping based on specific mapping curves that produce two tactile frequencies for every audio frequency in the range.
  • FIG. 9A and FIG. 9B illustrate examples of such sound-to-tactile frequency mapping curve.
  • FIG. 9A shows a symmetric mapping according to one embodiment
  • FIG. 9B shows an asymmetric mapping according to an alternative embodiment.
  • a key feature is the generation of two tactile frequencies for every audio frequency.
  • By having two tactile frequencies for every audio frequency, the tactile signal’s identity increases. This means that the ability of a user to distinguish and recognize a uniquely identifiable tactile stimulus that corresponds to a phoneme is enhanced.
  • the tactile pattern is a richer one, and enables an easier distinction of tactile patterns by a user.
  • the curve is skewed so that the tactile frequencies on the upper range are more densely populated around the frequency of maximum sensitivity of the Pacinian corpuscle, which according to literature is around 250Hz.
  • the mapping of the inverted bell-shaped curves can be implemented in several ways. In one embodiment, the conversion is implemented by using a look-up table. In another embodiment, the mapping is done by using inverted raised cosine functions or by constructing fitted polynomial functions. In the examples shown in FIG. 9A, the audio frequencies are mapped to within the 0 to 550Hz tactile frequency range. 0-400Hz is a preferred range of tactile frequencies because 400Hz is below the audible range. This mapping is illustrated in FIG. 9B. The curve in FIG. 9B shows that the curves can be non-symmetrical.
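One way to realize a two-branch mapping of this kind (the symmetric, FIG. 9A-style case) is a pair of raised-cosine curves that open away from a mid-point as the audio frequency grows. The range constants below are taken from the text; the curve shape and `F_MAX` are illustrative assumptions, and a skewed variant could instead cluster the upper branch around the roughly 250 Hz Pacinian sensitivity peak:

```python
import math

T_LOW, T_HIGH = 50.0, 400.0    # tactile-sensitive range cited in the text
F_MAX = 3600.0                 # assumed top of the audio search range

def tactile_pair(f_aud):
    """Two tactile frequencies per audio frequency: a lower branch falling
    toward T_LOW and an upper branch rising toward T_HIGH."""
    u = 0.5 * (1.0 - math.cos(math.pi * f_aud / F_MAX))   # raised cosine, 0..1
    mid = 0.5 * (T_LOW + T_HIGH)
    return mid - (mid - T_LOW) * u, mid + (T_HIGH - mid) * u

# Six distinct tactile waveforms for the three /a:/ formants.
six = sorted(f for form in (615.0, 1230.0, 2470.0) for f in tactile_pair(form))
```

Each formant yields a distinct pair, so a three-formant phoneme excites six separate tactile frequencies, matching the six-waveform behavior described below.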
  • each excitation frequency generates a different waveform on the applicator surface.
  • a combination of two excitation frequencies generates a combination of two waveforms, providing a more complex surface with improved resolutions for the distinction and recognition of tactile stimulation patterns.
  • each phoneme consists of three formant frequencies and some of these frequencies are common among phonemes.
  • Each identified audio frequency generates two tactile frequencies within the tactile frequency range. Therefore, for each phoneme a combination of six distinct waveforms is produced. This is like having six actuators producing different signals. In this way, there is a greater distinction between phonemes regarding the spatial excitation of the skin.
  • FIG. 10 is a schematic illustration of a Tactile Stimulation Body (TSB) 900 according to one embodiment of the present invention.
  • the TSB 900 is placed in contact with the user’s skin.
  • the process illustrated by FIG. 5 can be applied to a TSB 900 having an elastic body in a frustoconical shape.
  • the TSB 900 includes an external housing 901, a miniature voice-coil actuator 902 with a vibrating coil, a frustoconical housing 903, and an elastic body 904 in a conical shape.
  • the overall structure of the TSB 900 can be supported by the external housing 901.
  • the elastic body 904 can include a silicone elastomer cast inside a frustoconical housing 903 made of either paper or plastic.
  • the conical housing 903 may have the same structure as the cone of the miniature VCA 902. This frustoconical housing 903 is excited by the vibrating coil of the miniature VCA 902 from beneath.
  • Referring back to FIG. 5, after the processing by the IDFT processor 460, processing steps in the time domain 300 identical or similar to those of the A/D converter 301 and windows 303 are performed, but inverted. The processing steps are implemented by the windows 470 and the digital-to-analog (D/A) converter 480 for D/A conversion, in sequence. After the D/A conversion by the D/A converter 480, the processing steps illustrated by FIG. 5 generate at least one vibration signal for the tactile applicator 500 (i.e., the tactile applicator 500 as shown in FIG. 4).
  • the TSB 900 of FIG. 10 can be coupled with the tactile applicator 500 so the vibration signal can be fed into the TSB 900 for exciting the elastic body 904 through the miniature VCA 902, to achieve tactile stimulation for recognition.
  • the functional units and modules of the apparatuses and methods in accordance with the embodiments disclosed herein may be implemented using computing devices, computer processors, or electronic circuitries including but not limited to application specific integrated circuits (ASIC), field programmable gate arrays (FPGA), microcontrollers, and other programmable logic devices configured or programmed according to the teachings of the present disclosure.
  • ASIC application specific integrated circuits
  • FPGA field programmable gate arrays
  • microcontrollers and other programmable logic devices configured or programmed according to the teachings of the present disclosure.
  • Computer instructions or software codes running in the computing devices, computer processors, or programmable logic devices can readily be prepared by practitioners skilled in the software or electronic art based on the teachings of the present disclosure.
  • All or portions of the methods in accordance to the embodiments may be executed in one or more computing devices including server computers, personal computers, laptop computers, mobile computing devices such as smartphones and tablet computers.
  • the embodiments may include computer storage media, transient and non-transient memory devices having computer instructions or software codes stored therein, which can be used to program or configure the computing devices, computer processors, or electronic circuitries to perform any of the processes of the present invention.
  • the storage media, transient and non-transient memory devices can include, but are not limited to, floppy disks, optical discs, Blu-ray Disc, DVD, CD-ROMs, and magneto-optical disks, ROMs, RAMs, flash memory devices, or any type of media or devices suitable for storing instructions, codes, and/or data.
  • Each of the functional units and modules in accordance with various embodiments also may be implemented in distributed computing environments and/or Cloud computing environments, wherein the whole or portions of machine instructions are executed in distributed fashion by one or more processing devices interconnected by a communication network, such as an intranet, Wide Area Network (WAN), Local Area Network (LAN), the Internet, and other forms of data transmission medium.
  • a communication network such as an intranet, Wide Area Network (WAN), Local Area Network (LAN), the Internet, and other forms of data transmission medium.


Abstract

A method for enabling comprehension of speech by tactile stimulation is provided. The method includes the following steps: receiving an audio input; performing an analog-to-digital conversion on the audio input; performing a DFT to convert the audio input to the frequency domain; extracting key features of the audio input based on a band of the audio input, wherein the key features contain frequencies that are representative of at least one vowel; and converting the frequencies of the extracted key features to a range of tactile frequencies, which comprises performing a mapping process that shifts or maps those frequencies into a subset, such that the mapping process transfers the frequencies from a high-frequency interval to a low-frequency interval, wherein the low-frequency interval is a perceivable range of tactile frequencies.

Description

A METHOD FOR SPEECH RECOGNITION USING TACTILE STIMULATION
Inventor: Evagoras XYDAS
Technical Field:
[0001] The present invention relates to speech recognition, and in particular to devices and methods for converting speech to tactile stimulation for recognition.
Background:
[0002] Hearing loss is a problem that affects millions of people. There are different degrees of hearing loss, ranging from complete deafness from birth to diminished hearing from aging. Hearing aids are sufficient for many people with diminished hearing, but in more severe cases, such as for people who are fully deaf, hearing aids are usually not able to help them perceive speech or music. People who are both deaf and blind have even greater difficulty in communication.
[0003] Deaf people can benefit from tactile stimuli that are associated with sound. Tactile stimuli may be conveniently provided on a particular skin area, for example on the hand. The use of tactile devices as a substitute for audition depends on the efficacy of phonemic recognition via tactile signals alone. Several studies verify that vibrotactile aids can enable deaf people to detect segmental and suprasegmental features of speech, and to discriminate common environmental sounds. Existing vibrotactile aids provide very useful information about speech and non-speech stimuli. However, this is not enough. It is necessary to provide the means for accurate speech recognition so that deaf people are able to understand the meaning of what is being said. There is a need for a means of sound-to-tactile stimulation that enables the distinction of vowels.
[0004] There are devices known to provide haptic communication through the generation of vibrations. Many of these devices utilize cutaneous actuators to transmit vibrations. For example, U.S. Patent No. 10,222,864 B2 teaches a method in which cutaneous actuators are operated to enhance haptic communication through a patch of a receiving user's skin by using constructive or destructive interference between haptic outputs. The cutaneous actuators, spaced apart from each other on a patch of skin, generate haptic outputs such that the generated outputs constructively or destructively interfere on the patch of skin.
[0005] However, techniques like the ones outlined above exhibit two major disadvantages: (a) inadequate resolution and, in turn, a low capability to distinguish speech; and (b) insufficient usability consideration for the user to learn to distinguish tactile stimuli in a consistent way that corresponds to specific phonemes.
[0006] Therefore, there is a need for methods that can convert speech to tactile stimulation with sufficient tactile resolution to allow the users to easily distinguish different phonemes and consistently understand the intended meaning of speech.
Summary of Invention:
[0007] It is an objective of the present invention to provide systems and methods to address the aforementioned shortcomings and unmet needs in the state of the art.
[0008] In the present invention, methods are provided for sound to tactile stimulation that enables deaf people to perceive and distinguish different sounds, so that, with training, they can understand part of the meaning of speech and be able to utilize tactile sensory input as a substitute to auditory sensory input.
[0009] Further, the provided methods detect speech sounds and convert those sounds into haptic stimuli in real time. The spectrum of sound frequencies is represented through a corresponding spectrum of tactile stimulation by using a vibrating surface that has a plurality of modal nodes. The sound frequencies corresponding to vowel sounds of the detected sound are extracted and utilized to provide tactile stimulus. As such, this invention provides tactile stimuli that correspond to both the spatial and frequency domains through vibration nodes on the surface of a membrane that is in contact with the skin of the user.
[0010] In accordance with a first aspect of the present invention, a method for enabling comprehension of speech by tactile stimulation is provided. The method includes the following steps: receiving an audio input signal; performing an analog-to-digital conversion on the audio input signal; performing a Discrete Fourier Transformation on the digital audio input signal to generate a frequency domain representation of the digital audio input signal; extracting one or more key features of the audio input signal based on a frequency band in the frequency domain representation, wherein the key features contain frequencies that are representative of at least one vowel; and converting the key feature frequencies to one or more tactile frequencies, which comprises performing a mapping process on the key feature frequencies to shift or map them into one or more tactile frequencies, such that the mapping process transfers the frequencies from a high-frequency interval to a low-frequency interval, wherein the low-frequency interval is within a perceivable range of tactile frequencies.
[0011] In accordance with a second aspect of the present invention, a tactile applicator is provided. The tactile applicator includes an elastomer body stimulated by at least one vibration signal of the tactile frequencies generated by the afore-described method.
[0012] In accordance with a third aspect of the present invention, an apparatus for enabling comprehension of speech by tactile stimulation by executing the afore-described method is provided. The apparatus includes a sound receiver, a filter, and a converter. The sound receiver is for receiving an audio input. The filter is for noise reduction for the audio input. The converter is configured to: receive the audio input signal from the filter; perform an analog-to-digital conversion on the audio input signal; perform a Discrete Fourier Transformation on the audio input signal to generate a frequency domain representation of the digital audio input signal; extract one or more key features of the audio input signal based on a frequency band in the frequency domain representation, wherein the key features contain frequencies that are representative of at least one vowel; and convert the key feature frequencies to one or more tactile frequencies, comprising performing a mapping process on the key feature frequencies to shift or map them into one or more tactile frequencies, such that the mapping process transfers the frequencies from a high-frequency interval to a low-frequency interval, wherein the low-frequency interval is within a perceivable range of tactile frequencies.
[0013] In some embodiments, the extracted features comprise at least two frequencies; and, for each of the extracted features, two corresponding tactile frequencies are generated. In some embodiments, frequencies below a first frequency point (a low frequency threshold) and beyond a second frequency point higher than the first frequency point (a high frequency threshold) are filtered out. In one such embodiment, this is performed by analogue means (i.e., an analogue band-pass filter). In another embodiment, this is performed by digital means, by setting the signal value to zero through a digital bandpass filter.
Brief Description of Drawings:
[0014] Embodiments of the invention are described in more details hereinafter with reference to the drawings, in which:
[0015] FIG. 1 illustrates the principles applied by embodiments of the present invention in which peak frequencies in the audio domain that represent a vowel are converted to tactile stimulation;
[0016] FIG. 2 is an illustration of an example of the power spectrum of a vowel recorded by a male voice;
[0017] FIG. 3 is an illustration of a typical spectrogram and power spectrum of a speech segment;
[0018] FIG. 4 depicts a logical block diagram of a device for converting speech to tactile stimulation for recognition according to one embodiment of the present invention;
[0019] FIG. 5 depicts a logical block diagram of a signal processing architecture of a method for converting speech to tactile stimulation for recognition according to one embodiment of the present invention;
[0020] FIG. 6 is an illustration of the band-pass filter used in one embodiment of the present invention;
[0021] FIG. 7 illustrates the principle of frequency shifting;
[0022] FIG. 8 is an example of frequency shifting according to one embodiment of the present invention;
[0023] FIG. 9A and FIG. 9B illustrate the sound-to-tactile frequency mapping curves according to one embodiment of the present invention; and
[0024] FIG. 10 is a schematic diagram of a Tactile Stimulation Body (TSB) according to one embodiment of the present invention.
Detailed Description of the Invention:
[0025] In the following description, apparatuses and methods for converting speech to tactile stimulation for recognition and the like are set forth as preferred examples. It will be apparent to those skilled in the art that modifications, including additions and/or substitutions, may be made without departing from the scope and spirit of the invention. Specific details may be omitted so as not to obscure the invention; however, the disclosure is written to enable one skilled in the art to practice the teachings herein without undue experimentation.
[0026] The present invention refers to apparatuses and methods that can convert sound to tactile stimulation, in real time, to enable deaf people to perceive and distinguish different sounds so that, with training, they can understand the meaning of speech by utilizing tactile sensory input in lieu of auditory sensory input.
[0027] FIG. 1 illustrates the principles applied by embodiments of the present invention in which peak frequencies in the audio domain that represent a vowel are converted to tactile stimulation. Each vowel is characterized by certain peak frequencies. In the example shown in FIG. 1, the vowel “/a:/” as in [FATHER] has three peak frequencies 101, 102, 103 at approximately 615Hz, 1230Hz and 2470Hz. In accordance with the various embodiments of the present invention, these peak frequencies are transformed into tactile stimulation that corresponds to equivalent frequencies and resonance nodes that can be perceived by touch and notably the Pacinian corpuscle, which serves as the mechanoreceptor that is primarily responsible for the ability of humans to feel vibrations.
[0028] In this regard, vowel phonemes can be well distinguished from consonant phonemes. Vowel phonemes can be represented by a superposition of sinusoidal functions for the formant frequencies. Generally, there are 2, 3, or 4 formant frequencies. A static acoustic description can be given in the two-dimensional formant plane (F1 × F2), the three-dimensional formant space (F1 × F2 × F3), or potentially the four-dimensional formant hypercube (F1 × F2 × F3 × F4), where F1, ..., Fn represent the formant frequencies of a vowel phoneme.
[0029] Formants, or formant frequencies, are essentially the peak frequencies 101, 102, 103 that are generated during speech, specifically when speaking vowel phonemes. Although the formant frequencies do not always have the highest energy in the vowel frequency spectrum, they are essential for the perception of vowel sounds. Each of the formants corresponds to a resonant frequency within the vocal tract. Any dimension of the resonating chamber (width, length) will have an impact on the frequencies which we can produce. These resonances may change depending on movements of our tongue, lips, etc. Harmonics are visible on the spectrum, and each of the peaks represents a harmonic. Harmonics are equally spaced frequency components that are integer multiples of a fundamental frequency, which is the repetition rate of the glottal source.
[0030] As illustrated in FIG. 2, the power spectrum of the vowel "/ɜ:/" has three formant frequencies 101e, 102e, 103e. Around each formant frequency there are harmonics. For example, as illustrated in FIG. 2, the first formant frequency 101e of "/ɜ:/" has three harmonics 104e, 105e, 106e.
[0031] FIG. 3 is an illustration of a typical spectrogram and power spectrum of a speech segment. The spectrogram shows the frequency energy through time. The denser areas 107 are where the most energy-dense frequencies are located; these represent the formant frequencies of vowels. The less dense areas 108, where there is very little energy, represent consonants. Consonants are much more complex. In view of this, the apparatuses and methods in accordance with the various embodiments of the present invention detect and convert vowels while ignoring consonants. This allows for a simplified algorithm and a simpler implementation in the devices or methods of the present invention.
[0032] For the conversion of audio input into tactile stimulus, the devices or methods according to various embodiments of the present invention operate on successive segments of the audio input. Audio sample segmentation, in some embodiments, is done using the overlap-add method. It segments audio signals into M-length windows, then implements a Fast Fourier Transform (FFT) with a size of N ≥ M samples, followed by the required processing, i.e., sound-to-tactile conversion. The windows are stitched back together in a buffer to produce the output. This process is done irrespective of the sound sample (i.e., it can be speech, music, a car roar, bird song, etc.).
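The overlap-add segmentation described above can be sketched as follows. This is an illustrative sketch only: the window length M, FFT size N, hop size, and the choice of a Hann window are assumed values, and the sound-to-tactile processing step is left as an identity placeholder.

```python
import numpy as np

def overlap_add_process(signal, M=256, N=512, hop=128, process=lambda X: X):
    """Segment `signal` into M-sample Hann-windowed frames, apply an FFT of
    size N >= M, run `process` on each spectrum (identity here; this is
    where the sound-to-tactile step would go), and stitch the frames back
    together by overlap-add."""
    window = np.hanning(M)
    out = np.zeros(len(signal) + N)
    for start in range(0, len(signal) - M + 1, hop):
        frame = signal[start:start + M] * window
        X = np.fft.rfft(frame, n=N)           # zero-padded FFT, N >= M
        y = np.fft.irfft(process(X), n=N)     # back to the time domain
        out[start:start + N] += y             # overlap-add into the buffer
    return out[:len(signal)]

# With the identity `process` and a hop of M/2, the Hann-windowed frames
# overlap-add back to (approximately) the original signal.
t = np.arange(2048) / 16000.0
x = np.sin(2 * np.pi * 440.0 * t)
y = overlap_add_process(x)
```

The near-exact reconstruction with the identity placeholder is what allows the frequency-domain processing steps described below to be inserted between the forward and inverse transforms.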
[0033] According to one embodiment of the present invention, the segmentation of phonemes during speech is done by an algorithm that detects the vowels' formants. In a segmented audio sample, the algorithm detects the first three formant frequencies by using a formant detection method following the Discrete Fourier Transform. For example, formant detection can be done using analysis of linear predictive coding (LPC) coefficients, a Kalman filter, or other methods.
[0034] In one embodiment, a simple peak detection algorithm detects the frequencies that have the highest power in dB from the power spectrum. More specifically, it searches for and detects the formants in three specific ranges; for example, the first formant between 200Hz and 800Hz, the second between 800Hz and 2500Hz, and the third between 2000Hz and 3600Hz. Most of the other frequencies are filtered out by using bandpass filters.
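A minimal sketch of this range-restricted peak search, using the example ranges stated above; the function and variable names are our own, and the synthetic test spectrum (Gaussian peaks near the /a:/ formants of FIG. 1, with arbitrary amplitudes) is illustrative:

```python
import numpy as np

# Example formant search ranges from the text (Hz): F1, F2, F3.
FORMANT_RANGES = [(200, 800), (800, 2500), (2000, 3600)]

def detect_formants(power_db, freqs, ranges=FORMANT_RANGES):
    """Return the highest-power frequency inside each formant range."""
    formants = []
    for lo, hi in ranges:
        idx = np.where((freqs >= lo) & (freqs <= hi))[0]
        peak = idx[np.argmax(power_db[idx])]   # strongest bin in the range
        formants.append(float(freqs[peak]))
    return formants

# Synthetic power spectrum with peaks near the /a:/ formants of FIG. 1
# (615Hz, 1230Hz, 2470Hz); amplitudes are arbitrary.
freqs = np.arange(0.0, 4000.0, 5.0)
power = np.zeros_like(freqs)
for f, amp in zip((615.0, 1230.0, 2470.0), (40.0, 35.0, 30.0)):
    power += amp * np.exp(-((freqs - f) ** 2) / (2 * 30.0 ** 2))
f1, f2, f3 = detect_formants(power, freqs)
```

Restricting each search to its own range is what keeps F2 and F3 from latching onto the same peak where the ranges overlap.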
[0035] As illustrated in FIG. 3, according to one embodiment of the present invention, the algorithm searches for the formants only in the dense frequency areas 107; and within these areas, it tries to find the three formants (i.e., the peak frequencies 101, 102, 103) in different ranges. If there is no dense frequency area, it assumes that there is either a consonant or silence and filters out these low-power frequencies. This can be achieved by a Fast Fourier Transform (FFT) denoise filter, which filters out low-power frequencies. These processing steps are designed to reduce the complexity of the conversion procedure, thereby reducing power consumption and further enhancing computational efficiency for the apparatuses or methods.
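The FFT denoise step can be illustrated as follows; the threshold value and signal parameters are assumptions made for the sketch, not values from the disclosure. Bins whose power falls below the threshold are set to zero, which removes low-power (consonant or silence) energy while keeping the strong vowel components:

```python
import numpy as np

def fft_denoise(frame, power_threshold):
    """Zero every FFT bin whose power falls below `power_threshold`,
    keeping only the strong (formant-like) components."""
    X = np.fft.rfft(frame)
    power = np.abs(X) ** 2 / len(frame)
    X[power < power_threshold] = 0.0           # drop low-power frequencies
    return np.fft.irfft(X, n=len(frame))

fs = 8000
t = np.arange(1024) / fs
rng = np.random.default_rng(0)
pure = np.sin(2 * np.pi * 625.0 * t)           # 625Hz falls exactly on a bin
noisy = pure + 0.1 * rng.standard_normal(1024)
clean = fft_denoise(noisy, power_threshold=10.0)
```

The tone's bin carries far more power than any noise bin, so nearly all of the noise energy is discarded while the tone survives intact.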
[0036] FIG. 4 depicts a schematic drawing of a device 200 for converting speech to tactile stimulation for recognition according to one embodiment of the present invention. The device 200 includes a sound receiver 210, a filter 300, a converter 400, and a tactile applicator 500. Briefly, the sound receiver 210 can receive a sound input and transmit it through the filter 300 for noise reduction. Then, after the noise reduction, the sound input is converted via the converter 400 into a signal suitable for the tactile applicator 500.
[0037] In one embodiment, the sound receiver 210 includes a microphone. The sound input received by the sound receiver 210 is human speech which can be obtained via the microphone. In some variations, input is sensed through Electret Condenser Microphones (ECMs) and/or Micro-Electro-Mechanical Systems (MEMS) microphones (or other types of microphones).
[0038] The key element of speech recognition here is the detection of vowels, and the conversion of vowel formant frequencies into appropriate tactile stimulation frequencies.
[0039] After the analog-to-digital conversion by the sound receiver 210, filtering is first used to remove some noise and improve the input signal quality. A variety of filtering methods can be used by the filter 300 for noise reduction, such as active noise reduction (ANR)/active noise cancelling (ANC) techniques, finite impulse response (FIR) filters, low-pass, high-pass, or band-pass filters, averaging filters, or others. In one embodiment, pre-noise reduction can also be implemented in the hardware (i.e., analog filters set in the sound receiver 210) before the signal is converted to digital; an analog low-pass filter can be used. In various embodiments, the filter 300 is an analogue low-pass filter that removes the general random noise caused by a variety of sources.
[0040] In one embodiment, the converter 400 is an electronic device including a processor, electronic peripherals such as amplifiers and audio codecs, memory, and a battery. The electronic device implements a signal-processing algorithm that accomplishes the detection of formant frequencies and their conversion into the tactile spectrum, and subsequently outputs electrical signals representing the tactile frequencies that stimulate or trigger the tactile applicator 500. The tactile applicator 500 is a device configured to generate tactile stimulation characterized by a plurality of vibration nodes spatially distributed on a surface, aiming to provide physical feedback information to users. The tactile applicator 500 can include one or more actuators such as an Eccentric Rotating Mass (ERM) motor, a Linear Resonant Actuator (LRA), an Electro-Active Polymer (EAP) actuator, a Voice Coil Actuator, an audio exciter, a speaker, a piezoelectric device, and/or any other form of vibratory element. The vibratory elements can be coupled to the skin of a user by an interface which is part of the tactile applicator 500. In one embodiment, the conversion of audio frequencies to frequencies for stimulating the tactile applicator 500 is based on the correspondence of audio and Pacinian corpuscle sensitivities.
[0041] FIG. 5 depicts a block diagram of a signal processing architecture of a method for converting speech to tactile stimulation for speech recognition according to one embodiment of the present invention. Such a method, implemented by electronics that convert audio frequencies to tactile vibrations, is essentially a "vocoder" that is specifically tailored for the considered application. A vocoder is a category of speech coding that analyses and synthesizes the human voice signal for audio data compression.
[0042] The process steps involve performing interleaving computations in the time domain and the frequency domain. The sound receiver 210 generates an input sound signal 201. While operating in the time domain 300, the input sound signal 201 is converted from analog to digital via the A/D converter 301. Analog filters may be applied to the input sound signal 201 before the analog-to-digital conversion.
[0043] After the analog-to-digital conversion, the signal is passed to a noise reduction filter 302 where the filtering takes place. In one embodiment, digital filtering is used (i.e., in addition to the analog filters) to further remove noise and improve the input signal quality.
[0044] In one embodiment, the filtering takes place via a discrete Kalman Filter. A variety of other filtering methods can be used for noise reduction such as active noise reduction/cancelation techniques and low-pass, high-pass, or band-pass filters. Pre-noise reduction can be implemented in the hardware with standard analogue filters before the analog signal gets converted to digital.
[0045] After filtering, the digitized and filtered input sound signal 201 is divided into several parts of similar size that are called "windows 303". Windowing involves the slicing of the audio waveform into sliding frames. Two alternatives are the Hamming window and the Hanning window; a sinusoidal waveform is chopped off using these windows, and for both the amplitude drops off near the edges. In one embodiment, the Hanning window is used during the conversion of speech to tactile stimulation.
[0046] The processing then moves to the frequency domain 410, where a Discrete Fourier Transform (DFT) 420 is applied. The computational DFT is applied to extract information from the windows 303 in the frequency domain. The DFT 420 is applied to each window 303, and the estimated tone with its individual frequency is approximated. In one embodiment, the DFT 420 used during the conversion of speech to tactile stimulation is an FFT.
[0047] Based on the DFT 420, features can be extracted by the feature extraction processor 430. The signal contains the formant information contaminated by environment noise, where the low-pass filter removes most of the noise. The feature extraction processor 430 is based on the identification of a few high-power frequencies in phonemes, as shown in FIG. 2, which shows an example of the peaks detected for the vowel "/ɜ:/". These frequencies are called formants (i.e., formant frequencies 101e, 102e, and 103e). Starting from the lowest frequency, these are F1, F2, and F3, respectively. With these three formants, it is possible to identify different vowels.
[0048] In one embodiment, the feature extraction processor 430 works in the following way: (1) detecting the three peaks that represent the formant frequencies by using a highest-peak detection algorithm, which is focused on the specific ranges where the vowel phoneme formant frequencies lie;
(2) applying a filter which allows these frequencies to pass while rejecting other frequencies; and
(3) compressing the signal in the frequency dimension, resulting in shifting of the formant frequencies so that they fall into the vibrotactile range.
[0049] After the formants are extracted, the band-pass filters 440 are applied for further processing. According to one embodiment of the present invention, three band-pass filters 440 are provided or used, one for each formant, to reduce the band of the output frequencies. Each filter is described by its center frequency, its passband, and its quality factor (Q), which determines how narrow the curve is. The center frequency is set to correspond to each formant, and the bandwidth of each band-pass filter 440 is determined by the formant and is quite narrow, to effectively isolate the formants without picking up non-formant harmonics.
[0050] FIG. 6 is an illustration of the output of a band-pass filter 440 according to one embodiment of the present invention. According to the example illustrated by FIG. 6, if the formant F is 500Hz, the bandwidth provided by the band-pass filter 440 can range from 490Hz (f1) to 510Hz (f2), so that only these frequencies are allowed to pass.
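A digital counterpart of this narrow band-pass (cf. paragraph [0013], where band limiting may be done by setting signal values to zero) can be sketched as follows; the 500Hz formant and ±10Hz passband follow the FIG. 6 example, while the sampling rate and test signal are our assumptions:

```python
import numpy as np

def formant_bandpass(spectrum, freqs, center, half_bw=10.0):
    """Digital narrow band-pass: zero every bin outside
    [center - half_bw, center + half_bw]."""
    out = spectrum.copy()
    out[(freqs < center - half_bw) | (freqs > center + half_bw)] = 0.0
    return out

fs, n = 8000, 1024
t = np.arange(n) / fs
# Two tones: the 500Hz "formant" and an out-of-band 1230Hz component.
x = np.sin(2 * np.pi * 500.0 * t) + np.sin(2 * np.pi * 1230.0 * t)
freqs = np.fft.rfftfreq(n, d=1.0 / fs)
Y = formant_bandpass(np.fft.rfft(x), freqs, center=500.0)
y = np.fft.irfft(Y, n=n)   # only the 500Hz component survives
```

In practice one such mask (or analogue filter) is applied per formant, so each formant is isolated before the frequency-shifting stage.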
[0051] Referring again to FIG. 5 for the following description. According to one embodiment of the present invention, after the processing by the feature extraction processor 430 and the band-pass filters 440, the frequency-shifting processor 450 is applied. In one embodiment, the frequency-shifting processor 450 is configured to transfer the high-frequency samples to low-frequency samples while keeping all useful information. Moreover, the frequency shifting is implemented in real time (RT), with a latency that is not perceivable by human senses (<5ms). In the present disclosure, frequency shifting is an essential process wherein a processing function (i.e., one provided by the frequency-shifting processor 450) shifts or "maps" the audio frequencies into a subset of the range of frequencies that can be detected by touch.
[0052] FIG. 7 is an example that illustrates the principle of frequency shifting. A wider frequency range (acoustic) can be mapped into a narrower one (tactile) that enables perception of tactile stimuli. Many alternatives are available for the shifting of frequencies. The example illustrated by FIG. 7 is a sigmoid exponential function in which f_tct is the tactile frequency, f_aud is the audio frequency, and A, C1, and C2 are constants.
[0053] In the present disclosure, the data is generated based on audio frequencies and appropriate tactile representation frequencies for five long vowels: /i:/, /u:/, /a:/, /o:/, /ɜ:/. A variety of shifting/compression functions can be used, such as linear, sinusoidal, logarithmic, or polynomial functions.
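Since the exact sigmoid constants appear only in FIG. 7 and are not reproduced in this text, the sketch below uses the standard logistic form with illustrative values for A, C1, and C2; it compresses the wide acoustic range into the narrow tactile band, as the principle of FIG. 7 describes:

```python
import numpy as np

def audio_to_tactile(f_aud, A=400.0, C1=0.002, C2=1200.0):
    """Logistic (sigmoid) compression of an audio frequency (Hz) into the
    0..A Hz tactile band; A, C1, C2 are illustrative constants only."""
    return A / (1.0 + np.exp(-C1 * (f_aud - C2)))

# Formant-range audio frequencies map monotonically into the tactile band.
f_aud = np.array([200.0, 615.0, 1230.0, 2470.0, 3600.0])
f_tct = audio_to_tactile(f_aud)
```

The monotone mapping preserves the ordering of the formants while guaranteeing that every output frequency stays below the 400Hz tactile limit mentioned in paragraph [0054].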
[0054] After the processing by the frequency-shifting processor 450, in one embodiment of the present invention, another low-pass filter can be applied (with a cutoff tactile frequency at 400Hz) to limit the output frequency to the tactile-sensitive range (i.e., 50-400Hz). As illustrated by FIG. 5, after the frequency shifting, the signal is processed via the Inverse Discrete Fourier Transform (IDFT) processor 460. The output signal may need to be reconstructed with the corresponding modified (compressed/shifted) frequencies and modified amplitudes, as illustrated by means of an example in FIG. 8. For this, an inverse DFT is used, and the IDFT processor 460 can reconstruct the signal back into the time domain 300 from the frequency domain 410.
[0055] FIG. 8 is an example of frequency shifting according to one embodiment of the present invention, showing the final compressed signal spectrum and the original input signal spectrum of the phoneme /a:/. As illustrated by FIG. 8, in the output signal it can be observed that the frequencies are compressed to within the tactile frequency range of 0 to 400Hz. The first formant is shifted to 260Hz, the second to 350Hz, and the third to 400Hz.
[0056] Specifically, according to embodiments of the present invention, the frequency compression or frequency shifting can be done according to a sound-to-tactile frequency mapping based on specific mapping curves that produce two tactile frequencies for every audio frequency in the range. FIG. 9A and FIG. 9B illustrate examples of such sound-to-tactile frequency mapping curves. FIG. 9A shows a symmetric mapping according to one embodiment, and FIG. 9B shows an asymmetric mapping according to an alternative embodiment. In both cases, a key feature is the generation of two tactile frequencies for every audio frequency, which increases the distinctiveness of the tactile signal. This means that the ability of a user to distinguish and recognize a uniquely identifiable tactile stimulus that corresponds to a phoneme is enhanced. When two tactile frequencies are used for every formant audio frequency, the tactile pattern is richer and enables an easier distinction of tactile patterns by a user.
[0057] In one embodiment, as illustrated by FIG. 9B, the curve is skewed so that the tactile frequencies in the upper range are more densely populated around the frequency of maximum sensitivity of the Pacinian corpuscle, which according to the literature is around 250Hz.
[0058] The mapping of the inverted bell-shaped curves, as illustrated by FIG. 9A and FIG. 9B, can be implemented in several ways. In one embodiment, the conversion is implemented by using a look-up table. In another embodiment, the mapping is done by using inverted raised cosine functions or by constructing fitted polynomial functions. In the example shown in FIG. 9A, the audio frequencies are mapped to within the 0 to 550Hz tactile frequency range. 0-400Hz is a preferred range of tactile frequencies because 400Hz is below the audible range. This mapping is illustrated in FIG. 9B, whose curve shows that the curves can be non-symmetrical.
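One plausible construction of a symmetric two-branch mapping in the spirit of FIG. 9A is sketched below; the raised-cosine shape, range limits, and constants are our assumptions, not values taken from the figures. Each audio frequency yields one tactile frequency on a rising branch and one on a falling branch, both inside the 0-550Hz tactile range:

```python
import numpy as np

def two_tactile_frequencies(f_aud, f_lo=200.0, f_hi=3600.0, t_max=550.0):
    """Map each audio frequency in [f_lo, f_hi] to a pair of tactile
    frequencies: a rising branch in 0..t_max/2 and a falling branch in
    t_max/2..t_max, built from a raised cosine (illustrative shape)."""
    x = np.clip((f_aud - f_lo) / (f_hi - f_lo), 0.0, 1.0)
    bell = 0.5 * (1.0 - np.cos(np.pi * x))        # 0 -> 1 raised cosine
    t_low = (t_max / 2.0) * bell                  # rising branch
    t_high = t_max - (t_max / 2.0) * bell         # falling branch
    return t_low, t_high

t1, t2 = two_tactile_frequencies(np.array([615.0, 1230.0, 2470.0]))
```

With this symmetric construction the two branches always sum to t_max, mirroring the symmetry of the FIG. 9A curves; a skewed variant in the spirit of FIG. 9B would weight the branches toward 250Hz instead.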
[0059] As explained above, the generation of two tactile frequencies may enhance the ability of a user to distinguish the stimulus for each phoneme. Each excitation frequency generates a different waveform on the applicator surface. A combination of two excitation frequencies generates a combination of two waveforms, providing a more complex surface with improved resolution for the distinction and recognition of tactile stimulation patterns. Overall, each phoneme consists of three formant frequencies, and some of these frequencies are common among phonemes. Each identified audio frequency generates two tactile frequencies within the tactile frequency range. Therefore, for each phoneme a combination of six distinct waveforms is produced. This is like having six actuators producing different signals. In this way, there is a greater distinction between phonemes regarding the spatial excitation of the skin.
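The six-waveform superposition described in this paragraph can be illustrated as follows; the three pairs of tactile frequencies are hypothetical placeholders rather than outputs of the patent's mapping curves:

```python
import numpy as np

def synthesize_tactile(t, tactile_pairs, amplitude=1.0):
    """Sum one sinusoid per tactile frequency: three formants with two
    tactile frequencies each give six superposed waveforms per phoneme."""
    signal = np.zeros_like(t)
    for f_low, f_high in tactile_pairs:
        signal += amplitude * np.sin(2 * np.pi * f_low * t)
        signal += amplitude * np.sin(2 * np.pi * f_high * t)
    return signal

t = np.arange(400) / 8000.0                                # 50 ms at 8 kHz
pairs = [(120.0, 420.0), (180.0, 360.0), (240.0, 300.0)]   # hypothetical
drive = synthesize_tactile(t, pairs)
```

The resulting drive signal contains six spectral lines, one per component waveform, which is the richer excitation pattern the text compares to six separate actuators.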
[0060] FIG. 10 is a schematic illustration of a Tactile Stimulation Body (TSB) 900 according to one embodiment of the present invention. The TSB 900 is placed in contact with the user's skin. The process illustrated by FIG. 5 can be applied to a TSB 900 having an elastic body in a frustoconical shape. The TSB 900 includes an external housing 901, a miniature voice-coil actuator (VCA) 902 with a vibrating coil, a frustoconical housing 903, and an elastic body 904 in a conical shape. The overall structure of the TSB 900 can be supported by the external housing 901. The elastic body 904 can include a silicone elastomer cast inside the frustoconical housing 903, which is made of either paper or plastic. The frustoconical housing 903 may have the same structure as the cone of the miniature VCA 902 and is excited by the vibrating coil of the miniature VCA 902 from beneath.
[0061] The following description refers to FIG. 5. After the processing by the IDFT processor 460, in the time domain 300, processing steps identical with or similar to those of the A/D converter 301 and the windows 303 are performed, but in inverted order: windowing by the windows 470 followed by digital-to-analog (D/A) conversion by the D/A converter 480. After the D/A conversion by the D/A converter 480, the processing steps illustrated by FIG. 5 generate at least one vibration signal for the tactile applicator 500 (i.e., the tactile applicator 500 shown in FIG. 4). The TSB 900 of FIG. 10 can be coupled with the tactile applicator 500 so that the vibration signal can be fed into the TSB 900 for exciting the elastic body 904 through the miniature VCA 902, to achieve tactile stimulation for recognition.
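For context, the forward analysis step that feeds this chain — extracting a dominant "key feature" frequency from a windowed frame, as recited in claim 1 — could be sketched as follows. The naive DFT and the 300–3000 Hz search band are illustrative assumptions; a practical implementation would use an FFT:

```python
import cmath
import math

def dominant_frequency(frame, rate, band=(300.0, 3000.0)):
    """Return the strongest frequency inside `band` for one frame,
    computed with a naive DFT over only the bins inside the band."""
    n = len(frame)
    best_f, best_mag = band[0], -1.0
    for k in range(1, n // 2):
        f = k * rate / n                       # bin centre frequency (Hz)
        if band[0] <= f <= band[1]:
            mag = abs(sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n)
                          for i in range(n)))
            if mag > best_mag:
                best_mag, best_f = mag, f
    return best_f
```

The returned frequency would then be passed through the sound-to-tactile mapping before the inverse transform and D/A stages described above.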
[0062] The functional units and modules of the apparatuses and methods in accordance with the embodiments disclosed herein may be implemented using computing devices, computer processors, or electronic circuitries including but not limited to application specific integrated circuits (ASIC), field programmable gate arrays (FPGA), microcontrollers, and other programmable logic devices configured or programmed according to the teachings of the present disclosure. Computer instructions or software codes running in the computing devices, computer processors, or programmable logic devices can readily be prepared by practitioners skilled in the software or electronic art based on the teachings of the present disclosure.
[0063] All or portions of the methods in accordance with the embodiments may be executed in one or more computing devices including server computers, personal computers, laptop computers, and mobile computing devices such as smartphones and tablet computers.
[0064] The embodiments may include computer storage media and transient and non-transient memory devices having computer instructions or software codes stored therein, which can be used to program or configure the computing devices, computer processors, or electronic circuitries to perform any of the processes of the present invention. The storage media and memory devices can include, but are not limited to, floppy disks, optical discs such as Blu-ray Discs, DVDs, and CD-ROMs, magneto-optical disks, ROMs, RAMs, flash memory devices, or any other type of media or devices suitable for storing instructions, codes, and/or data.
[0065] Each of the functional units and modules in accordance with various embodiments also may be implemented in distributed computing environments and/or Cloud computing environments, wherein the whole or portions of machine instructions are executed in distributed fashion by one or more processing devices interconnected by a communication network, such as an intranet, Wide Area Network (WAN), Local Area Network (LAN), the Internet, and other forms of data transmission medium.
[0066] The foregoing description of the present invention has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations will be apparent to the practitioner skilled in the art.

[0067] The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to understand the invention for various embodiments and with various modifications that are suited to the particular use contemplated.

Claims

What is claimed is:
1. A method for enabling comprehension of speech by tactile stimulation, comprising:
receiving an audio input;
performing an analog-to-digital conversion on the audio input;
performing a Discrete Fourier Transformation on the audio input to generate a frequency domain representation of the audio input;
extracting one or more key features of the audio input based on a frequency band in the frequency domain representation, wherein the key features contain frequencies that are representative of at least one vowel; and
converting frequencies of the extracted key features to one or more tactile frequencies, which comprises performing a mapping process on the frequencies to shift or map them into the one or more tactile frequencies, such that the mapping process transfers the frequencies from a high-frequency interval to a low-frequency interval, wherein the low-frequency interval is within a perceivable range of tactile frequencies.
2. The method according to claim 1, wherein the perceivable range of the tactile frequencies refers to being perceivable by Pacinian corpuscles.
3. The method according to claim 1, further comprising:
converting the frequencies in the low-frequency interval into a time-domain signal;
converting the time-domain signal from digital to analogue; and
applying the time-domain signal to a tactile applicator.
4. The method according to claim 1, wherein the mapping process is achieved according to a sound-to-tactile frequency mapping based on specific mapping curves that produce two tactile frequencies for every audio frequency in the low-frequency interval.
5. The method according to claim 4, wherein the converting of the frequencies, comprising the mapping process, is implemented via a function that yields a bell-shaped curve.
6. The method according to claim 5, wherein the bell-shaped curve is slightly asymmetric to provide bias towards frequencies that tend to be more easily perceived.
7. The method according to claim 5 or claim 6, wherein the bell-shaped curve is shifted to produce a bias towards higher or towards lower frequencies.
8. The method according to claim 5, wherein the shape of the bell-shaped curve is arranged so that at least 70% of the converted frequencies are in a range between 280 Hz and 320 Hz within the low-frequency interval.
9. The method according to claim 4, wherein the converting of the frequencies, comprising the mapping process, is implemented by a polynomial function.
10. The method according to claim 4, wherein the converting of the frequencies, comprising the mapping process, is implemented by an inverted raised cosine function.
11. The method according to claim 1, wherein the extracted key features comprise two frequencies for two vowels, and two corresponding tactile frequencies are generated for each of the extracted key features.
12. The method according to claim 1, wherein the extracted key features comprise three frequencies for three vowels, and two corresponding tactile frequencies are generated for each of the extracted key features.
13. A tactile applicator device, comprising: an elastomer body stimulated by at least one vibration signal generated from the frequencies in the low-frequency interval by performing the method according to any one of claims 1 to 12.
14. A device for enabling comprehension of speech by tactile stimulation, comprising:
a sound receiver for receiving an audio input;
a filter for noise reduction of the audio input; and
a converter configured to:
receive the audio input from the filter;
perform an analog-to-digital conversion on the audio input;
perform a Discrete Fourier Transformation on the audio input to generate a frequency domain representation of the audio input;
extract one or more key features of the audio input based on a frequency band in the frequency domain representation, wherein the key features contain frequencies that are representative of at least one vowel; and
convert frequencies of the extracted key features to one or more tactile frequencies, comprising performing a mapping process on the frequencies to shift or map them into the one or more tactile frequencies, such that the mapping process transfers the frequencies from a high-frequency interval to a low-frequency interval, wherein the low-frequency interval is within a perceivable range of tactile frequencies.
15. The device according to claim 14, further comprising a tactile applicator configured to receive a vibration signal resulting from the converter, such that the tactile applicator can excite an elastomer body for tactile stimulation for recognition.

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IB2024/054167 WO2025229374A1 (en) 2024-04-30 2024-04-30 A method for speech recognition using tactile stimulation


Publications (1)

Publication Number Publication Date
WO2025229374A1 (en) 2025-11-06

Family

ID=91030179




Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150070261A1 (en) * 2013-09-06 2015-03-12 Immersion Corporation Haptic Conversion System Using Frequency Shifting
US20180300996A1 (en) * 2017-04-17 2018-10-18 Facebook, Inc. Haptic communication system using broad-band stimuli
US10222864B2 (en) 2017-04-17 2019-03-05 Facebook, Inc. Machine translation of consonant-vowel pairs and syllabic units to haptic sequences for transmission via haptic device
US10754428B1 (en) * 2019-02-25 2020-08-25 Facebook Technologies, Llc Systems, methods, and devices for audio-tactile mapping


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24724628

Country of ref document: EP

Kind code of ref document: A1