
WO2024254467A2 - Systems and methods for target signal extraction and noise cancellation - Google Patents

Systems and methods for target signal extraction and noise cancellation

Info

Publication number
WO2024254467A2
WO2024254467A2 (PCT/US2024/033033)
Authority
WO
WIPO (PCT)
Prior art keywords
signals
target
binaural
sound
examples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/033033
Other languages
English (en)
Other versions
WO2024254467A3 (fr)
Inventor
Bandhav VELURI
Malek ITANI
Shyamnath GOLLAKOTA
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Washington
Original Assignee
University of Washington
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Washington filed Critical University of Washington
Publication of WO2024254467A2 publication Critical patent/WO2024254467A2/fr
Publication of WO2024254467A3 publication Critical patent/WO2024254467A3/fr
Anticipated expiration legal-status Critical
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00Details of transducers, loudspeakers or microphones
    • H04R1/10Earpieces; Attachments therefor ; Earphones; Monophonic headphones
    • H04R1/1083Reduction of ambient noise
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2460/00Details of hearing devices, i.e. of ear- or headphones covered by H04R1/10 or H04R5/033 but not provided for in any of their subgroups, or of hearing aids covered by H04R25/00 but not provided for in any of its subgroups
    • H04R2460/01Hearing devices using active noise cancellation

Definitions

  • Examples described herein relate generally to audio systems that provide noise cancellation and extraction of target signals from an environment.
  • Noise cancellation systems (e.g., earbud systems like AirPods Pro and AirPods Max) may achieve reasonable noise cancellation in practical scenarios.
  • Speech systems have predominantly focused on improving the performance of speech-related tasks for in-ear devices (e.g., AirPods), telephony (e.g., Microsoft Teams), and voice assistants (e.g., Google Home). Oftentimes, these systems collectively regard all non-speech sounds simply as noise.
  • Acoustic transparency mode for in-ear devices tries to imitate the sound response of an open-ear system by transmitting the appropriate signals into the ear canal. Like active noise cancellation, this is agnostic to the sounds themselves. Adaptive transparency on Apple AirPods is designed to automatically reduce the amplitude of loud sounds. However, this does not allow the user to pick and choose which sounds to hear.
  • An example method includes receiving audio signals from at least one microphone, generating noise cancellation signals based at least in part on the audio signals, extracting target signals based at least in part on the audio signals using digital neural network processing, and providing the noise cancellation signals and the target signals from at least one speaker.
  • Generating noise cancellation signals may include using signal processing.
  • the generation of noise cancellation signals is performed independent of extracting target signals.
  • Example methods may also include receiving an indication of a sound class from a user.
  • the digital neural network processing may be configured to extract the target signals based at least in part on the indication of the sound class.
  • Extracting the target signals may include extracting signals originating from one or more human speakers.
  • the target signals are binaural signals
  • the methods may include preserving a directionality of the binaural signals.
  • a time to generate a portion of said noise cancellation signals based on a portion of the audio signals is a shorter time than a time to extract a portion of said target signals based on the portion of the audio signals.
  • using digital neural network processing includes using an encoder-decoder architecture.
  • An example system includes a plurality of microphones, the plurality of microphones configured to generate binaural audio signals from an environment.
  • the system also includes at least one processor.
  • the system also includes at least one non-transitory computer readable medium encoded with instructions which, when executed by the at least one processor, cause the system to perform operations including utilizing a trained neural network to extract target signals from the binaural audio signals based on a sound class for the target signals, where the target signals comprise binaural target signals. The system also includes a plurality of speakers configured to play back the binaural target signals.
  • the operations may also include receiving an indication of the sound class from a user.
  • the plurality of microphones and the plurality of speakers may be disposed in a headset.
  • the plurality of microphones and the plurality of speakers may be disposed in a hearing aid.
  • Examples of systems may also include signal processing circuitry, the signal processing circuitry configured to generate noise cancelling signals based on the binaural audio signals.
  • the sound class may include human speakers.
  • the sound class includes a class of sounds to eliminate from the binaural audio signals.
  • the trained neural network includes an encoder-decoder architecture including an encoder configured to encode the binaural audio signals independent of the target signals to generate encoded data, and a decoder configured to condition the encoded data with an embedding to provide conditioned data and extract the target signals based on the conditioned data.
  • FIG. 1 is a schematic illustration of a system arranged in accordance with examples described herein.
  • FIG. 2 is a schematic timing diagram illustrating components contributing to end-to- end latency in binaural acoustic processing systems.
  • FIG. 3 is a schematic illustration of a framework for a neural network arranged in accordance with examples described herein.
  • FIG. 4 is a schematic illustration of an example encoder for use in neural networks described herein.
  • FIG. 5 is a schematic illustration of an example decoder for use in neural networks described herein.
  • FIG. 6 is a sheet of Equations referred to herein.
  • Examples described herein provide functionality for hearable devices, which allow users to program acoustic scenes in real-time and choose the sounds they want to hear from real-world environments, while also preserving the spatial cues in the target sounds. For example, users may listen to the birds chirping in a park without hearing the chatter from other hikers. In another example, users may block out traffic noise on a busy street while still being able to hear emergency sirens and car honks.
  • Examples described herein include examples of neural networks that can achieve binaural target sound extraction in the presence of interfering sounds and background noise.
  • Example training methodologies are described that allow systems to generalize to real-world use.
  • Semantic hearing systems and methods described herein generally utilize an understanding of the semantics of various natural and artificial sounds in real-time, in the presence of interfering sounds, and determine which sounds to allow and which to block, based on user input. Speech may be one amongst many other sound classes in systems described herein.
  • Examples described herein may advantageously address challenges relating to focusing on sounds using in- or over-ear devices (e.g., headsets).
  • One set of challenges may be real-time use of such systems.
  • the sound output should be generally synced with the user’s visual senses. This may involve real-time processing that satisfies stringent latency requirements. Generally, a latency of less than 20-50 ms may be preferred in some examples. This may involve identifying the target sounds using 10 ms or less of audio blocks, separating them from interfering sounds, and then playing them back, all on a computationally-constrained device like a smartphone.
  • Examples described herein include semantic hearing systems and methods which may program a binaural acoustic scene based on semantic sound descriptions (e.g., sound classes).
  • Examples of neural networks are described that may achieve binaural target sound separation and demonstrate that the network can run in real-time on smartphones.
  • Examples of training methodologies are described to generalize a system to unseen real-world environments, and users.
  • An example system is implemented using certain off-the-shelf hardware to show that an example system may achieve goals described herein in real-world environments.
  • binaural processing is provided. Sounds arrive at the two ears with different delays and attenuations.
  • the physical separation between the two ears and the reflections/diffraction from the wearer’s head, e.g., the head-related transfer function, provide cues for spatial perception.
  • a binaural output may be used to preserve or recover this spatial information for the target sounds across the two ears.
  • neural network models described herein may have real-world generalization. Training and testing a neural network on synthetic data may be used. Examples of binaural target sound extraction networks may generalize to real-world hearable applications. The complexity of real-world reverberations and head-related transfer functions (HRTFs) may be addressed in simulations. Generalization to in-the-wild use in unseen acoustic environments across different users may be used.
  • HRTFs head-related transfer functions
  • Some implemented results show that an example system can operate with 20 sound classes and that an example transformer-based network has a runtime of 6.56 ms on a connected smartphone.
  • In-the-wild evaluation with participants in previously unseen indoor and outdoor scenarios shows that an example implemented system can extract the target sounds and generalize to preserve the spatial cues in its binaural output.
  • Examples of systems and methods described herein may receive audio signals from at least one microphone.
  • Example systems and methods may generate noise cancellation signals based at least in part on the audio signals and extract target signals based at least in part on the audio signals using digital neural network processing.
  • Example systems and methods may provide the noise cancellation signals and the target signals from at least one speaker.
  • FIG. 1 is a schematic illustration of a system arranged in accordance with examples described herein.
  • the system of FIG. 1 includes headset 102 and computing system 112.
  • the headset 102 may include speaker 104, speaker 106, microphone 108, microphone 110 and noise-cancelling circuitry 120.
  • the noise-cancelling circuitry 120 may be coupled to speaker 104, speaker 106, microphone 108, and microphone 110.
  • the headset 102 may be coupled to (e.g., in communication with) computing system 112.
  • the computing system 112 may include processor 114, computer readable media 116, and user interface 122.
  • the computer readable media 116 may include executable instructions for target signal extraction 118 including neural network 124.
  • The components shown in FIG. 1 are exemplary only. Additional, fewer, and/or different components may be used in other examples.
  • Examples of systems described herein may include one or more microphones and one or more speakers, such as speaker 104, speaker 106, microphone 108, and microphone 110 of FIG. 1.
  • the microphones and/or speakers may be provided in any of a variety of form factors. As shown in FIG. 1, the speakers and microphones may form all or part of a headset, such as headset 102.
  • microphones and/or speakers may be provided in one or more ear buds.
  • the speaker 104 and microphone 108 may be provided in an enclosure formed as an ear bud.
  • the speaker 106 and microphone 110 may be provided in an enclosure formed as an ear bud.
  • the microphones and/or speakers may be provided in one or more ear cups or other structures which may be shaped to sit on and/or over the ears of a user. In some examples, the microphones and/or speakers may be provided in a hearing aid which may sit in and/or proximate the ears of a user. While two microphones and two speakers are shown, any number may be used. Additional components may be provided in headset 102 which may be coupled to the components shown, such as speaker 104, speaker 106, microphone 108, microphone 110, and/or noise-cancelling circuitry 120. For example, drivers may be provided for one or more speakers.
  • a headband may be provided to generally be located above a user's head and may hold ear cups in place.
  • One or more interfaces for wired or wireless electrical connectivity may be provided including, but not limited to, one or more 3.5mm jacks, USB interfaces, Bluetooth, and/or WiFi interfaces.
  • two microphones may be used to receive audio signals from an environment, such as microphone 108 and microphone 110.
  • the two microphones may be referred to as binaural microphones.
  • the microphones may be used to receive audio signals, which may also be referred to as acoustic signals.
  • the microphones may receive audio signals which may be referred to as binaural signals.
  • the speakers may provide (e.g., play back) audio signals which may be referred to as binaural signals.
  • Binaural signals generally refer to signals associated with a same audio source received and/or transmitted at different locations.
  • binaural signals may refer to audio signals received by and/or associated with both ears of a listener.
  • Binaural signals typically preserve a directionality of an audio signal. For example, receipt and/or playback of binaural signals (e.g., through speaker 104 and speaker 106) may preserve a direction of one or more sound sources.
  • Headsets described herein may be noise-cancelling headsets.
  • Noise-cancelling headsets may generally generate noise cancellation signals intended to cancel audio signals that were generated in an environment and received at the noise-cancelling headset.
  • the noise cancellation signals may be based on the audio signals received at microphones described herein, such as microphone 108 and microphone 110.
  • examples described herein may utilize signal processing techniques to generate noise cancellation signals.
  • the noise cancellation signals may have amplitudes and/or frequencies calculated to cancel the received audio signals.
  • Analog signal processing may be used to generate the noise cancellation signals.
  • Analog signal processing refers to the processing of continuous-time signals. Analog signals have a value that may vary over time, and the value may be representative of a physical measurement (e.g., an audio frequency and/or amplitude).
  • Analog signal processing generally uses analog components (e.g., circuitry including one or more resistors, capacitors, inductors, diodes, and/or transistors). Analog signal processing may accordingly occur generally in real-time, which may be advantageous to reflect real-time changes in audio sources in the environment.
  • Real-time refers to a latency with which it is possible to generate effective noise cancellation in an environment.
  • the analog signal processing may generate noise cancellation signals within less than a millisecond, on the order of microseconds in some examples. Other time scales may be used.
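  • To illustrate the underlying idea in digital form (the circuitry described above would perform this in the analog domain with microsecond-scale latency), the sketch below generates an anti-noise signal by phase inversion. This is a conceptual illustration only; practical noise cancellation typically uses adaptive filtering, and the example tone and values here are made up.

```python
import numpy as np

# Conceptual (digital) illustration of noise cancellation by phase inversion.
sample_rate = 44_100
t = np.arange(sample_rate) / sample_rate
ambient = 0.3 * np.sin(2 * np.pi * 200 * t)   # example ambient noise tone

anti_noise = -ambient                         # inverted cancellation signal
residual = ambient + anti_noise               # what the listener would hear

print(np.max(np.abs(residual)))               # ~0.0 for this idealized case
```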
  • Examples of systems described herein may accordingly include noise-cancelling circuitry, such as noise-cancelling circuitry 120 of FIG. 1.
  • the noise-cancelling circuitry 120 is coupled to microphone 108 and microphone 110.
  • the noise-cancelling circuitry 120 may receive analog audio signals from an environment, which may be binaural signals.
  • the noise-cancelling circuitry 120 may generate noise cancellation signals using signal processing techniques, such as analog signal processing.
  • the noise-cancelling circuitry 120 may provide the noise cancellation signals to speakers, such as speaker 104 and speaker 106, which may play back the noise cancellation signals. Playback of the noise cancellation signals may result in a generally quiet and/or silent listening environment for a user as the noise cancellation signals may cancel audio signals received at the headset 102.
  • the noise-cancelling circuitry 120 may be implemented using any number of analog components in some examples, including one or more resistors, capacitors, inductors, diodes, and/or transistors.
  • Examples of systems described herein may include one or more computing systems, such as computing system 112 of FIG. 1.
  • the computing system 112 may be implemented using, for example, one or more computers, tablets, laptops, desktops, servers, smartphones, smartspeakers, cellular phones, wearable devices, appliances, and/or vehicles.
  • the computing system 112 may be in communication with headset 102 in some examples.
  • the computing system 112 may communicate with the headset 102 using Bluetooth, ZigBee, USB, 3.5mm wired connection, WiFi, and/or other communication interface.
  • audio signals received at microphone 108 and/or microphone 110 may be communicated to the computing system 112.
  • Target signals generated by the computing system 112 may be communicated to the headset 102 for playback by speaker 104 and/or speaker 106.
  • the computing system 112 may be wholly and/or partially integrated with the headset 102.
  • a noise-cancelling headset may be used that provides user access to the microphone data, such as the Sennheiser AMBEO Smart Headset. Examples of systems described herein may be implemented on such a device using fewer wires and may directly connect to the smartphone at a single point, without the need for an additional pair of binaural earphones.
  • Examples of computing systems described herein may include one or more processors and computer readable media, such as processor 114 and computer readable media 116 of FIG. 1.
  • Processors described herein may generally be implemented using any processor circuitry, including one or more processors, one or more processor cores, field programmable gate arrays (FPGAs), central processing units (CPUs), graphical processing units (GPUs), application specific integrated circuits (ASICs), microcontrollers, and/or embedded processors.
  • Computer readable media may generally be implemented using memory, random access memory (RAM), solid state drives (SSDs), read only memory (ROM), SD cards, and/or disk drives. While a single processor 114 and computer readable media 116 are shown in FIG. 1, any number may be used.
  • Software may be used to implement operations described herein.
  • the computer readable media 116 of FIG. 1 may be encoded with instructions which, when executed, cause the processor 114 to perform operations described herein.
  • the computer readable media 116 may include executable instructions for target signal extraction 118.
  • Computer readable media 116 may store data used in operations described herein, such as indications of sound classes, sound sources, and/or embeddings described herein. While a computer readable media 116 is shown as including executable instructions for target signal extraction 118, it is to be understood that multiple computer readable media may be used.
  • Examples of software described herein may utilize neural network processing to extract target signals from audio signals received from an environment.
  • the executable instructions for target signal extraction 118 may include neural network 124.
  • the neural network 124 may be used to extract target signals from the audio signals, such as the audio signals received at microphone 110 and/or microphone 108.
  • audio signals may be provided from headset 102 (e.g., from microphone 108 and/or microphone 110) to the computing system 112.
  • the audio signals may be converted into digital signals either before and/or after being provided to the computing system 112.
  • the headset 102 may convert the audio signals to digital signals.
  • the computing system 112 may convert the audio signals to digital signals.
  • Digital processing, such as digital neural network processing, may be used to extract target signals from the audio signals.
  • the audio signals provided to the computing system 112 may be binaural audio signals.
  • the target signals extracted may be binaural target signals.
  • the neural network 124 may be a neural network trained to extract target signals from the audio signals.
  • the neural network 124 may be trained using supervised and/or unsupervised learning techniques or other learning techniques.
  • the neural network 124 may be trained to extract a particular kind or type of target signals.
  • the neural network 124 may be trained to extract signals originating from a particular class of sound sources in an environment.
  • the neural network 124 may be trained to extract signals originating from one or more human speakers in an environment.
  • the neural network 124 may be trained to extract signals from audio signals such that sounds made by sources in an environment that are undesirable to hear are wholly and/or partially removed.
  • Examples of neural networks described herein may be capable of achieving binaural target sound extraction.
  • An example network (e.g., an example of neural network 124) takes two audio signals from microphones at the two ears (e.g., microphone 108 and microphone 110 of FIG. 1) as binaural input and outputs two audio signals as binaural output, while preserving the directionality of the target sounds in the acoustic scene.
  • the neural network 124 may start with a single-channel (e.g., not binaural) transformer model for target signal extraction.
  • the network may be optimized for real-time operations on computing systems, such as smartphones.
  • the neural network 124 may include a network that jointly processes the binaural input signals, allowing the network to preserve the spatial information about the target sounds and output binaural audio. This joint processing may be more effective at binaural target sound extraction and may have less (e.g., half) the computational cost of processing the binaural input signals separately.
  • Examples of a training methodology are described that may allow a binaural network to generalize to real-world situations, such as reverberations, multipath, and HRTFs.
  • Obtaining training data in fully natural environments can be difficult because mixtures may be captured without access to the ground truth sounds needed for supervised learning.
  • training a network that can generalize to in-the-wild use with hearables generally involves training data that captures reverberations, multipath, and HRTFs across a large number of users.
  • examples described herein synthesize training data using multiple datasets.
  • an HRTF dataset is used, which includes measurements from users in non-reverberant environments.
  • the room impulse responses may be convolved with thousands of examples from different audio classes (e.g., 20 different audio classes) to generate both mixtures and ground truth binaural audio. However, this may not capture the reverb and multipath in realistic environments. Therefore, these synthesized mixtures may be augmented with training data synthesized from three different datasets that provide binaural room impulse responses captured in real rooms. This helps example networks generalize to users and real-world environments that are not in the training dataset.
  • the neural network 124 may be implemented using an encoder-decoder architecture.
  • the neural network 124 may include an encoder.
  • the encoder may encode audio signals.
  • the audio signals may be encoded in a manner independent of the target signals. For example, an indication of sound class, sound source, or other enrollment for target signal extraction may not be used to encode the audio signals by the encoder.
  • the neural network 124 may include a decoder.
  • the decoder may condition the encoded data with an embedding to provide conditioned data and extract the target signals based on the conditioned data.
  • the embedding may be indicative of a sound class and/or sound source as described herein.
  • digital neural network processing may be used to extract target signals from audio signals described herein.
  • the neural network processing may operate on digital signals. Utilizing digital signal processing and/or utilizing software executed by one or more processors may generally cause the process of extracting target signals to be slower than the process of generating noise cancellation signals described herein. Note that the generation of noise cancellation signals by systems described herein may be independent of the extraction of target signals from the audio signals.
  • One process (e.g., signal processing) may generate the noise cancellation signals faster than another process (e.g., digital neural network processing) extracts the target signals.
  • the noise-cancelling circuitry 120 may generate noise cancellation signals using signal processing techniques (e.g., analog signal processing). This process may be real-time (e.g., near real-time). For example, the generation of noise cancellation signals may occur within less than a millisecond in some examples. While the noise-cancelling circuitry 120 is generating noise cancellation signals based on received audio signals, the headset 102 is also providing the audio signals to the computing system 112. The computing system 112 may process the audio signals using digital neural network processing to extract target signals. The digital neural network processing may take a longer time than the generation of the noise cancellation signals. In some examples, the digital neural network processing may take milliseconds of time.
  • the target signals may be extracted as digital signals.
  • the target signals may be converted into audio signals at either computing system 112 and/or headset 102 for playback by speaker 104 and/or speaker 106.
  • speakers described herein may produce noise-cancelling signals and target signals. While each of speaker 104 and speaker 106 is shown in FIG. 1 as generating both noise-cancelling signals and target signals, in other examples one speaker may generate noise-cancelling signals and another speaker may generate target signals. Accordingly, a user listening to the output of speakers described herein, such as speaker 104 and speaker 106, may hear target signals clearly, with other portions of the audio signals suppressed. Note that the noise cancellation signals will be played by the speakers at an earlier time than the target signals, due to the longer time taken by the digital neural network processing to generate the target signals. Moreover, the communication between the headset 102 and computing system 112 may further delay the target signals. However, this delay is likely acceptable to users because the noise cancellation generated through signal processing is adequate to cancel incoming audio sounds, creating a cancelled sound environment into which the target signals can be played back, even if they are played back at a delay.
  • the audio signals provided by microphones described herein may be binaural audio signals. Accordingly, binaural audio signals may be used to generate noise cancellation signals.
  • the noise cancellation signals may be binaural noise cancellation signals.
  • the binaural audio signals may also be used to extract target signals using digital neural network processing. Accordingly, the target signals may be binaural target signals. In this manner, the target signals output from the speakers may preserve a directionality present in the binaural audio signals (e.g., a directionality of one or more sound sources).
  • Examples of systems described herein may include a user interface, such as user interface 122 of FIG. 1. While user interface 122 is shown in computing system 112, in some examples, some or all of user interface 122 may be implemented in headset 102.
  • the user interface 122 may be implemented, for example, as a button, a touchscreen, a speaker for receipt of audio input commands (which may be implemented using another speaker described herein), a display, a keyboard, and/or a mouse. Other input devices may also be used.
  • the user interface 122 may be used for a user to input information used by the executable instructions for target signal extraction 118 to extract target signals.
  • a sound class and/or sound source may be input by a user using the user interface 122.
  • a sound class refers to sounds which may be specified by semantic description (e.g., a semantic name for a particular type or source of sound).
  • a sound class may be in contrast to directional hearing, which may refer to the ability to hear sounds from a particular direction.
  • Examples of sound classes include, but are not limited to, mechanical sounds, animal sounds, human speech sounds, vehicle sounds, etc.
  • Examples of sound classes include particular audio sources.
  • Examples of sound classes include emergency sirens, ocean sounds, human speech, alarm clock sounds, baby sounds, street noise, and/or birds chirping.
  • the user may use the user interface 122 to provide an indication of a sound class.
  • the indication of the sound class may be used (e.g., by processor 114 and computer readable media 116) to extract the target signals.
  • the neural network 124 may be trained to extract target signals belonging to a selected sound class.
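  • One simple way to pass the user's selection to a neural network is as a multi-hot query vector over the supported sound classes, consistent with the query vector q used by the mask estimation network described later. The class subset and helper below are illustrative assumptions, not an interface defined by this disclosure.

```python
import numpy as np

# Illustrative subset of sound classes (the implementation described herein uses 20).
SOUND_CLASSES = ["alarm clock", "birds chirping", "siren", "speech", "ocean"]

def make_query_vector(selected: list[str]) -> np.ndarray:
    """Encode user-selected sound classes as a multi-hot query vector q."""
    q = np.zeros(len(SOUND_CLASSES), dtype=np.float32)
    for name in selected:
        q[SOUND_CLASSES.index(name)] = 1.0
    return q

# Example: the user chooses to hear sirens and birds; everything else is cancelled.
q = make_query_vector(["siren", "birds chirping"])
print(q)   # [0. 1. 1. 0. 0.]
```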
  • a sound class and/or sound source may be identified for suppression.
  • a user may use user interface 122 to provide an indication of a sound class to be suppressed.
  • the executable instructions for target signal extraction 118 may accordingly extract target signals from the audio signals where the target signals represent the audio signals without the audio generated by the sound class.
  • the user interface 122 may display multiple sound classes. A user may select one or more sound classes to hear and one or more sound classes to suppress.
  • automatic speech recognition may be used to provide an indication of sound classes. For example, a user could say, “Only allow sounds from birds.” Speech-to-text and intent classification systems can be used to transcribe these speech commands and identify the target sound class as “bird chirps,” and then a neural network can be used to perform the task.
  • a described example herein utilized 20 sound classes in an implementation to push the system to its limits and demonstrate that a real-time binaural neural network has the capability to operate with a decent number of classes. However, from a user interface perspective, one may want to limit the number of classes for a better user experience. Accordingly, fewer classes may be used in other examples.
  • the indication of sound classes to hear and sound classes to suppress may be used by neural networks described herein to extract target signals for playback. Accordingly, examples of binaural target sound extraction can also be used to subtract the identified sounds and play the residual sounds (which would be the target signals in those examples) into the ear. For example, computer typing and/or hammer sounds may be removed from audio signals to focus on human speech. This can be beneficial when the user knows the specific type of environmental noise that they feel is annoying (e.g., computer typing in an office room) as this approach would remove only the specified noise and thus allow the user to focus on the speech and the other sounds in the environment.
  • users of systems described herein may provide an indication of one or more audio sources for target signal extraction.
  • the indication of audio sources may be, for example, an indication of one or more sound classes and/or one or more human speakers in an environment.
  • Processors described herein, such as processor 114 may enroll the audio sources and/or may facilitate enrollment of the audio source(s). Enrollment generally refers to the process of generating an enrollment signal (e.g., an embedding, also referred to as an embedding vector in some examples) that may be indicative of characteristics of the target audio source(s) and may subsequently be used to extract target signals from received audio signals.
  • the processor 114 may perform the enrollment (e.g., the computer readable media 116 may include executable instructions for generating enrollment signals).
  • another computing system may perform the enrollment.
  • the processor 114 may provide audio signals received to another computing system (e.g., a cloud computing system).
  • the processor 114 may transmit the audio signals through a wired or wireless connection.
  • the other computing system may provide the enrollment signals back to the system of FIG. 1, and the processor 114 may store the enrollment signals in the computer readable media 116 and/or other computer readable media.
  • Audio signals that are received and/or provided by microphones described herein may be used to provide both noise cancellation signals and target signals which may be extracted from the audio signals.
  • The extracted target signals may correspond to target audio sources (e.g., target human speakers and/or sound classes).
  • speakers described herein may provide noise-cancelling signals and target signals.
  • the noise-cancelling signals generally may be intended to create a silent environment in the ear of a hearer, and then the target signals may be intended to provide the audio generated by particular audio source(s), or by sources other than particular audio source(s) in some examples.
  • users of systems described herein may hear desired audio sources with increased clarity. Because enrollment signals are used to extract the target signals, the extraction may be generally robust as the user and/or the target audio source(s) move through the environment in the presence of other interfering signals.
  • Examples of systems described herein, such as the example system of FIG. 1, may be implemented using a variety of form factors including one or more noise-cancelling headsets, smartphones, ear buds, headbands, hearing aids and/or other wearable devices.
  • Examples of systems described herein may be used in a variety of use cases. For example, systems described herein may be used to more clearly hear a speaker in a crowded or noisy area. Examples include waiting rooms, classrooms, performances, outdoor environments, and/or indoor environments having mechanical or other noise. Examples of systems described herein may be used to provide target speaker signals in hearing aids. Note that noise-cancelling may be of less benefit and may not be used in hearing aids, because the user may already hear the noise sounds poorly in any event.
  • a user may wear ear-worn devices on a beach and input an indication to listen to the calming sounds of the ocean (e.g., a sound class of ocean sounds) while suppressing any human speech nearby (e.g., a sound class of human speech).
  • a user may provide an indication to listen only to a sound class of emergency sirens.
  • a user may provide an indication to listen to a class of alarm clock sounds or baby sounds but to suppress noise from the street (e.g., a sound class of street noise).
  • a user may be on a plane and input to hear human speech and announcements but to suppress the sound of a crying baby.
  • examples described herein may program the output acoustic scene in real-time at least in part by semantically associating the individual incoming sounds with user input to determine which sounds to allow in and which sounds to block.
  • Examples of systems described herein may program the acoustic environment with imperceptible latency such that the target sounds of interest are present but interfering sounds are suppressed.
  • the computation may not be desirably performed in the cloud, but may operate in real-time using computationally-constrained devices like smartphones (e.g., on computing system 112 of FIG. 1).
  • the target signals extracted by the neural network should originate from the same spatial directions as the real- world target sounds.
  • examples described herein may advantageously have: 1) real-time low-latency operation, and 2) binaural real-world generalization.
  • FIG. 2 is a schematic timing diagram illustrating components contributing to end-to- end latency in binaural acoustic processing systems.
  • FIG. 2 depicts acoustic signals, such as the audio signals which may be received by microphone 108 and microphone 110 of FIG. 1. Blocks for processing time are shown. A resulting output of speakers is depicted, such as speaker 104 and speaker 106 of FIG. 1.
  • Received audio signals may be stored into two memory buffers of the binaural microphones, e.g., microphone 108 and microphone 110.
  • the acoustic data from the two microphones in each block may then be fed into a neural network (e.g., neural network 124 of FIG. 1) that outputs a block-length worth of binaural target sound data.
  • This binaural output may then be played back through the two speakers on the headset (e.g., speaker 104 and speaker 106).
  • this end-to-end latency should preferably be less than 50 ms in some examples, or less than 20-50 ms in some examples.
  • the buffer duration, the lookahead duration, and the processing time may all be reduced. Note that a small buffer duration of, say, 10 ms means that the target signal extraction technique has only a 10 ms block of data to not only understand the semantics of the acoustic scene but also to separate the target sound from other interfering sounds.
  • neural networks are not known for their lightweight computation.
  • the processing may preferably be performed on-device on computationally-constrained devices like smartphones.
  • the operating system used to buffer the sound and/or run the neural network may also have I/O delays which for audio on iOS is on the order of 4 ms, depending on the buffer size.
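  • As a back-of-the-envelope check, the snippet below sums the contributions named above using example figures from this disclosure (a 10 ms buffer, the 6.56 ms model runtime, and an audio I/O delay on the order of 4 ms); the zero lookahead is an assumption for illustration.

```python
# Rough end-to-end latency budget (milliseconds).
buffer_ms = 10.0        # block/buffer duration the model waits for
lookahead_ms = 0.0      # assumed zero lookahead for a causal model
processing_ms = 6.56    # example on-device model runtime reported herein
io_ms = 4.0             # example OS audio I/O delay (order of magnitude)

end_to_end_ms = buffer_ms + lookahead_ms + processing_ms + io_ms
print(end_to_end_ms)    # ≈ 20.6 ms, near the low end of the 20-50 ms target
```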
  • the target sounds experience reverberations and multipath propagation due to reflections from walls and other objects in the environment. Further, the human head and torso reflect and obstruct sounds. As a result the target sound arrives at systems described herein, such as the microphones of headset 102 of FIG. 1, with different amplitudes and delays at the two ears. The differences in the received sounds across the two ears provide spatial awareness to humans. Thus, it may be desirable to preserve these differences and play the target sounds with different amplitudes and delays through the two speakers of the headset. Note that the target and interfering sounds can be at different positions and experience different reverberations and reflections from the HRTFs. Further, the multipath effects and reverberations may be complex in real-world environments, and the HRTFs can change across wearers.
  • the binaural target sound extraction networks may, for example, be used to implement and/or may be implemented by the neural network 124 of FIG. 1.
  • the networks generally receive binaural input sounds (e.g., data representing binaural input sounds generated by microphones).
  • the networks may be implemented using one or more computing devices (e.g., one or more smartphones, tablets, computers, servers, desktops, wearable devices), such as the computing system 112 of FIG. 1.
  • the networks may be implemented using one or more neural networks, which may include circuitry and/or executable instructions which may be executed by one or more processors for performing the neural network operations described herein.
  • the networks may output binaural sounds that may be extracted from the input binaural sounds.
  • the networks may be trained networks (e.g., data for implementing the networks may be obtained through training of the network on training data).
  • the network may be trained to output binaural sounds from one or more sound classes (e.g., semantic sound classes).
  • the output binaural sounds may be provided to one or more speakers and played for a user.
  • Examples of neural networks using an encoder-decoder network, e.g., binaural target sound extraction networks, are described with reference to FIGS. 3 through 5.
  • the described neural networks may be used to implement the neural network 124 of FIG. 1, for example.
  • FIG. 3 provides an example high-level binaural extraction framework.
  • a mask estimation network may be an encoder-decoder architecture operating on latent space representation of binaural signals to extract the mask for target sound based on the query vector q.
  • FIGS. 4 and 5 show the encoder and decoder architectures used in the mask estimation network.
  • the encoder processes the previous input context and does not consider the label embedding.
  • the decoder first conditions the encoded representation with the label embedding, l, and then generates the mask corresponding to the target sound using the conditioned representation.
  • These encoders and decoders may be implemented using computing devices described herein, such as one or more smartphones.
  • FIG. 3 is a schematic illustration of a framework for a neural network arranged in accordance with examples described herein.
  • the framework shown in FIG. 3 may be used to implement the neural network 124 of FIG. 1, for example.
  • the neural network framework of FIG. 3 includes a convolution block 302.
  • An output of convolution block 302 may be provided to the mask estimation network 304.
  • An output of the mask estimation network 304 may be combined with an output of the convolution block 302 at operator 306.
  • An output of operator 306 may be provided to transposed convolution block 308.
  • the input binaural signal may be, for example, the audio signals received at and/or generated by the microphone 108 and microphone 110 of FIG. 1.
  • the input binaural signal is first mapped to a representation in a latent space, x ∈ ℝ^(D×⌈T/L⌉), by using a 1D convolution layer with a kernel size > L and a stride equal to L.
  • D and L are tuneable hyperparameters of the model.
  • D is the dimensionality of the model, having a significant effect on the parameter count, and consequently the computational and memory complexities.
  • L determines the duration of the smallest audio chunk that can be processed with the model.
  • the latent space representation, x is then passed to a mask generator, M, which estimates an element-wise mask m, given as Equation 1 in FIG. 6.
  • In Equation 1, the length of the query vector q corresponds to the total number of sound classes the model is trained for.
  • the representation (y) corresponding to the target sound is obtained by element-wise multiplication of the input representation, x, and the mask, m, as given in Equation 2 in FIG. 6.
  • the output audio signal ŷ ∈ ℝ^(2×T) is then obtained by applying a 1D transposed convolution on y, with a stride of L.
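  • A minimal PyTorch sketch of this framework (input 1D convolution, mask estimation, element-wise masking, and transposed 1D convolution) is shown below. The hyperparameter values and the placeholder mask network are illustrative assumptions; the actual mask estimation network is the encoder-decoder described with FIGS. 4 and 5, and the description above uses a kernel size greater than L (kernel size L is used here only to keep the sketch simple).

```python
import torch
import torch.nn as nn

class BinauralExtractionFramework(nn.Module):
    """Sketch of the FIG. 3 framework; hyperparameters are illustrative."""
    def __init__(self, D=256, L=32, num_classes=20):
        super().__init__()
        # Map the 2-channel binaural input to a D-dimensional latent sequence.
        self.encoder = nn.Conv1d(2, D, kernel_size=L, stride=L)
        # Placeholder mask estimator conditioned on the query vector q.
        self.label_embed = nn.Linear(num_classes, D)
        self.mask_net = nn.Sequential(
            nn.Conv1d(D, D, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(D, D, kernel_size=1), nn.Sigmoid())
        # Map the masked latent back to a 2-channel output signal.
        self.decoder = nn.ConvTranspose1d(D, 2, kernel_size=L, stride=L)

    def forward(self, audio, q):
        x = self.encoder(audio)                 # (B, D, T/L) latent representation
        l = self.label_embed(q).unsqueeze(-1)   # (B, D, 1) label embedding
        m = self.mask_net(x * l)                # element-wise mask (cf. Eq. 1)
        y = x * m                               # masked latent (cf. Eq. 2)
        return self.decoder(y)                  # binaural target estimate

net = BinauralExtractionFramework()
audio = torch.randn(1, 2, 32 * 100)             # 100 frames of L=32 samples
q = torch.zeros(1, 20); q[0, 3] = 1.0           # one-hot target sound class
print(net(audio, q).shape)                      # torch.Size([1, 2, 3200])
```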
  • examples of neural networks described herein may jointly process the two channels of binaural signals for computational efficiency.
  • examples of this joint processing framework performed competitively with parallel processing frameworks in terms of target sound extraction accuracy, even with a 50% lower runtime cost.
  • For real-time on-device operation, the model generally should output the audio corresponding to the target sound (e.g., target signals as described herein) as soon as the input audio is received, e.g., within the latency requirements described herein. Since the audio is fed to the model from the device buffers, the buffer size generally determines and/or influences the duration of the audio chunk the model receives at each time step. Assuming the buffer size to be divisible by the stride size L, the audio chunk size can be represented as a number of strides, K. That is, the buffer size for an audio chunk of size K is equal to KL samples.
  • neural networks provided herein, such as neural network 124 of FIG. 1, are preferably causal with a time resolution of the buffer size, e.g., KL audio samples.
  • the input convolution, the mask estimation block, the element-wise multiplication, and the output transposed convolution may operate on one audio chunk at each time step.
  • the binaural target sound extraction framework described herein may be adapted to chunk-wise streaming inference in some examples as follows.
  • the input audio signal corresponding to the k-th chunk may be denoted g_k ∈ ℝ^(2×KL), as shown in FIG. 3.
  • the input 1D convolution (e.g., convolution block 302) maps this audio chunk to its latent space representation, x_k.
  • the mask estimation block (e.g., mask estimation network 304) is then used to estimate the mask corresponding to the target sound, based on the current chunk as well as a finite number of the previous chunks, as given by Equation 3 of FIG. 6.
  • the previous chunks act as the audio context for the neural network, referred to as the receptive field of the model.
  • An example receptive field on the order of 1 s is shown to result in good performance.
  • the output representation of the current chunk corresponding to the target sound, y_k, can then be obtained at the operator 306 as given by Equation 4 of FIG. 6.
  • the resulting output representation is then converted to the output signal ŷ_k ∈ ℝ^(2×KL) by applying the 1D transposed convolution in transposed convolution block 308.
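  • The chunk-wise streaming loop can be sketched as below. The pass-through placeholder stands in for a trained extraction network (e.g., the framework sketch above), and the context handling shown (re-running on a sliding window of recent chunks) is a simple assumed stand-in for the cached-context scheme described with the encoder below.

```python
from collections import deque
import torch

# Illustrative chunk-wise streaming loop: the mask is estimated from the current
# chunk plus a finite number of previous chunks (the receptive field), and only
# the current chunk's worth of output is played back.
L, K, CONTEXT_CHUNKS = 32, 13, 8        # stride, strides per chunk, context (assumed)
chunk_samples = K * L                   # buffer size in samples per channel

model = lambda audio, q: audio          # placeholder; a trained network would go here
q = torch.zeros(1, 20); q[0, 3] = 1.0   # one-hot target sound class

history = deque(maxlen=CONTEXT_CHUNKS + 1)   # previous chunks + current chunk
for step in range(5):                        # e.g., chunks arriving from the mic buffer
    current = torch.randn(1, 2, chunk_samples)
    history.append(current)
    window = torch.cat(list(history), dim=-1)     # cf. Eq. 3: current + previous chunks
    with torch.no_grad():
        out = model(window, q)
    playback = out[..., -chunk_samples:]          # cf. Eq. 4: output for current chunk
    print(step, playback.shape)
```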
  • Examples of architectures that may be used for mask estimation include Conv-TasNet, U-Net, SepFormer, ReSepFormer, and Waveformer.
  • Waveformer is an efficient streaming architecture implementing chunk-based processing, which may make it advantageous in examples described herein. Examples of Waveformer are described in Bandhav Veluri, Justin Chan, Malek Itani, Tuochao Chen, Takuya Yoshioka, and Shyamnath Gollakota, “Real-time target sound extraction,” in IEEE ICASSP, 2023, arXiv:2211.02250v3 [cs.SD], which publication is hereby incorporated by reference in its entirety for any purpose.
  • the mask estimation network (e.g., mask estimation network 304) may be implemented as an encoder-decoder neural network architecture, where the encoder may be purely convolution-based and the decoder is a transformer decoder.
  • the same dimensionality may be used for both the encoder and the decoder. This may allow for use of a standard transformer decoder, instead of a modified one as may be used in the Waveformer. For binaural applications described herein, using different dimensionalities may not provide gains that warrant the complexity of the projection layers and the long residual connection.
  • FIG. 4 is a schematic illustration of an example encoder for use in neural networks described herein.
  • the encoder of FIG. 4 may be used, for example, to implement an encoder of neural network 124 of FIG. 1.
  • Recall that mask estimation in Equation 3 involves processing many previous chunks in addition to the current chunk to obtain the mask corresponding to the current chunk. Repeated processing of the entire receptive field at each iteration could become intractable for a real-time on-device application.
  • examples of mask estimation networks described herein may implement WaveNet-style dilated causal convolutions for processing the input and previous chunks.
  • Examples of WaveNet are described in Aäron van den Oord et al., “WaveNet: A generative model for raw audio,” 2016, arXiv:1609.03499v2 [cs.SD], which publication is hereby incorporated by reference in its entirety for any purpose.
  • the dynamic programming algorithm proposed in Fast Wavenet may be implemented. Examples of Fast Wavenet are described in Tom Le Paine, Pooya Khorrami, Shiyu Chang, Yang Zhang, Prajit Ramachandran, Mark A. Hasegawa-Johnson, and Thomas S. Huang, “Fast wavenet generation algorithm,” 2016, arXiv:1611.09482v1 [cs.SD], which publication is hereby incorporated by reference in its entirety for any purpose.
  • the encoder function processes the input chunk and an encoder context to generate the encoded representation of the input chunk as given by Equation 5 of FIG. 6.
  • the size of the context depends on the hyperparameters of the encoder.
  • the encoder may include a stack of 10 dilated causal convolution layers.
  • the kernel size of all layers is equal to 3, and the dilation factor is progressively doubled after each layer starting with 1, resulting in dilation factors {2^0, 2^1, ..., 2^9}. Since the kernel size is equal to 3, the context needed for each dilated convolution layer is twice the layer's dilation factor. As long as this context is saved after each iteration, and padded with the input chunk in the next iteration, the intermediate results corresponding to the previous chunks do not have to be recomputed. Thus, the total size of the saved context is equal to 2046.
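  • The context size follows directly from the dilation schedule: with kernel size 3, each layer needs a left context of twice its dilation, so the total cached context is 2·(2^0 + ... + 2^9) = 2046 latent frames. The sketch below is an assumed minimal cached-context dilated causal convolution stack illustrating the Fast Wavenet-style caching idea, not the exact encoder of FIG. 4.

```python
import torch
import torch.nn as nn

DILATIONS = [2 ** i for i in range(10)]      # 1, 2, 4, ..., 512
KERNEL = 3
CONTEXT = sum(2 * d for d in DILATIONS)      # 2 * 1023 = 2046
print(CONTEXT)

class CachedDilatedEncoder(nn.Module):
    """Assumed sketch: dilated causal convolutions with a cached left context."""
    def __init__(self, dim=256):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Conv1d(dim, dim, KERNEL, dilation=d) for d in DILATIONS])

    def forward(self, x, caches):
        # x: (B, dim, K) current chunk; caches[i]: (B, dim, 2*d_i) saved left context.
        new_caches = []
        for conv, d, cache in zip(self.layers, DILATIONS, caches):
            padded = torch.cat([cache, x], dim=-1)    # prepend saved context
            new_caches.append(padded[..., -2 * d:])   # save context for next chunk
            x = torch.relu(conv(padded))              # causal: no future samples used
        return x, new_caches

enc = CachedDilatedEncoder()
caches = [torch.zeros(1, 256, 2 * d) for d in DILATIONS]
chunk = torch.randn(1, 256, 13)                        # K = 13 latent frames per chunk
out, caches = enc(chunk, caches)
print(out.shape)                                       # torch.Size([1, 256, 13])
```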
  • FIG. 5 is a schematic illustration of an example decoder for use in neural networks described herein.
  • the decoder of FIG. 5 may be used, for example, to implement a decoder of neural network 124 of FIG. 1.
  • the decoder of FIG. 5 includes a combination of the encoded data from an encoder (e.g., from the encoder of FIG. 4) with an embedding, l.
  • the combination is then provided to multiple multi-head attention layers, each followed by an add and normalization block.
  • a feed-forward block is provided followed by a final add and normalization block to provide an output of the decoder.
  • the embedding l may correspond with one or more sound classes described herein and/or other indication of target signals.
  • the query vector q is first embedded into the embedding space using a linear layer to generate a label embedding l.
  • the mask corresponding to the target sound, m_k, is estimated using a transformer decoder layer.
  • the encoded representation is first conditioned with the label embedding l by an element-wise multiplication.
  • the encoded representation and the conditioned encoded representation are first concatenated in the time dimension with those from the previous time step, before processing with the transformer decoder layer.
  • the encoded representation from the previous time step, e_{k-1}, acts as the decoder context.
  • the mask estimation can be written as Equation 6 in FIG. 6, where ⊕ represents concatenation in the time dimension.
  • As shown in FIG. 5, the transformer decoder first computes the self-attention result of the conditioned encoded representation using the first multi-head attention block, followed by cross-attention between the self-attention result and the unconditioned encoded representation {e_{k-1}, e_k} using the second multi-head attention block.
  • a feed-forward block along with residual connection generates the final mask corresponding to the target signals.
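  • A minimal sketch of this decoder step is shown below: the query is embedded with a linear layer, the encoded chunk is conditioned by element-wise multiplication with the label embedding, and a standard transformer decoder layer performs self-attention on the conditioned sequence and cross-attention against the unconditioned sequence (each concatenated with the previous chunk as context). The dimensions and the sigmoid mask head are illustrative assumptions.

```python
import torch
import torch.nn as nn

D, N_CLASSES, K = 256, 20, 13

embed = nn.Linear(N_CLASSES, D)                      # query -> label embedding l
dec_layer = nn.TransformerDecoderLayer(d_model=D, nhead=8, batch_first=True)
mask_head = nn.Sequential(nn.Linear(D, D), nn.Sigmoid())   # assumed mask output head

q = torch.zeros(1, N_CLASSES); q[0, 3] = 1.0         # one-hot query vector
e_prev = torch.randn(1, K, D)                        # encoded previous chunk e_{k-1}
e_curr = torch.randn(1, K, D)                        # encoded current chunk e_k

l = embed(q).unsqueeze(1)                            # (1, 1, D) label embedding
memory = torch.cat([e_prev, e_curr], dim=1)          # unconditioned context {e_{k-1}, e_k}
cond = memory * l                                    # conditioned representation

# Self-attention over the conditioned sequence, cross-attention to the
# unconditioned sequence, then a feed-forward block (all inside the layer).
h = dec_layer(tgt=cond, memory=memory)
m_k = mask_head(h[:, -K:, :])                        # mask for the current chunk only
print(m_k.shape)                                     # torch.Size([1, 13, 256])
```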
  • Examples of neural networks described herein may be trained to extract target signals based on an indication of audio sources.
  • neural networks may be trained to extract target signals based on an indication of sound class. Examples of audio class dataset curation are described and then training methodologies are described to generalize to real-world scenarios.
  • Example systems should preferably handle target sounds encountered in real-world situations efficiently. By focusing on practical applications, a manageable set of target sound classes may be identified for extraction. However, in reality, a wide range of background sounds may be present, many of which may not be part of the system's target sound classes.
  • an ontology (e.g., an AudioSet ontology) may be used to organize the sound classes.
  • the ontology arranges the sound classes as nodes in a graph and groups them into seven main sound categories.
  • Each sound class node has a unique AudioSet ID and may contain one or more child nodes that represent more specific sound classes. For example, the “Hands” sound class has two children, namely “Finger Snapping” and “Clapping.” Examples of target sound classes are described as well as the interfering classes.
  • a diverse set of interfering sound classes may be preferable in a dataset.
  • the target sound classes may be interfering with each other. Note that these sounds can come from a very large variety of sources, which may make it infeasible to exhaustively enumerate all of them.
  • it may be preferable to ensure that these sound classes do not overlap with a set of target classes.
  • examples may use the AudioSet hierarchical structure and a set of target classes (e.g., 20 target classes) to generate a large set of other sound classes (e.g., 141 other sound classes).
  • this set may be defined as the nodes that are neither a more specific nor a more general instance of any target (or known) class, according to the AudioSet hierarchy. Accordingly, by considering the AudioSet ontology as a directed acyclic graph with edges from each sound class node towards its child nodes, unknown sound classes may be defined as the set of AudioSet nodes that are disconnected from all target sound class nodes.
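  • In graph terms, a class is treated as "unknown" when it is neither an ancestor nor a descendant of any target class. The small sketch below illustrates this over a toy ontology; the node names and edges are made up for illustration and are not the actual AudioSet data.

```python
# Toy ontology sketch: edges point from a sound class to its more-specific children.
children = {
    "Hands": ["Finger Snapping", "Clapping"],
    "Animal": ["Dog", "Bird"],
    "Bird": ["Birds chirping"],
    "Vehicle": ["Car horn"],
}

def related(node):
    """All ancestors and descendants of a node in the directed acyclic graph."""
    parents = {c: p for p, cs in children.items() for c in cs}
    up, cur = set(), node
    while cur in parents:
        cur = parents[cur]
        up.add(cur)
    down, stack = set(), [node]
    while stack:
        for c in children.get(stack.pop(), []):
            down.add(c)
            stack.append(c)
    return up | down

all_nodes = set(children) | {c for cs in children.values() for c in cs}
targets = {"Birds chirping", "Car horn"}
unknown = {n for n in all_nodes
           if n not in targets and not (related(n) & targets)}
print(sorted(unknown))   # classes disconnected from every target class
```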
  • labeled audio recordings may be obtained for each of the sound classes.
  • For FSD50K and MUSDB18, additional dataset-specific pre-processing procedures may be performed. For example, to create binaural mixtures of individual sources from multiple directions, audio samples may be excluded from FSD50K that were already mixtures of multiple distinct sound sources. For MUSDB18, examples may extract and split audio into vocal and instrumental streams and assign them the AudioSet labels “Singing” and “Melody” respectively.
  • a set of sound classes may be used which includes: alarm clock, baby cry, birds chirping, car horn, cat, rooster crow, typing, cricket, dog, door knock, glass breaking, gunshot, hammer, music, ocean, singing, siren, speech, thunderstorm, toilet flush.
  • Other classes may be used in other examples.
  • the resulting audio samples may be divided into segments and silent ones discarded.
  • Each dataset may be split into mutually exclusive training, testing, and validation sets and then combined into a final dataset.
  • the training and validation audio files were sampled from the development split (90-10 split), and the testing samples from the evaluation split.
  • For the ESC-50 dataset, the first three folds were used for training, the fourth fold for validation, and the fifth for testing.
  • the audio samples for each sound class were split into train, test and validation sets (60-33-7) before combining with the rest of the datasets.
  • the final combined dataset includes 20 target sound classes and 141 other sound classes.
  • binaural data sets may be created and/or used to train neural networks described herein. It may be preferable to create binaural mixtures that (1) are representative of spatial sounds perceived by a diverse set of listeners, and (2) capture the idiosyncrasies of real-world reverberant environments.
  • examples described herein may use a dataset including human head-related transfer function (HRTF) measurements. In one example, a pre-existing dataset of 43 human HRTF measurements was used. This may be augmented with additional (e.g., three) datasets of measured and simulated reverberant binaural room impulse responses (BRIRs).
  • Each dataset may be split across rooms and listeners into train, test and validation (70-20-10) sets. BRIR subjects or rooms may not be sampled across different sets. For each sample during training, one of the datasets may be randomly chosen and a single room and participant sampled from its training set. Then, to create a binaural mixture with K sources, a source direction was independently selected for each of the K sources, out of all source directions available for this room and this participant in the dataset. Note that since the source directions are independently picked, two different sound sources might end up being at the same direction from the wearer. A set of 2K room impulse responses (a left and a right response for each of the K sources, each of length N, where N is the length of the room impulse response) may be obtained.
  • a toolkit such as the Scaper toolkit, examples of which are described at Justin Salamon, Duncan MacConnell, Mark Cartwright, Peter Li, and Juan Pablo Bello, Scaper: A library for soundscape synthesis and augmentation, in WASPAA, 2017, may be used to synthesize binaural mixtures dynamically on the fly during training.
  • binaural mixtures may include two randomly picked target classes, each with a 5-15 dB SNR relative to the background sounds, and 1-2 other classes that each have a 0-5 dB SNR relative to the background sounds. Background sounds sourced from the TAU Urban Acoustic Scenes 2019 dataset were also used in mixtures described herein (a sketch of this mixture synthesis appears after this list).
  • a network was then trained to produce a pair of left and right channel target sound estimates V_L and V_R.
  • the sample-sensitive and scale-sensitive signal-to-noise ratio (SNR) loss function was applied independently to the left and right channels, and the left and right SNRs were then averaged to obtain the loss function as shown in Equations 7 and 8 in FIG. 6 (a sketch of this loss appears after this list).
  • a transformer model was trained for 80 epochs, with an initial learning rate of 5e-4. After completing 40 epochs, the learning rate was halved if there was no improvement in the validation SNR for more than five epochs. Note that in some examples the training data do not include any measurements with example binaural hardware described herein, and the results reported may accordingly be used to evaluate generalization to example hardware, unseen users and environments.
  • a variety of binaural target signal extraction frameworks may be used.
  • a dual-channel architecture may be used for efficient binaural target sound extraction.
  • the binaural signal is converted into a combined latent space representation before the mask estimation. Since both left and right channels are combined into a common representation, a single instance of the mask estimation network is used for estimating the mask corresponding to the target sound (a sketch of this architecture appears after this list).
  • a parallel framework may be used that implements parallel processing of the left and right channels, along with some cross-communication between the channels.
  • This framework may be implemented for both a mask estimation network with D = 128 and Conv-TasNet, for example. Other networks may be used in other examples.
  • An implemented example arranged in accordance with techniques described herein includes an off-the-shelf noise-cancelling headset augmented with commercial wired binaural earphones that provide access to data from both microphones.
  • An example neural network was implemented on a connected smartphone and trained with 20 different sound classes, including sirens, baby cries, speech, vacuum cleaners, alarm clocks, and bird chirps.
  • Example results included an average signal improvement of 7.17 dB across 20 target sounds, in the presence of interfering sounds and urban background noise.
  • An example implemented real-time network had a 6.56 ms runtime on an iPhone 11 for processing a 10 ms chunk of binaural audio.
  • An example hardware setup includes a pair of SonicPresence SP15C binaural microphones that are wired to capture high-quality recordings.
  • An iPhone 12 was used to process the recorded data and output the audio through noise-cancelling headphones such as the JBL Live 650BTNC and NUBWO gaming headsets.
  • a lightning-to-aux adapter was used to connect the headphones to the iPhone over a wire.
  • a USB hub was used to connect both the microphones and the headphones to the smartphone.
  • an in-the-wild evaluation captured both mobile wearers as well as mobile sound sources that naturally occurred in real-world scenarios (e.g., cars moving or birds flying).
  • clean, sample-aligned ground truth signals may not be available against which to objectively compare the binaural outputs of the system.
  • a listening study was conducted to compute a mean opinion score (MOS) regarding the sound extraction accuracy. This metric may be used to evaluate the perceptual quality of algorithms described herein for end users.
  • the audio samples played at each section were in-the-wild recordings processed in the following three ways for the same target label: (1) the original recording, (2) the output of a 128-dimensional binaural network arranged in accordance with examples described herein, and (3) the output of a 256-dimensional binaural network arranged in accordance with examples described herein.
  • an additional fourth audio sample was also included that was obtained by extracting the interfering class (e.g., door knocks) and then subtracting it from the input recording to estimate the target speech.
  • results of user evaluations for the interference sound suppression and overall quality improvement of an example system for different target sound labels were obtained.
  • the results demonstrated the system’s capability to significantly reduce unwanted background sounds, as indicated by an increase in the overall noise suppression score.
  • a similar trend was also observed in the overall MOS improvement, with an improvement from 2.63 for the input signal to 3.54 and 3.80 after processing with the 128-dimensional and 256-dimensional models, respectively.
  • Results also showed that examples of networks described herein preserve the timing of the target sounds and can silence noise outside the target sound duration.
  • Examples described herein may refer to various components as “coupled” or signals as being “provided to” or “received from” certain components. It is to be understood that in some examples the components are directly coupled one to another, while in other examples the components are coupled with intervening components disposed between them. Similarly, signals may be provided directly to and/or received directly from the recited components without intervening components, or may be provided to and/or received from those components through intervening components.
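
The following sketch illustrates how the set of "other" (unknown) sound classes could be derived from an ontology graph as described above: classes that are disconnected from every target class, i.e., neither ancestors nor descendants of any target. The toy graph, labels, and helper names below are illustrative placeholders, not the actual AudioSet ontology.

```python
# Sketch: deriving "other" (unknown) sound classes from an ontology DAG.
# Assumptions: the ontology is given as a dict mapping each class to its
# child (more specific) classes; labels are placeholders, not AudioSet IDs.
from collections import deque

def reachable(graph, start):
    """Return every node reachable from `start` by following edges."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def other_classes(children, all_classes, target_classes):
    """Classes that are neither more specific nor more general than any target."""
    # Build the reversed graph (child -> parents) to find ancestors.
    parents = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, []).append(parent)
    excluded = set(target_classes)
    for target in target_classes:
        excluded |= reachable(children, target)  # descendants (more specific)
        excluded |= reachable(parents, target)   # ancestors (more general)
    return all_classes - excluded

children = {"Animal": ["Dog", "Cat"], "Hands": ["Clapping", "Finger snapping"]}
all_classes = {"Animal", "Dog", "Cat", "Hands", "Clapping", "Finger snapping", "Wind"}
print(other_classes(children, all_classes, target_classes={"Dog"}))
# -> {'Cat', 'Hands', 'Clapping', 'Finger snapping', 'Wind'} (set ordering varies)
```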
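As a rough sketch of the binaural mixture synthesis described above, the snippet below convolves each mono source with a left/right impulse response pair and scales it to a target SNR relative to the background before summing. The placeholder signals, sample rate, impulse-response lengths, and helper names are assumptions for illustration; the actual pipeline (e.g., the Scaper-based synthesis) may differ.

```python
# Sketch: synthesizing one binaural training mixture (assumed shapes and ranges).
import numpy as np

def spatialize(source, brir_left, brir_right):
    """Convolve a mono source with a left/right BRIR pair -> (2, T) signal."""
    left = np.convolve(source, brir_left)[: len(source)]
    right = np.convolve(source, brir_right)[: len(source)]
    return np.stack([left, right])

def scale_to_snr(signal, reference, snr_db):
    """Scale `signal` so its power sits `snr_db` dB above `reference`'s power."""
    p_sig = np.mean(signal ** 2) + 1e-12
    p_ref = np.mean(reference ** 2) + 1e-12
    return signal * np.sqrt(p_ref / p_sig * 10 ** (snr_db / 10))

rng = np.random.default_rng(0)
fs, dur = 16000, 5                                       # illustrative sample rate and length
background = rng.standard_normal((2, fs * dur)) * 0.01   # placeholder binaural background
mixture = background.copy()

# Two target sources at 5-15 dB SNR, one or two interferers at 0-5 dB SNR.
for snr_range, n_sources in [((5, 15), 2), ((0, 5), int(rng.integers(1, 3)))]:
    for _ in range(n_sources):
        src = rng.standard_normal(fs * dur)              # placeholder mono source
        brir_l = rng.standard_normal(2048) * 0.01        # placeholder left/right BRIRs
        brir_r = rng.standard_normal(2048) * 0.01
        binaural = spatialize(src, brir_l, brir_r)
        mixture += scale_to_snr(binaural, background, rng.uniform(*snr_range))
```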
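A minimal sketch of the channel-averaged SNR loss referenced above might look as follows; the exact formulation in Equations 7 and 8 of FIG. 6 may differ in detail (e.g., epsilon handling or clamping), so the PyTorch function below is an assumed, illustrative version.

```python
# Sketch: scale-sensitive SNR loss averaged over the left and right channels.
import torch

def snr_db(target, estimate, eps=1e-8):
    """10*log10(||target||^2 / ||target - estimate||^2), computed per channel."""
    signal_power = torch.sum(target ** 2, dim=-1)
    error_power = torch.sum((target - estimate) ** 2, dim=-1) + eps
    return 10.0 * torch.log10(signal_power / error_power + eps)

def binaural_snr_loss(target_lr, estimate_lr):
    """Negative mean of the left- and right-channel SNRs; inputs are (batch, 2, T)."""
    snr_left = snr_db(target_lr[:, 0], estimate_lr[:, 0])
    snr_right = snr_db(target_lr[:, 1], estimate_lr[:, 1])
    return -0.5 * (snr_left + snr_right).mean()
```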
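The following sketch shows, at a high level, one way the combined-latent-space design discussed above could be organized: a shared encoder maps the stacked left/right channels into a joint latent representation, a single mask estimation network produces one mask for the target sound class, and a decoder reconstructs the two output channels. The layer choices, dimensions, and conditioning scheme here are assumptions for illustration, not the specific networks evaluated in this document.

```python
# Sketch: dual-channel target sound extraction with a shared latent space.
import torch
import torch.nn as nn

class DualChannelExtractor(nn.Module):
    def __init__(self, latent_dim=128, num_classes=20, kernel=16, stride=8):
        super().__init__()
        # Encode both channels jointly into one latent representation.
        self.encoder = nn.Conv1d(2, latent_dim, kernel, stride=stride)
        # Condition the mask estimator on a one-hot target class embedding.
        self.class_embed = nn.Linear(num_classes, latent_dim)
        self.mask_net = nn.Sequential(
            nn.Conv1d(latent_dim, latent_dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(latent_dim, latent_dim, 3, padding=1), nn.Sigmoid(),
        )
        # Decode the masked latent back to two output channels (left and right).
        self.decoder = nn.ConvTranspose1d(latent_dim, 2, kernel, stride=stride)

    def forward(self, binaural, target_onehot):
        latent = self.encoder(binaural)                       # (B, D, T')
        cond = self.class_embed(target_onehot).unsqueeze(-1)  # (B, D, 1)
        mask = self.mask_net(latent + cond)                   # single mask for the target
        return self.decoder(latent * mask)                    # (B, 2, T)

model = DualChannelExtractor()
x = torch.randn(1, 2, 16000)                  # 1 s of placeholder binaural audio
onehot = torch.zeros(1, 20); onehot[0, 3] = 1.0
out = model(x, onehot)                        # binaural target sound estimate
```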

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

Example systems and methods described herein may receive audio signals from at least one microphone. Example systems and methods may generate noise cancellation signals based, at least in part, on the audio signals, and extract target signals based, at least in part, on the audio signals using digital neural network processing. Example systems and methods may provide the noise cancellation signals and the target signals from at least one speaker.
PCT/US2024/033033 2023-06-09 2024-06-07 Systems and methods for target signal extraction and noise cancellation Pending WO2024254467A2 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202363507360P 2023-06-09 2023-06-09
US63/507,360 2023-06-09
US202363593933P 2023-10-27 2023-10-27
US63/593,933 2023-10-27

Publications (2)

Publication Number Publication Date
WO2024254467A2 true WO2024254467A2 (fr) 2024-12-12
WO2024254467A3 WO2024254467A3 (fr) 2025-01-23

Family

ID=93794622

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2024/033033 Pending WO2024254467A2 (fr) 2023-06-09 2024-06-07 Systèmes et procédés d'extraction de signal cible et d'annulation de bruit

Country Status (1)

Country Link
WO (1) WO2024254467A2 (fr)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9190043B2 (en) * 2013-08-27 2015-11-17 Bose Corporation Assisting conversation in noisy environments
US9998847B2 (en) * 2016-11-17 2018-06-12 Glen A. Norris Localizing binaural sound to objects

Also Published As

Publication number Publication date
WO2024254467A3 (fr) 2025-01-23

Similar Documents

Publication Publication Date Title
Veluri et al. Semantic hearing: Programming acoustic scenes with binaural hearables
CN114556972B (zh) System and method for assisting selective hearing
Leng et al. Binauralgrad: A two-stage conditional diffusion probabilistic model for binaural audio synthesis
EP3811625B1 (fr) Data-driven audio enhancement
CN115035907B (zh) Target speaker separation system, device, and storage medium
CN109644314B (zh) Method of rendering a sound program, audio playback system, and article of manufacture
Chatterjee et al. ClearBuds: wireless binaural earbuds for learning-based speech enhancement
US20250008287A1 (en) Three-dimensional audio systems
Gupta et al. Augmented/mixed reality audio for hearables: Sensing, control, and rendering
Liu et al. Multichannel speech enhancement by raw waveform-mapping using fully convolutional networks
US10937443B2 (en) Data driven radio enhancement
US20230164509A1 (en) System and method for headphone equalization and room adjustment for binaural playback in augmented reality
Veluri et al. Look once to hear: Target speech hearing with noisy examples
Richard et al. Audio signal processing in the 21st century: The important outcomes of the past 25 years
Drossos et al. Investigating the impact of sound angular position on the listener affective state
Corey Microphone array processing for augmented listening
CN119789005A (zh) Audio processing method and apparatus, and earphone
JPWO2022023417A5 (fr)
El-Mohandes et al. DeepBSL: 3-D personalized deep binaural sound localization on earable devices
WO2024254467A2 (fr) Systems and methods for target signal extraction and noise cancellation
CN115705839A (zh) Voice playback method and apparatus, computer device, and storage medium
WO2023272575A1 (fr) System and method for using a deep neural network to generate highly intelligible binaural speech signals from a single input
WO2025090963A1 (fr) Target audio source signal generation including enrollment examples and preserving directionality
Zippert Towards Generalized Speech Separation For Hearing Aids: Deep Learning Approach For Combined Music and Speech
Ick Virtual Soundscapes for Machine Listening