WO2024228650A1 - Sound classification in noisy environments - Google Patents
Sound classification in noisy environments
- Publication number
- WO2024228650A1 (application PCT/SE2023/050435)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- wearable device
- user
- sound source
- sound
- sensor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/20—Arrangements for obtaining desired frequency or directional characteristics
- H04R1/32—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
- H04R1/40—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
- H04R1/406—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers microphones
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R25/00—Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception
- H04R25/40—Arrangements for obtaining a desired directivity characteristic
- H04R25/405—Arrangements for obtaining a desired directivity characteristic by combining a plurality of transducers
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/302—Electronic adaptation of stereophonic sound system to listener position or orientation
- H04S7/303—Tracking of listener position or orientation
- H04S7/304—For headphones
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10K—SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
- G10K11/00—Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
- G10K11/18—Methods or devices for transmitting, conducting or directing sound
- G10K11/26—Sound-focusing or directing, e.g. scanning
- G10K11/34—Sound-focusing or directing, e.g. scanning using electrical steering of transducer arrays, e.g. beam steering
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/10—Earpieces; Attachments therefor ; Earphones; Monophonic headphones
- H04R1/1083—Reduction of ambient noise
Definitions
- Embodiments presented herein relate to a device, method, computer program, computer program product and an apparatus for sound enhancement in the vicinity of a user by means of a technical device worn by the user.
- API application programming interface
- automatic speech recognition software is not able to manage the problem of the so-called cocktail party effect, which refers to the ability for the human brain to hear selectively, i.e., to pay attention to a single auditory stimulus, e.g., a conversation, while filtering out a combination of diffuse background noise, music, and multiple simultaneous conversations.
- This is also an issue for hearing aids, since the human ability to perform such filtering is diminished when the received sound signal is amplified. Since the positioning of the hearing aids relative to each other is approximately known (i.e., each hearing aid is placed in or on a human ear), it is possible to beamform in the forward direction of the gaze of the user and thus improve the signal-to-noise ratio (SNR) of the desired signal.
- Embodiments presented herein relate to a device, method, computer program, computer program product and an apparatus for sound enhancement in the vicinity of a user by means of a technical device worn by the user. It should be appreciated that these embodiments can be implemented in numerous ways. Several of these embodiments are described below.
- a wearable device arranged to be worn by a user.
- the wearable device comprises a processing circuitry or the wearable device is operatively connectible to a cloud, the wearable device being adapted to acquire, from at least one microphone operatively connected to the wearable device, at least one sound signal indicative of a sound environment in the vicinity of the user.
- the processing circuitry or the cloud is adapted to acquire, from at least one sensor operatively connected to the wearable device, a visual sensor reading indicative of a visual environment in the vicinity of the user and/or a movement sensor reading indicative of a movement in the vicinity of the user.
- the processing circuitry or the cloud is adapted to identify, from one or more of: the at least one sound signal, the visual sensor reading and the movement sensor reading, at least one sound source in the vicinity of the user.
- the processing circuitry or the cloud is adapted to classify the at least one sound source as a wanted sound source or an unwanted sound source from acquired input data, the acquired input data being based on one or more of: the at least one sound signal, the visual sensor reading and the movement sensor reading.
- a method for preferred-speaker sound enhancement in the vicinity of a user wearing a wearable device comprises acquiring, from at least one microphone operatively connected to the wearable device, at least one sound signal indicative of a sound environment in the vicinity of the user.
- the method comprises acquiring, from at least one sensor operatively connected to the wearable device, a visual sensor reading indicative of a visual environment in the vicinity of the user and/or a movement sensor reading indicative of a movement in the vicinity of the user.
- the method comprises identifying, from one or more of: the at least one sound signal, the visual sensor reading and the movement sensor reading, at least one sound source in the vicinity of the user.
- the method comprises classifying the at least one sound source as a wanted sound source or an unwanted sound source from acquired input data, the acquired input data being based on one or more of: the at least one sound signal, the visual sensor reading and the movement sensor reading.
- an apparatus configured to perform the method according to the second aspect.
- a computer program comprising instructions, which when executed by processing circuitry or a cloud, carries out the method according to the second aspect.
- a computer program product comprising a non-transitory storage medium including program code to be executed by a processing circuitry of a wearable device or a cloud operatively connected to the wearable device, whereby execution of the program code causes the wearable device to perform operations comprising acquiring, from at least one microphone operatively connected to the wearable device, at least one sound signal indicative of a sound environment in the vicinity of a user as well as acquiring, from at least one sensor operatively connected to the wearable device, a visual sensor reading indicative of a visual environment in the vicinity of the user and/or a movement sensor reading indicative of a movement in the vicinity of the user.
- the operations comprise identifying, from one or more of: the at least one sound signal, the visual sensor reading and the movement sensor reading, at least one sound source in the vicinity of the user.
- the operations comprise classifying the at least one sound source as a wanted sound source or an unwanted sound source from acquired input data, the acquired input data being based on one or more of: the at least one sound signal, the visual sensor reading and the movement sensor reading.
- Fig. 1 shows a user wearing a wearable device for preferred-speaker sound enhancement in the vicinity of the user by means of a technical device worn by the user.
- Fig. 2 shows functional units of the method for sound enhancement in the vicinity of a user by means of a technical device worn by the user, according to an embodiment of the disclosure.
- Fig. 3 shows a computer program product and a computer program, according to an embodiment of the disclosure.
- the current disclosure concerns a wearable device 100, method 110 and computer program product for speech/sound enhancement, e.g., for facilitating sufficient speech/sound loudness or sound pressure level (SPL) for a preferred speaker or sound source SoS in order to improve, i.a., speech intelligibility in a noisy environment.
- the user's intention to listen to a specific speaker or a specific source of sound cannot be easily exploited for beamforming as the relative orientation between beamforming direction and gaze is unknown.
- the beamforming commonly used for directivity in hearing aids has a broad beam width and is not adaptive to the intent of the wearer.
- Maps with locations of sound sources can be created using simultaneous localization and mapping (SLAM) and microphone sensors.
- SLAM simultaneous localization and mapping
- the current state-of-the-art methods can only naively make use of them for signal enhancement and do not exploit the possibility of using technology such as AR glasses as a user interface for classification.
- the aim of embodiments presented herein is to improve preferred-speaker sound enhancement in the vicinity of a user by means of a technical device worn by the user.
- Fig. 1 illustrates a user who is wearing a wearable device 100 for sound enhancement in the vicinity of the user, in accordance with embodiments of the invention.
- the wearable device 100 comprises a processing circuitry or the wearable device 100 is operatively connectible to a cloud, the wearable device 100 being adapted to acquire a sound signal S indicative of a sound environment in the vicinity of the user.
- the sound signal S is acquired from at least one microphone 200, which is operatively connected to the wearable device 100.
- the processing circuitry or the cloud is further adapted to acquire a visual sensor reading V indicative of a visual environment in the vicinity of the user and/or a movement sensor reading M indicative of a movement in the vicinity of the user.
- the visual sensor reading V and/or the movement sensor reading M is acquired from at least one sensor 300 which is operatively connected to the wearable device 100.
- the processing circuitry or the cloud is further adapted to identify from one or more of: the at least one sound signal S, the visual sensor reading V and the movement sensor reading M, at least one sound source SoS in the vicinity of the user.
- the processing circuitry or the cloud is adapted to classify the at least one sound source SoS as a wanted sound source WSoS or an unwanted sound source USoS from acquired input data, the acquired input data being based on one or more of: the at least one sound signal S, the visual sensor reading V and the movement sensor reading M.
- the user may be a pedestrian, a person (i.e., a human being), a robot or an animal, e.g., a monkey or a dog.
- the wearable device 100 may comprise one of a helmet, a hat, a headset, earphones, augmented reality glasses, and virtual reality glasses.
- the vicinity of the user is defined as the area in front of the user (or the area in every direction around the user) within the range of 0-10 meters, preferably within the range of 0-5 meters and even more preferably within the range of 0-2 meters.
- the at least one sound signal S may be produced by a speaker, i.e., a person (i.e., a human being), a robot or an animal, e.g., a monkey or parrot, by sirens from emergency vehicles, music, traffic, home appliances, running water, etc.
- the at least one sensor 300 may be any of a camera (e.g., an event camera or a stereo camera), a 3D sensor, a contour sensor, an accelerometer, a gyroscope, an eye tracker, a passive infrared sensor, an ultrasonic sensor, a microwave sensor, an RGB-D sensor, a radar, a Wi-Fi modem, a 5G modem, a sonar, a lidar, a compass, or a tomographic sensor.
- a gyroscope is a device used for measuring orientation and angular velocity.
- RGB-D sensor is a specific type of depth-sensing device that works in association with an RGB (red, green and blue color) sensor camera.
- the movement sensor reading M could relate to hand movements, head movements, eye movements or other so-called internal movements by the user, or mouth movements of a speaker, thus indicating that the speaker is speaking, or other external movements indicating that a speaker or other sound source SoS is approaching or moving away from the user.
- according to embodiments of the invention, the wearable device 100 can also be realized without a sensor 300.
Acquired input data and classification
- the processing circuitry or the cloud is adapted to classify the at least one sound source SoS as a wanted sound source WSoS or an unwanted sound source USoS from the acquired input data, either automatically (e.g., from machine learning algorithms) or manually by input (e.g., gestures) from the user.
- the acquired input data is based on one or more of: the at least one sound signal S, the visual sensor reading V and the movement sensor reading M. In some embodiments, the acquired input data can be based on the at least one sound signal S only. In other embodiments, the acquired input data can be based on two or more of: the at least one sound signal S, the visual sensor reading V and the movement sensor reading M.
- the acquired input data may be at least one of: input classification data from a classifying machine learning algorithm (i.e., automatic classification), and a manual classification signal produced by the user.
- the input classification data from a classifying machine learning algorithm is based on one or more of: the at least one sound signal S, the visual sensor reading V and the movement sensor reading M.
- the speech identified from the sound signal S can be further analyzed to match two separate speakers as being highly likely to be in the same conversation based on the content of their speech, pauses in their conversations when other conversational partners are speaking, as well as shared laughs.
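- A minimal sketch of such conversation matching, assuming per-speaker voice-activity masks have already been derived from the at least one sound signal S (the voice-activity detector itself is not shown and the scoring heuristic is illustrative):
```python
import numpy as np

def same_conversation_score(vad_a: np.ndarray, vad_b: np.ndarray) -> float:
    """Heuristic score in [0, 1]: high when two speakers rarely talk over each
    other but together cover much of the time axis, which is typical of
    turn-taking within the same conversation.

    vad_a, vad_b: boolean voice-activity masks, one value per analysis frame.
    """
    a = vad_a.astype(bool)
    b = vad_b.astype(bool)
    either = np.logical_or(a, b).mean()    # fraction of frames where someone speaks
    both = np.logical_and(a, b).mean()     # fraction of frames where both speak
    if either == 0.0:
        return 0.0
    overlap_ratio = both / either          # 0 = clean turn-taking, 1 = constant overlap
    return float(either * (1.0 - overlap_ratio))

# Example: A and B alternate (likely same conversation); A and C always overlap.
frames = 200
a = np.zeros(frames, bool); a[0:80] = True
b = np.zeros(frames, bool); b[90:180] = True
c = np.zeros(frames, bool); c[0:80] = True
print(same_conversation_score(a, b), same_conversation_score(a, c))   # high, low
```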
- the resulting classification can be intuitively displayed for the user of the wearable device 100, e.g., a bounding box can turn red for unwanted sound source/s USoS and turn green for wanted sound source/s WSoS.
- the color of the bounding box can gradually change as the user's intentions indicate, e.g., the user's gaze at an unwanted sound source USoS correlated with the user's nodding his/her head may gradually change the bounding box from red to green.
- the manual classification signal produced by the user, for example with the purpose of changing the volume, loudness or sound pressure level (SPL) of the at least one sound source SoS, may be any of a head movement, hand movement or an eye movement of the user picked up by the at least one sensor 300, or a vocal sound by the user picked up by the at least one microphone 200.
- the gestures made by the user can be with his/her head (nodding, tilting, shaking, etc.), his/her hand (thumb up/down, etc.) or his/her eye/s (long blink with one or two eyes, etc.).
- the manual classification signal may be registered, i.a., by an accelerometer 300 to detect hand movements of the user, a camera 300 to detect hand gestures of the user, an eye tracker 300 to detect eye gaze (e.g., the user looking in the direction of the sound source SoS for a prolonged amount of time, such as when engaging in conversation) and/or eye blinking of the user, and/or at least one microphone 200 to detect vocal commands made by the user.
- the user can indicate a focused conversation mode with a single speaker by holding up his/her finger while looking at the single speaker or indicate a group conversation mode by holding up several of his/her fingers.
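- A minimal sketch of mapping detected gestures to classification actions; the gesture labels and the detector producing them are assumptions, not something prescribed by the embodiments:
```python
from typing import Callable, Dict

# Hypothetical gesture labels assumed to come from the sensor pipeline
# (accelerometer, camera or eye tracker); the names are illustrative only.
GESTURE_ACTIONS: Dict[str, str] = {
    "head_nod":        "classify_wanted",
    "head_shake":      "classify_unwanted",
    "thumb_up":        "classify_wanted",
    "thumb_down":      "classify_unwanted",
    "one_finger_up":   "focused_conversation_mode",
    "several_fingers": "group_conversation_mode",
    "long_blink":      "toggle_classification",
}

def handle_gesture(gesture: str, apply: Callable[[str], None]) -> None:
    """Dispatch a detected gesture to the classification logic."""
    action = GESTURE_ACTIONS.get(gesture)
    if action is not None:
        apply(action)

# Example: a nod while gazing at a source marks it as wanted.
handle_gesture("head_nod", lambda action: print("action:", action))
```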
- the manual classification signal produced by the user, e.g., manual gestures by the user, may be used for training the machine learning algorithms used for automatic classification.
- the manual classification signal produced by the user may be a primary mode of classifying the at least one sound source SoS as a wanted sound source WSoS or an unwanted sound source USoS, thus overriding the automatically determined classification.
- the sound pressure level (SPL) or loudness of the at least one sound source SoS classified as a wanted sound source WSoS is increased by a determined increase level. In another embodiment, the sound pressure level (SPL) or loudness of at least one sound source SoS classified as an unwanted sound source USoS is decreased by a determined decrease level.
- Loudness is the subjective perception of sound pressure, more formally defined as the attribute of auditory sensation in terms of which sounds can be ordered on a scale extending from quiet to loud.
- the perceived loudness of a sound depends on its fundamental frequency (f0), its sound pressure level (dB SPL), as well as various subjective factors.
- the determined increase level is defined as the increase in sound pressure level (dB SPL or a weighted SPL level, e.g., A-weighted SPL (dBA)) or loudness for the preferred speaker/sound that has been determined automatically or manually by the user, e.g., by the user pushing a button or switch operatively connected to the wearable device 100, or by the user making a gesture for volume adjustment of the preferred speaker/sound.
- the determined increase level can be within the range of 3-30 dB SPL, preferably within the range of 5-20 dB SPL and even more preferably within the range of 6-12 dB SPL, e.g., per pushed button or switch.
- the determined decrease level is defined as the decrease in sound pressure level (dB SPL or a weighted SPL level, e.g., A-weighted SPL (dBA)) or loudness for the unwanted speaker/s or sound/s that has been determined automatically or manually by the user, e.g., by the user pushing a button or switch operatively connected to the wearable device 100, or by the user making a gesture for volume adjustment of the unwanted speaker/s or sound/s.
- the determined decrease level can be within the range of 3-30 dB SPL, preferably within the range of 5-20 dB SPL and even more preferably within the range of 6-12 dB SPL, e.g., per pushed button or switch.
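- A minimal sketch of applying a determined increase level and decrease level, assuming the sound of each classified sound source SoS has already been separated (e.g., by beamforming); the 6 dB defaults are only an example from the ranges above:
```python
import numpy as np

def db_to_gain(level_db: float) -> float:
    """Convert a level change in dB to a linear amplitude factor."""
    return 10.0 ** (level_db / 20.0)

def mix_sources(sources, wanted_flags, increase_db=6.0, decrease_db=6.0):
    """Scale each separated source signal up or down and sum to one output.

    sources:      list of 1-D numpy arrays, one per classified sound source
    wanted_flags: list of bools, True for wanted sound sources
    """
    out = np.zeros_like(sources[0], dtype=float)
    for sig, wanted in zip(sources, wanted_flags):
        gain = db_to_gain(increase_db) if wanted else db_to_gain(-decrease_db)
        out += gain * sig
    return out

# A wanted source boosted by 6 dB (about x2) and an unwanted one cut by 6 dB (about x0.5).
s_wanted = np.ones(4)
s_unwanted = np.ones(4)
print(mix_sources([s_wanted, s_unwanted], [True, False]))
```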
- Machine learning algorithms may be used for determining the determined increase level and/or the determined decrease level based on one or more of: the at least one sound signal S, the visual sensor reading V and the movement sensor reading M.
- a database or list of speakers classified as wanted (or unwanted) by default can be created.
- the wanted (or unwanted) speaker/s is/are identified by obtaining vocal data from the at least one sound signal S and/or by facial data from the at least one sensor 300 and thereafter comparing the obtained vocal data and/or facial data with vocal data and/or facial data stored in the wanted speaker's list or unwanted speaker's list.
- vocal data is used for voice recognition of a speaker
- facial data is used for face or visual recognition of a speaker.
- sound lists of wanted and/or unwanted sounds can be created for automatic classification, containing, e.g., crying babies or commonly occurring nuisance noises such as drilling, dog barking, or sirens.
- sound lists of wanted and/or unwanted sounds can be populated by collecting statistics of labels classified manually.
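- A minimal sketch of the list lookup, assuming vocal or facial data has been reduced to fixed-length embeddings by a recognizer that is not shown; the cosine-similarity threshold is illustrative:
```python
from typing import List
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def lookup_default_class(embedding: np.ndarray,
                         wanted_list: List[np.ndarray],
                         unwanted_list: List[np.ndarray],
                         threshold: float = 0.8) -> str:
    """Compare a voice/face embedding against stored default lists and
    return 'wanted', 'unwanted' or 'unknown'."""
    best_wanted = max((cosine(embedding, e) for e in wanted_list), default=-1.0)
    best_unwanted = max((cosine(embedding, e) for e in unwanted_list), default=-1.0)
    best = max(best_wanted, best_unwanted)
    if best < threshold:
        return "unknown"
    return "wanted" if best_wanted >= best_unwanted else "unwanted"

# Toy example with 3-dimensional embeddings.
rng = np.random.default_rng(0)
known_friend = rng.normal(size=3)
print(lookup_default_class(known_friend + 0.01, [known_friend], []))   # wanted
```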
- Automatic classification of the at least one sound source SoS as unwanted USoS or wanted WSoS may also be based on speech content of the at least one speaker in the vicinity of the user. For example, certain words or sounds may immediately classify the at least one sound source SoS as unwanted USoS or wanted WSoS. Such words could be the user's name or a warning.
- the suppression (e.g., decrease in sound pressure level) of at least one sound source SoS could be indicated by the user by making a "sssh" sound or using the word "quiet".
- certain keywords may also be learned automatically.
- the speaker of the keyword may be added as a wanted sound source WSoS. Accordingly, a database or list of the user's favorite topics may be created and used for this purpose.
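- A minimal sketch of keyword-triggered classification, assuming a transcript from a speech-to-text step that is not shown; the keyword sets and the user name are placeholders:
```python
from typing import Optional

WANTED_KEYWORDS = {"help", "watch out", "fire"}   # e.g., warnings: classify the speaker as wanted
SUPPRESS_KEYWORDS = {"quiet", "sssh"}             # spoken by the user: suppress the current source
USER_NAME = "alex"                                # illustrative placeholder for the user's name

def classify_from_transcript(transcript: str, spoken_by_user: bool) -> Optional[str]:
    """Return an immediate classification action triggered by keywords,
    or None if the transcript contains no trigger."""
    text = transcript.lower()
    if spoken_by_user and any(k in text for k in SUPPRESS_KEYWORDS):
        return "suppress_current_source"
    if USER_NAME in text or any(k in text for k in WANTED_KEYWORDS):
        return "classify_speaker_wanted"
    return None

print(classify_from_transcript("Alex, watch out!", spoken_by_user=False))   # classify_speaker_wanted
print(classify_from_transcript("quiet please", spoken_by_user=True))        # suppress_current_source
```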
- the wearable device 100 is adapted to perform beamforming for determining the direction of and to spatially filter out the sound of the at least one sound source SoS.
- This beamforming information can be used by the wearable device 100 that is further adapted to perform 3D mapping of the at least one sound source SoS.
- Beamforming or spatial filtering is a signal processing technique whereby radio or sound signals can be steered in a specific direction, undesirable interference sources can be suppressed and/or the signal-to-noise ratio (SNR) of received signals can be improved.
- Beamforming is widely used in, e.g., radar and sonar systems, biomedical applications, and particularly in communications (telecom, Wi-Fi), especially 5G.
- beamforming can be used in sensor arrays for directional signal transmission or reception. This is achieved by combining elements in an antenna array in such a way that signals at particular angles experience constructive interference while others experience destructive interference. Beamforming can be used at both the transmitting and receiving ends in order to achieve spatial selectivity. The improvement compared with omnidirectional reception/transmission is known as the directivity of the array.
- State-of-the-art methods exist such that if the angle of arrival of an unwanted sound source is known, it can be dampened by creating a spatial filter that maintains the main beam in the direction towards the wanted sound source while attenuating the unwanted sound source/s.
- a typical example of such a system would maintain a beam straight ahead as a proxy for direction of a wanted sound source while using signal processing techniques to adaptively estimate the angle of arrival of unwanted sound source/s.
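- A minimal delay-and-sum sketch for a linear microphone array 200, 400, steering the main beam toward a chosen direction; far-field propagation, a known array geometry and a single steering angle are assumed (a practical system would also attenuate the directions of unwanted sound sources USoS):
```python
import numpy as np

def delay_and_sum(mics: np.ndarray, mic_x: np.ndarray, theta_deg: float,
                  fs: float, c: float = 343.0) -> np.ndarray:
    """Steer a linear microphone array toward angle theta_deg (0 = broadside)
    by aligning per-microphone arrival delays in the frequency domain and
    averaging the channels.

    mics:  (M, N) array, one row of N samples per microphone
    mic_x: (M,) microphone positions along the array axis in metres
    """
    _, n = mics.shape
    delays = mic_x * np.sin(np.deg2rad(theta_deg)) / c      # arrival delay per mic (s)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectra = np.fft.rfft(mics, axis=1)
    # Advancing each channel by its own delay makes the wanted direction add coherently.
    spectra *= np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft(spectra.mean(axis=0), n=n)

# Toy check: a 1 kHz tone from 30 degrees recorded by a 4-microphone array
# with 5 cm spacing adds up coherently after steering toward 30 degrees.
fs, n = 16000, 1024
t = np.arange(n) / fs
mic_x = np.arange(4) * 0.05
true_delays = mic_x * np.sin(np.deg2rad(30.0)) / 343.0
mics = np.stack([np.sin(2 * np.pi * 1000.0 * (t - d)) for d in true_delays])
steered = delay_and_sum(mics, mic_x, theta_deg=30.0, fs=fs)
print(round(float(np.max(np.abs(steered))), 2))             # close to 1.0
```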
- the direction of the at least one sound source SoS can be determined in current embodiments.
- a microphone array 200, 400 is needed.
- the signal fidelity can be improved, e.g., with non-coherent integration of each sound source's beamformed signal or by estimating the most probable sentence using the output from the speech-to-text algorithm of each nearby device.
- Communication between the nearby devices allows the best combination of microphones in the vicinity of the user to be employed while avoiding microphones that mostly contribute with noise.
- the wearable device 100 is fitted with microphones 200, 400 as in Fig. 1.
- the sound received by the microphones 200, 400 will be beamformed to allow for improved signal quality in the direction of the wanted sound source/s WSoS.
- possible sound sources can be visually identified based on the visual sensor reading V and the position of the wearable device 100 can be used to indicate that it is a wanted sound source WSoS.
- a simple realization would decide the sound source SoS as wanted if the total time for which the wearable device 100 is approximately centered on the sound source SoS exceeds such time estimates for other sound sources.
- a more complicated realization would take into account that a speaker in the vicinity of the user will normally take turns when speaking and, as a consequence, a wanted sound source WSoS would not so easily be removed during a time interval of silence.
- An even more advanced realization would also attempt to interpret the intent of the user and quickly add a new wanted sound source WSoS when a new speaker joins in the conversation, the latter of which could be indicated by an obvious change in position of the wearable device 100 as the user looks in the direction of the new speaker.
- the direction to the sound source/s SoS must be known relative to the wearable device 100. This estimate can be made based on the visual sensor reading V. Firstly, a computer vision algorithm can identify possible sound source/s SoS. Secondly, the total time each of the sound source/s SoS is in focus is estimated. The relative direction to the sound sources found in such a manner is easily found and tracked using SLAM methods known to the skilled person. For the unwanted sound source/s USoS, the direction is found in the same manner so that the beamformer can attenuate any sound coming from that direction.
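- A minimal sketch of the dwell-time selection described above, assuming per-source angles relative to the gaze direction are provided by the SLAM/vision pipeline; the tolerance and hold time are illustrative:
```python
from collections import defaultdict
from typing import Dict, Optional

class DwellTimeSelector:
    """Pick as wanted the sound source the wearable device has been
    approximately centered on for the longest total time, with a hold period
    so a speaker is not dropped during short pauses in the conversation."""

    def __init__(self, center_tolerance_deg: float = 10.0, hold_s: float = 3.0):
        self.tol = center_tolerance_deg
        self.hold_s = hold_s
        self.dwell: Dict[str, float] = defaultdict(float)   # accumulated focus time per source
        self.last_focus: Dict[str, float] = {}              # last time each source was centered
        self.t = 0.0
        self.wanted: Optional[str] = None

    def update(self, angles_deg: Dict[str, float], dt: float) -> Optional[str]:
        """angles_deg: per-source angle between the device's gaze direction and the source."""
        self.t += dt
        for sid, angle in angles_deg.items():
            if abs(angle) <= self.tol:
                self.dwell[sid] += dt
                self.last_focus[sid] = self.t
        if self.dwell:
            best = max(self.dwell, key=lambda s: self.dwell[s])
            # Only switch away from the current wanted source if it has not
            # been in focus for longer than the hold period.
            if (self.wanted is None
                    or self.t - self.last_focus.get(self.wanted, -1e9) > self.hold_s):
                self.wanted = best
        return self.wanted

# The user keeps source "B" near the center of view, so "B" becomes the wanted source.
selector = DwellTimeSelector()
for _ in range(50):
    selector.update({"A": 40.0, "B": 2.0}, dt=0.1)
print(selector.wanted)   # B
```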
- two unwanted sound sources USoS are identified.
- the beamforming will be applied to improve the signal quality in the direction of the wanted sound source/s WSoS while attenuating the unwanted sources USoS.
- the one or more sensor/s 300 can be used to create a three-dimensional (3D) map in which the sound source/s SoS can be positioned, and each sound source SoS classified as wanted or unwanted.
- the position of the wearable device 100 may be estimated simultaneously and tracked in the same 3D map using simultaneous localization and mapping (SLAM).
- SLAM simultaneous localization and mapping
- Sparse feature-based SLAM is used by resource-constrained devices such as the wearable device 100.
- Such algorithms create sparse 3D maps based on measurements of the relative pose changes between consecutive camera frames as well as relative measurements from the camera pose to features detected in the images and tracked across multiple images. Examples of such feature detectors include SIFT, ORB, and SURF.
- the relative pose changes are improved if an inertial measurement unit (IMU) measurement is also available.
- IMU inertial measurement unit
- the resulting 3D map is an optimized estimate of the position of features and poses of the camera, using techniques such as bundle adjustment or filtering.
- the sparse features overlapping the position of the sound source in the image can be tagged with metadata to indicate that they correspond to a sound source.
- 3D reconstruction can also be employed to create a dense representation of the sparse features.
- more elaborate techniques are available for creating dense 3D metric semantic maps. In those cases, similarly, the semantic information is based on the semantic segmentation of the 2D images.
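- A minimal sketch of attaching sound-source metadata to sparse map points, assuming a pinhole camera model, a known camera pose from SLAM and a 2D detection bounding box for the sound source; the intrinsics and the box below are toy values:
```python
import numpy as np

def tag_sound_source_points(points_w: np.ndarray, R_wc: np.ndarray, t_wc: np.ndarray,
                            K: np.ndarray, bbox: tuple) -> np.ndarray:
    """Return a boolean mask over sparse map points that project inside the
    2-D bounding box of a detected sound source.

    points_w:   (N, 3) landmarks in world coordinates
    R_wc, t_wc: camera pose (world-to-camera rotation and translation)
    K:          (3, 3) pinhole intrinsics
    bbox:       (u_min, v_min, u_max, v_max) in pixels
    """
    p_c = (R_wc @ points_w.T).T + t_wc            # world frame -> camera frame
    in_front = p_c[:, 2] > 0.0                    # keep only points in front of the camera
    uv = (K @ p_c.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-9, None)
    u_min, v_min, u_max, v_max = bbox
    inside = ((uv[:, 0] >= u_min) & (uv[:, 0] <= u_max) &
              (uv[:, 1] >= v_min) & (uv[:, 1] <= v_max))
    return inside & in_front

# Toy example: identity pose, simple intrinsics, one landmark inside the detection box.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 0.0, 2.0],     # projects to the image centre
                [1.0, 0.0, 2.0]])    # projects far to the right
mask = tag_sound_source_points(pts, np.eye(3), np.zeros(3), K, (300, 220, 340, 260))
print(mask)                           # [ True False ]
```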
- one or more sensor/s 300 operatively connected to the wearable device 100 can be used to map sound sources SoS with their appearance and beamforming direction maintained, even if the eye gaze of the user changes momentarily.
- a machine learning algorithm can be devised to interpret the intent of the user to decide if a new beamforming direction is wanted or the current one is to be maintained.
- the wearable device 100 further comprises means for displaying speech-to-text to the user, the speech-to-text being based on the at least one wanted sound source WSoS.
- Speech-to-text can generally be defined as speech recognition that allows spoken words to be converted into written text, often in real-time.
- the speech-to-text can be presented to the user in the wearable device 100 as subtitles in real-time of the speech spoken by the speaker that is currently in focus, i.e., by the speaker that the user is looking at, for example detected by an eye tracker 300 operatively connected to the wearable device 100 worn by the user.
- the earphones/hearing aids can be wirelessly connected to the wearable device 100 and present the beamformed sound to the user, so that the user can both hear the sound from the direction he/she focuses on and see subtitles when there is speech.
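- A minimal sketch of presenting the focused speaker to the user, assuming beamformed per-source audio is available and using placeholder callables for the speech-to-text engine, the earphone output and the subtitle display (none of which are specified here):
```python
from typing import Callable, Dict
import numpy as np

def present_focused_speaker(
    beamformed: Dict[str, np.ndarray],        # per-source beamformed audio frames
    focused_source: str,                      # id of the source the user is looking at
    transcribe: Callable[[np.ndarray], str],  # placeholder speech-to-text engine
    play: Callable[[np.ndarray], None],       # placeholder earphone/hearing-aid output
    show_subtitle: Callable[[str], None],     # placeholder AR display output
) -> None:
    """Route the focused speaker's beamformed audio to the earphones and show
    real-time subtitles for that speaker only."""
    audio = beamformed.get(focused_source)
    if audio is None:
        return
    play(audio)
    text = transcribe(audio)
    if text:
        show_subtitle(text)

# Example with stub callables standing in for the real engines.
present_focused_speaker(
    {"speaker_1": np.zeros(1600)}, "speaker_1",
    transcribe=lambda a: "hello there",
    play=lambda a: None,
    show_subtitle=print,
)
```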
- the direction of the beam should be considered using head-related transfer functions (HRTFs), the latter of which generally describe how an ear receives sound from a sound source.
- HRTF head-related transfer function
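- A minimal sketch of such HRTF-based rendering, assuming a mono beamformed signal and head-related impulse responses for the source direction; the impulse responses below are toy placeholders rather than measured HRTFs:
```python
import numpy as np

def spatialize(signal: np.ndarray, hrir_left: np.ndarray,
               hrir_right: np.ndarray) -> np.ndarray:
    """Render a mono beamformed signal as binaural stereo by convolving it with
    head-related impulse responses for the source direction."""
    left = np.convolve(signal, hrir_left)[: len(signal)]
    right = np.convolve(signal, hrir_right)[: len(signal)]
    return np.stack([left, right], axis=0)

# Toy HRIRs: a source to the right reaches the right ear earlier and louder.
fs = 16000
hrir_r = np.zeros(64); hrir_r[0] = 1.0       # no extra delay, full level
hrir_l = np.zeros(64); hrir_l[10] = 0.6      # about 0.6 ms later, attenuated
mono = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
stereo = spatialize(mono, hrir_l, hrir_r)
print(stereo.shape)                           # (2, 16000)
```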
- the method 110 comprises acquiring 120, from at least one microphone 200 operatively connected to the wearable device 100, at least one sound signal S indicative of a sound environment in the vicinity of the user.
- the method 110 comprises acquiring 130, from at least one sensor 300 operatively connected to the wearable device 100, a visual sensor reading V indicative of a visual environment in the vicinity of the user and/or a movement sensor reading M indicative of a movement in the vicinity of the user.
- the method 110 further comprises identifying 140, from one or more of: the at least one sound signal S, the visual sensor reading V and the movement sensor reading M, at least one sound source SoS in the vicinity of the user.
- the method 110 comprises classifying 150 the at least one sound source SoS as a wanted sound source WSoS or an unwanted sound source USoS from acquired input data, the acquired input data being based on one or more of: the at least one sound signal S, the visual sensor reading V and the movement sensor reading M.
- the method 110 is performed by a processing circuitry in the wearable device 100 or a cloud operatively connected to the wearable device 100.
- the wearable device 100 may comprise one of a helmet, a hat, a headset, earphones, augmented reality glasses, and virtual reality glasses.
- Fig. 3 illustrates a computer program C comprising instructions, which when executed by processing circuitry or a cloud, carries out the method 110.
- Fig. 3 also illustrates a computer program product comprising a non-transitory storage medium including program code to be executed by a processing circuitry of a wearable device 100 or a cloud operatively connected to the wearable device 100, whereby execution of the program code causes the wearable device 100 to perform operations comprising acquiring, from at least one microphone 200 operatively connected to the wearable device 100, at least one sound signal S indicative of a sound environment in the vicinity of the user, as well as acquiring, from at least one sensor 300 operatively connected to the wearable device 100, a visual sensor reading V indicative of a visual environment in the vicinity of the user and/or a movement sensor reading M indicative of a movement in the vicinity of the user.
- the operations further comprising identifying, from one or more of: the at least one sound signal S, the visual sensor reading V and the movement sensor reading M, at least one sound source SoS in the vicinity of the user.
- the operations comprising classifying the at least one sound source SoS as a wanted sound source WSoS or an unwanted sound source USoS from acquired input data, the acquired input data being based on one or more of: the at least one sound signal S, the visual sensor reading V and the movement sensor reading M.
- the wearable device 100 comprises one of a helmet, a hat, a headset, earphones, augmented reality glasses, and virtual reality glasses.
- the at least one microphone 200 is a microphone array 200, 400.
- the wearable device 100, e.g., AR glasses
- the wearable device 100 can cooperate with wireless earphones/hearing aids so that the user is both presented with sound from the direction where he/she focuses, as well as with subtitles in case there is speech from that direction.
- the AR glasses can work independently using at least one built-in microphone 200 or in cooperation with other devices with microphones to improve performance, e.g., mobile phones and dedicated devices with built-in microphone arrays used at parties and meetings. Visual cues from the AR glasses can be employed for selecting which devices with microphones to use.
- the AR glasses offer many opportunities for convenient user interfaces for the control, handling, or classification of different sound sources and modes of operation, by accelerometers detecting head movements, eye trackers detecting gaze direction and blinking, microphones detecting sounds, and cameras detecting hand gestures.
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Otolaryngology (AREA)
- General Health & Medical Sciences (AREA)
- Neurosurgery (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Circuit For Audible Band Transducer (AREA)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/SE2023/050435 WO2024228650A1 (en) | 2023-05-04 | 2023-05-04 | Sound classification in noisy environments |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024228650A1 (en) | 2024-11-07 |
Family
ID=93333208
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/SE2023/050435 (WO2024228650A1, pending) | Sound classification in noisy environments | 2023-05-04 | 2023-05-04 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2024228650A1 (en) |
Citations (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20170188173A1 (en) * | 2015-12-23 | 2017-06-29 | Ecole Polytechnique Federale De Lausanne (Epfl) | Method and apparatus for presenting to a user of a wearable apparatus additional information related to an audio scene |
| US9961435B1 (en) * | 2015-12-10 | 2018-05-01 | Amazon Technologies, Inc. | Smart earphones |
| US10361673B1 (en) * | 2018-07-24 | 2019-07-23 | Sony Interactive Entertainment Inc. | Ambient sound activated headphone |
| WO2019246562A1 (en) * | 2018-06-21 | 2019-12-26 | Magic Leap, Inc. | Wearable system speech processing |
| US10595149B1 (en) * | 2018-12-04 | 2020-03-17 | Facebook Technologies, Llc | Audio augmentation using environmental data |
| US20200327877A1 (en) * | 2019-04-09 | 2020-10-15 | Facebook Technologies, Llc | Acoustic transfer function personalization using sound scene analysis and beamforming |
| WO2021136962A1 (en) * | 2020-01-03 | 2021-07-08 | Orcam Technologies Ltd. | Hearing aid systems and methods |
| US20220159403A1 (en) * | 2019-08-06 | 2022-05-19 | Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. | System and method for assisting selective hearing |
| EP4132010A2 (en) * | 2021-08-06 | 2023-02-08 | Oticon A/s | A hearing system and a method for personalizing a hearing aid |
| US20230329913A1 (en) * | 2022-03-21 | 2023-10-19 | Li Creative Technologies Inc. | Hearing protection and situational awareness system |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23935835; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2023935835; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2023935835; Country of ref document: EP; Effective date: 20251204 |