WO2021118106A1 - Electronic apparatus and control method therefor - Google Patents
- Publication number
- WO2021118106A1 (PCT Application No. PCT/KR2020/016638)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- signal
- audio
- intelligibility
- audio signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0316—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
- G10L21/0364—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R5/00—Stereophonic arrangements
- H04R5/04—Circuit arrangements, e.g. for selective connection of amplifier inputs/outputs to loudspeakers, for loudspeaker detection, or for adaptation of settings to personal preferences or hearing impairments
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0316—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
- G10L21/0324—Details of processing therefor
- G10L21/034—Automatic adjustment
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/04—Time compression or expansion
- G10L21/057—Time compression or expansion for improving intelligibility
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Definitions
- the disclosure relates to an electronic apparatus and a controlling method thereof. More particularly, the disclosure relates to an electronic apparatus performing an operation corresponding to a user’s speech and a controlling method of an electronic apparatus.
- a non-speech volume is measured by tracking a minimum value of the power for each frequency band; with this approach, it is impossible to properly measure non-speech that abruptly increases (as opposed to non-speech that is maintained constant), and non-speech cannot be accurately measured due to a sensitivity problem of a recording microphone, post-correction, or the like.
- when the parameters related to a final output are adjusted by measuring the probability of speech for each frequency band, the speech and the non-speech of the same band increase together in the output.
- an electronic apparatus capable of controlling speech intelligibility more accurately and a method for controlling thereof.
- an electronic apparatus capable of controlling speech intelligibility optimally, in consideration of a producing intention of audio content, and a controlling method thereof.
- an electronic apparatus performing an operation corresponding to a user’s speech and a controlling method thereof.
- an electronic apparatus includes an inputter and a processor configured to, based on receiving an audio signal through the inputter, obtain a speech intelligibility included in the audio signal, and modify the audio signal so that the speech intelligibility becomes a target intelligibility that is set based on scene information regarding a type of audio included in the audio signal, and the type of audio includes at least one of a sound effect, shouting, music, or a speech.
- the processor may be further configured to calculate the speech intelligibility based on a speech signal and a non-speech signal other than the speech signal, included in the audio signal.
- the processor may be further configured to extract the speech signal included in the audio signal using an artificial intelligence model trained to extract speech signals included in audio signals, and to extract, from the audio signal, one or more remaining signals other than the extracted speech signal, as the non-speech signal.
- the speech intelligibility may be one of a signal to noise ratio (SNR) of the speech signal and the non-speech signal included in the audio signal and a speech intelligibility index (SII) based on the speech signal and the non-speech signal.
- the speech intelligibility may be the SNR, and the processor may be further configured to adjust a gain of the speech signal by as much as a difference value between the target intelligibility and the obtained speech intelligibility to modify the audio signal.
- the speech intelligibility may be the SII
- the processor may be further configured to calculate a gain adjustment value as α × (SII_target − SII_measurement) + β and adjust a gain of the speech signal by as much as the calculated gain adjustment value to modify the audio signal, where SII_target denotes the target intelligibility, SII_measurement denotes the obtained speech intelligibility, and α and β denote constant values experimentally calculated through a change in a value of the SII over a change in the gain of the speech signal.
- the processor may be further configured to obtain at least one audio feature with respect to the audio signal and obtain the scene information based on the obtained at least one audio feature.
- the processor may be further configured to obtain the scene information using an artificial intelligence model trained to distinguish audio types included in audio signals.
- the target intelligibility may be set differently with respect to different audio types.
- when the audio type is the sound effect, the target intelligibility may be set to be higher than in a case in which the audio type is the shouting.
- a method of controlling an electronic apparatus includes obtaining an audio signal, obtaining a speech intelligibility for the audio signal based on the audio signal, and modifying the audio signal so that the speech intelligibility becomes a target intelligibility that is set based on scene information regarding a type of audio included in the audio signal, and the type of audio includes at least one of a sound effect, shouting, music, or a speech.
- the obtaining the speech intelligibility may comprise calculating the speech intelligibility based on a speech signal and a non-speech signal other than the speech signal, included in the audio signal.
- the obtaining the speech intelligibility may comprise extracting the speech signal included in the audio signal using an artificial intelligence model trained to extract speech signals included in audio signals; and extracting, from the audio signal, one or more remaining signals other than the extracted speech signal, as the non-speech signal.
- the speech intelligibility may be one of a signal to noise ratio (SNR) of the speech signal and the non-speech signal included in the audio signal and a speech intelligibility index (SII) based on the speech signal and the non-speech signal.
- the speech intelligibility may be the SNR
- the modifying may comprise adjusting a gain of the speech signal by as much as a difference value between the target intelligibility and the obtained speech intelligibility to modify the audio signal.
- the speech intelligibility may be the SII
- the modifying may comprise calculating a gain adjustment value as α × (SII_target − SII_measurement) + β and adjusting a gain of the speech signal by as much as the calculated gain adjustment value to modify the audio signal, where SII_target denotes the target intelligibility, SII_measurement denotes the obtained speech intelligibility, and α and β denote constant values experimentally calculated through a change in a value of the SII over a change in the gain of the speech signal.
- the method of controlling an electronic apparatus may further comprise obtaining at least one audio feature with respect to the audio signal and obtaining the scene information based on the obtained at least one audio feature.
- the method of controlling an electronic apparatus may further comprise obtaining the scene information using an artificial intelligence model trained to distinguish audio types included in audio signals.
- the target intelligibility may be set differently with respect to different audio types.
- an electronic apparatus includes a memory storing instructions; and a processor configured to execute the instructions to: obtain a speech intelligibility for an audio signal, and modify the audio signal so that the speech intelligibility becomes a target intelligibility, wherein the target intelligibility is set based on a determined type of audio included in the audio signal.
- a non-transitory computer-readable recording medium has recorded thereon instructions executable by at least one processor to perform the method of controlling the electronic apparatus.
- speech intelligibility can be more accurately controlled.
- optimal speech intelligibility can be adjusted by reflecting the producing intention of the audio content producer.
- the user can be provided with an optimal sound experience.
- FIG. 1 is a diagram illustrating an environment in which an audio content including an audio signal is provided to an electronic apparatus through a network, according to an embodiment.
- FIG. 2 is a block diagram of an electronic apparatus according to an embodiment.
- FIG. 3 is a functional block diagram of a processor according to an embodiment.
- FIG. 4 is a graph illustrating speech recognition accuracy according to a speech intelligibility index.
- FIG. 5 is a detailed block diagram of an electronic apparatus according to an embodiment.
- FIG. 6 is a flowchart illustrating a method for controlling an electronic apparatus according to an embodiment.
- As used herein, the terms “first,” “second,” and the like may identify corresponding components, regardless of order and/or importance, and are used to distinguish one component from another without otherwise limiting the components.
- when an element (e.g., a first element) is referred to as being coupled or connected to another element (e.g., a second element), the element may be connected to the other element directly or through still another element (e.g., a third element).
- when one element (e.g., a first element) is referred to as being directly coupled or directly connected to another element (e.g., a second element), there is no element (e.g., a third element) between them.
- FIG. 1 is a diagram illustrating an environment in which an audio content including an audio signal is provided to an electronic apparatus 100-1 to 100-4 through a network, according to an embodiment.
- an audio content may be provided to the electronic apparatus 100-1 to 100-4 from a broadcast transmitting station 1, a satellite 2, a content providing server 3, or the like, through a communication medium 5 (e.g., a network or the Internet).
- the audio content may be composed of a multi-channel audio signal such as a stereo channel audio signal or a 5.1 channel audio signal, but is not limited thereto and may be composed of a single channel audio signal, a 7.1 channel audio signal, a 5.2 channel audio signal, etc.
- the audio content may be provided to the electronic apparatuses 100-1 to 100-4 alone, depending on the type of content and/or the type of electronic apparatus, or may be provided to the electronic apparatuses 100-1 to 100-4 along with video content.
- the broadcast transmitting station 1 may include a transmitter or a repeater for transmitting terrestrial broadcast content.
- the satellite 2 may include a satellite for communicating data or satellite broadcast content.
- the content providing server 3 may be a server on a communication network that provides broadcast content for Internet Protocol television (IPTV), broadcast content for cable television (TV), various sound source content, a video on demand (VOD) content, etc.
- the communication medium 5 may include an air medium or a constructed communication network.
- the communication network may include a wireless cell network, Internet, wide area network (WAN), local area network (LAN), a wired phone network, a cable network, or the like.
- the electronic apparatuses 100-1 to 100-4 may include not only an audio device 100-3 capable of reproducing only an audio content but also a display device 100-1, 100-2, and 100-4 capable of reproducing video and audio together.
- the display devices 100-1, 100-2, and 100-4 are devices including a display for reproducing a video and outputting audio through a speaker, such as a smart TV, a monitor, a smartphone, a desktop computer, a laptop computer, a tablet, a navigation device, digital signage, or the like.
- the audio device 100-3 is an electronic apparatus configured to reproduce and output only audio and, for example, the audio device 100-3 may include a radio device, an audio device, a phonograph, a speech recognition speaker device, a compact disk player equipped with a speaker, a digital audio player (DAP), an audio device for a vehicle, a home appliance equipped with a speaker, a sound bar, various devices capable of performing an output operation of sound, or the like.
- the electronic apparatus 100-1 to 100-4 may process the received audio signal to generate an output signal, and may output the generated output signal through at least one speaker.
- the at least one speaker may be provided in the electronic apparatuses 100-1 to 100-4, and/or may be separately disposed outside the electronic apparatuses 100-1 to 100-4 according to an embodiment.
- the electronic apparatuses 100-1 to 100-4 may identify (or obtain, determine, calculate, etc.) the intelligibility of speech (e.g., a speech intelligibility value) included in the received audio signal and correct or modify the audio signal so that the identified speech intelligibility becomes a target intelligibility (e.g., a target intelligibility value) and generate an output signal.
- the target intelligibility may be set based on scene information regarding a type of audio included in the received audio signal (e.g., sound effect, shouting, music, a speech, etc.)
- the electronic apparatuses 100-1 to 100-4 may separate the speech signal and the non-speech signal from the received audio signal and identify the intelligibility of the speech based on the separated speech signal and the non-speech signal.
- the electronic apparatuses 100-1 to 100-4 may, unlike the related art that measures probability of speech by frequency bands for adjusting parameters related to the final output, adjust speech intelligibility by performing a gain adjustment of at least one of a separated speech signal and non-speech signal, or performing various processing for the separated speech signal and the non-speech signal.
- the electronic apparatuses 100-1 to 100-4 may set the target intelligibility based on scene information regarding the type of audio included in the audio signal, unlike the related art that only performs the operation of increasing speech intelligibility for all kinds of content being input. Accordingly, the electronic apparatuses 100-1 to 100-4 may correct the audio signal such that the intelligibility of the speech of the received audio signal is the set target intelligibility.
- the speech signal and the non-speech signal may be separated from the audio signal to identify the speech intelligibility, and at least one of the separated speech signal and the non-speech signal is processed to adjust the speech intelligibility, so that the speech intelligibility can be more accurately adjusted.
- since the target intelligibility is set based on the scene information, adjustment of intelligibility of speech may be performed differently for each audio type, and a producing intention of an audio content producer may be reflected therethrough.
- while the audio content is provided through the communication medium 5 from the outside of the electronic apparatuses 100-1 to 100-4 in FIG. 1, it is understood that one or more other embodiments are not limited thereto.
- the audio content may be provided to the electronic apparatuses 100-1 to 100-4 through a portable storage medium such as a universal serial bus (USB), a secure digital (SD) memory card, or the like, various optical storage medium, or the like.
- the audio content may be stored in a storage of the electronic apparatuses 100-1 to 100-4 (e.g., a hard disk drive (HDD), a solid state drive (SSD), a system memory (ROM, BIOS, etc.), etc.) and output by the electronic apparatuses 100-1 to 100-4 (e.g., in response to or based on a user’s request).
- FIG. 2 is a block diagram of an electronic apparatus 100 according to an embodiment.
- the electronic apparatus 100 includes an inputter 110 and a processor 120.
- the inputter 110 may receive an audio signal and provide the received audio signal to the processor 120.
- the audio signal can be provided to the electronic apparatus 100 through a communication medium 5 or through an external portable storage medium. Accordingly, various wired and wireless communication interfaces for receiving an audio signal can perform functions of the inputter 110.
- the audio signal may be provided to the processor 120 from the storage included in the electronic apparatus 100 and in this case, the storage included in the electronic apparatus 100 may perform a function of the inputter 110.
- the processor 120 controls overall operations of the electronic apparatus 100.
- the processor 120, based on receiving an audio signal through the inputter 110, may identify the intelligibility of speech included in the audio signal based on the received audio signal.
- the processor 120 may identify the speech intelligibility based on the speech signal included in the audio signal and the non-speech signal excluding the speech signal from the audio signal.
- the processor 120 may extract a speech signal from the audio signal and extract the remaining signal except the extracted speech signal as a non-speech signal.
- the processor 120 can extract a speech signal from an audio signal received through the inputter 110 by using an artificial intelligence model trained to extract a speech signal from an audio signal. It is understood, however, that one or more other embodiments are not limited thereto.
- the processor 120 may identify the speech intelligibility included in the audio signal based on the extracted speech signal and the non-speech signal.
- the processor 120 may calculate a signal to noise ratio (SNR) of the extracted speech signal and the non-speech signal, and can identify the calculated SNR as the speech intelligibility.
- the processor 120 can calculate a speech intelligibility index (SII) based on the extracted speech signal and the non-speech signal, and may identify the calculated speech intelligibility index as the speech intelligibility.
- the processor 120 can correct the audio signal so that the identified speech intelligibility becomes the target intelligibility.
- when the identified speech intelligibility is the SNR, the target intelligibility also has an SNR value; likewise, when the identified speech intelligibility is the speech intelligibility index, the target intelligibility also has a speech intelligibility index value.
- the processor 120 may adjust a gain of the speech signal by as much as a difference value between the target intelligibility and the identified speech intelligibility to correct an audio signal.
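When the intelligibility is expressed as an SNR, the adjustment above amounts to scaling the speech signal by the dB difference between the target and measured SNR. The sketch below is a minimal illustration under that reading, not the patented implementation; the function name and the use of mean-square power are assumptions.

```python
import numpy as np

def adjust_speech_gain_snr(speech, non_speech, target_snr_db):
    """Scale the separated speech signal so that its SNR against the
    non-speech signal reaches target_snr_db (the target intelligibility)."""
    p_speech = np.mean(speech ** 2)
    p_non_speech = np.mean(non_speech ** 2)
    measured_snr_db = 10.0 * np.log10(p_speech / p_non_speech)
    # The gain (in dB) is the difference between target and measured values.
    gain_db = target_snr_db - measured_snr_db
    return speech * 10.0 ** (gain_db / 20.0)
```

The corrected audio signal would then be the sum of the scaled speech signal and the unchanged non-speech signal.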
- the processor 120 can calculate a gain adjustment value based on Equation 1 below, and may adjust the gain of the speech signal by the calculated gain adjustment value to correct the audio signal.
- [Equation 1] Gain adjustment value = α × (SII_target − SII_measurement) + β
- here, SII_target is the target intelligibility in the speech intelligibility index format, SII_measurement is the identified speech intelligibility in the speech intelligibility index format, and α and β are constant values experimentally calculated through a numerical change of the speech intelligibility index according to a gain change of a speech signal.
- the method for calculating the gain adjustment value is not limited to the above equation 1.
- the processor 120 can obtain a more sophisticated gain adjustment value by using a quadratic equation such as α₁ × (SII_target − SII_measurement)² + α₂ × (SII_target − SII_measurement) + β, or a regression of a higher order.
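The linear mapping of Equation 1 and the quadratic variant mentioned above can be sketched as follows. α and β (and α₁, α₂) are experimentally determined constants, so the defaults below are placeholders, not values from the source.

```python
def sii_gain_adjustment_db(sii_target, sii_measured, alpha=30.0, beta=0.0):
    """Equation 1 style: linear map from an SII deficit to a dB gain."""
    return alpha * (sii_target - sii_measured) + beta

def sii_gain_adjustment_quad_db(sii_target, sii_measured,
                                alpha1=10.0, alpha2=25.0, beta=0.0):
    """Quadratic variant for a more sophisticated adjustment value."""
    d = sii_target - sii_measured
    return alpha1 * d ** 2 + alpha2 * d + beta
```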
- the processor 120 may obtain the gain adjustment value from specific index values (for example, SII, speech transmission index (STI) described below, or the like).
- the audio signal in which speech intelligibility is adjusted may be output through at least one speaker disposed inside or outside the electronic apparatus 100.
- the target intelligibility described above can have a specific value for each type of audio as a value set based on scene information regarding the type of audio included in the audio signal.
- the specific value may be directly set as the target intelligibility value, such as 0.6 for sound effect, 0.5 for shout, and 0.4 for music.
- target intelligibility may be set to a percentage value of intelligibility to be adjusted for each audio type.
- the percentage value of intelligibility to be adjusted may be set to target intelligibility, such as +10% for sound effect, -10% for shout, 0% for music.
- the processor 120 can calculate the actual target intelligibility value by applying the percentage value of the intelligibility to be adjusted to the currently measured speech intelligibility.
- the target intelligibility may be stored as a mapping table in a storage preset by audio types, and the processor 120 may check the target intelligibility value corresponding to the scene information with reference to the mapping table.
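The two ways of setting the target described above (an absolute value per audio type, or a percentage adjustment applied to the measured intelligibility) might look like the sketch below. The table values mirror the examples in the text; the function names are hypothetical.

```python
# Mapping tables preset by audio type (values mirror the examples above).
TARGET_SII = {"sound_effect": 0.6, "shouting": 0.5, "music": 0.4}
TARGET_ADJUST_PCT = {"sound_effect": 10.0, "shouting": -10.0, "music": 0.0}

def target_from_table(scene):
    """Absolute target intelligibility looked up from the mapping table."""
    return TARGET_SII[scene]

def target_from_percentage(scene, measured_sii):
    """Relative target: adjust the measured intelligibility by a percentage."""
    return measured_sii * (1.0 + TARGET_ADJUST_PCT[scene] / 100.0)
```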
- the scene information is a sub-concept of genre information and may include information on which of sound effect, shouting, music, and speech the type of audio included in the audio signal corresponds to.
- audio content of a “movie” genre can include various kinds of audio such as sound effect, shout, and music, and at this time, each audio type such as speech, sound effect, shout, and music can be scenes included in the audio signal.
- the processor 120 can obtain at least one audio feature for an audio signal and obtain scene information based on the at least one obtained audio feature.
- the processor 120 can obtain scene information using an artificial intelligence model trained to identify the type of audio included in the audio signal.
- the target intelligibility can be set according to scene information of the obtained audio signal. Specifically, the target intelligibility can be set differently for each type of audio included in the scene information. For example, the target intelligibility can be set higher when the type of audio is a sound effect than when the type of audio is shout, although it is understood that one or more other embodiments are not limited thereto.
- a producing intention of a content producer may be reflected or considered in an intelligibility adjustment by setting a target intelligibility value based on scene information and adjusting an audio signal based thereon.
- FIG. 3 is a functional block diagram of a processor 120 according to an embodiment.
- the processor 120 may include a speech/non-speech separator 121, a speech intelligibility analyzer 122, a scene analyzer 123, and a speech intelligibility renderer 124.
- the speech/non-speech separator 121 may separate and/or extract a speech signal and a non-speech signal from an audio signal received through the inputter 110.
- the speech/non-speech separator 121 may extract a speech signal from an audio signal and extract the remaining signal(s) other than the extracted speech signal as a non-speech signal.
- the speech/non-speech separator 121 may extract a signal of the corresponding speech channel as a speech signal and extract a signal of the remaining channel(s) as a non-speech signal.
- the speech/non-speech separator 121 may extract a speech signal from a signal of a speech channel and may extract a remaining non-speech signal of the speech channel, excluding the extracted speech signal, and a signal of the remaining channel as a non-speech signal.
- many audio signals reproduced in an electronic apparatus are 5.1 channel audio signals or stereo channel audio signals. In a 5.1 channel audio signal, speech is present in a center channel, and in a stereo channel audio signal, speech is present in a signal in which a sound image angle is 0 degrees.
- the speech/non-speech separator 121, upon or based on receiving the 5.1 channel audio signal, may extract a speech from a center channel signal. Since the center channel includes a non-speech signal in addition to a speech signal, the speech/non-speech separator 121 can extract a non-speech signal of the center channel excluding the extracted speech signal, and a signal of the remaining channels (a sub-woofer channel, a front left channel, a front right channel, a rear left channel, and a rear right channel), as a non-speech signal.
- the speech/non-speech separator 121 can extract a speech signal from a signal having a sound image angle of 0 degrees.
- a non-speech signal can be included in a signal having a sound image angle of 0 and therefore, the speech/non-speech separator 121 can extract a non-speech signal of a signal having a sound image angle of 0, excluding the extracted speech signal, and a signal of the remaining sound image angle (i.e., a signal at a different angle other than a zero-degree angle) as a non-speech signal.
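The channel-based separation described above (speech extracted from the center channel; the center-channel residual plus all other channels treated as non-speech) might be sketched as below. `extract_speech` stands in for whatever extraction algorithm or trained model is used, and the channel naming is illustrative.

```python
import numpy as np

def extract_speech(signal):
    # Placeholder for a real speech-extraction algorithm or AI model.
    return 0.8 * signal

def separate_5_1(channels):
    """Split a 5.1 signal (dict of channel name -> 1-D array) into a
    speech signal and a non-speech signal."""
    center = channels["C"]
    speech = extract_speech(center)
    residual = center - speech  # non-speech remaining in the center channel
    others = sum(sig for name, sig in channels.items() if name != "C")
    return speech, residual + others
```

By construction, the speech and non-speech signals sum back to the full downmix of all channels.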
- the speech/non-speech separator 121 may extract a speech signal using various existing speech signal extraction algorithms.
- the speech/non-speech separator 121 can extract a speech signal using an artificial intelligence-based algorithm trained to extract a speech signal.
- the artificial intelligence model can include at least one of a deep learning model, a convolutional neural network (CNN) model, or a recurrent neural network (RNN) model.
- the artificial intelligence model trained to extract the speech signal may be included in a storage of the electronic apparatus 100 to be utilized by the speech/non-speech separator 121, and/or may be included in a server external to the electronic apparatus 100 and utilized by the speech/non-speech separator 121 through the communication of the server and the electronic apparatus 100.
- the speech/non-speech separator 121 may extract a speech signal from the audio signal using a simple noise canceling method or various speech extraction methods based on audio feature.
- the audio feature may include at least one of a time domain feature, such as short term energy (STE), zero crossing rate (ZCR), low short term energy ratio (LSTER), or high zero crossing rate ratio (HZCRR), and a frequency domain feature, such as a Mel-frequency cepstral coefficient (MFCC), total power spectrum, sub-band powers, frequency centroid, bandwidth, pitch frequency, spectrum flux (SF), or the like.
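Two of the time domain features named above are simple to compute per frame; the sketch below shows short term energy and zero crossing rate using the common textbook definitions.

```python
import numpy as np

def short_term_energy(frame):
    """STE: mean squared amplitude of the frame."""
    return float(np.mean(frame ** 2))

def zero_crossing_rate(frame):
    """ZCR: fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return float(np.mean(signs[1:] != signs[:-1]))
```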
- the non-speech signal may denote the rest of the signal except for the extracted speech signal as described above in the entire audio signal.
- the non-speech signal can be extracted through Equation 2 below.
- [Equation 2] Non-speech signal = Audio signal − Extracted speech signal
- the extracted speech signal and the non-speech signal are used to identify intelligibility of speech included in the audio signal by the speech intelligibility analyzer 122.
- the speech intelligibility analyzer 122 may identify the speech intelligibility included in the received audio signal based on at least one of the SNR, the SII, the STI, or the like.
- the speech intelligibility analyzer 122 can identify the SNR measured by the following Equation 3 as the intelligibility of the speech included in the received audio signal.
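Equation 3 is likewise not reproduced here. The sketch below assumes the conventional power-ratio definition of SNR in decibels, with the separated speech as the signal term and the residual as the noise term:

```python
import numpy as np

def snr_db(speech, non_speech, eps=1e-12):
    """SNR in dB: 10*log10(speech power / non-speech power).
    eps guards against a silent residual."""
    p_speech = np.mean(np.square(speech, dtype=np.float64))
    p_noise = np.mean(np.square(non_speech, dtype=np.float64))
    return 10.0 * np.log10((p_speech + eps) / (p_noise + eps))

speech = np.array([1.0, -1.0, 1.0, -1.0])   # power 1.0
noise = np.array([0.1, -0.1, 0.1, -0.1])    # power 0.01
print(snr_db(speech, noise))  # speech power is 100x noise power, i.e., 20 dB
```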
- the speech intelligibility analyzer 122 may identify the speech intelligibility index (SII), which may be measured using the US standard method, as the intelligibility of the speech included in the received audio signal.
- the speech intelligibility index is likewise measured on the basis of the speech signal and the non-speech signal (the remaining signal) separated from the audio signal.
- FIG. 4 is a graph illustrating speech recognition accuracy in accordance with the speech intelligibility index. Specifically, FIG. 4 illustrates the user’s speech recognition accuracy for three audio data sets, such as CID W-22, NU-6, and CST, where a horizontal axis of the graph indicates the SII, and a vertical axis indicates speech recognition accuracy.
- This numerical value (0.6) may be used as the target intelligibility level by a speech intelligibility renderer 124.
- the speech intelligibility analyzer 122 may identify an objective measure reflecting the degree of speech recognition, such as the speech transmission index (STI), as the intelligibility of the speech included in the received audio signal.
- the scene analyzer 123 may analyze the audio signal to obtain scene information. Specifically, the scene analyzer 123 may obtain at least one audio feature for a predetermined number of audio frames of the plurality of audio frames included in the audio signal, and obtain scene information for the predetermined number of audio frames based on the obtained audio features.
- the audio feature may include at least one time domain feature such as short term energy (STE), zero crossing rate (ZCR), low short term energy ratio (LSTER), high zero crossing rate ratio (HZCRR), and/or at least one frequency domain feature such as a Mel-frequency cepstral coefficient (MFCC), total power spectrum, sub-band powers, frequency centroid, bandwidth, pitch frequency, spectrum flux (SF), or the like.
- the scene analyzer 123 may analyze the pair of L, R audio frames to extract at least one of the audio features, and based on the extracted audio features, may identify whether the L, R audio frames include a type of audio (and which type) among the sound effect, shout, and music.
- the scene analyzer 123 may obtain scene information using an artificial intelligence model trained to identify the type of audio included in the audio signal.
- the artificial intelligence model may include at least one of a deep learning model, a convolutional neural network (CNN) model, and a recurrent neural network (RNN) model.
- the scene analyzer 123 can identify whether the L, R audio frames include one of the sound effect, shout, and music audio types, and which type is included, by converting a pair of L, R audio frames into a two-dimensional spectrogram pattern and calculating a probability of matching each audio type using the trained CNN model.
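The spectrogram-plus-CNN decision described above can be sketched as follows. The trained CNN is represented by a stand-in callable (`stub_cnn`), since the actual model weights and architecture are not given in the description; a softmax over its outputs yields the per-type matching probabilities:

```python
import numpy as np

AUDIO_TYPES = ["sound_effect", "shout", "music"]

def spectrogram(frame, n_fft=256, hop=128):
    """Magnitude spectrogram: the 1-D frame becomes a 2-D time-frequency
    pattern that a trained CNN can treat like an image."""
    windows = [frame[i:i + n_fft] * np.hanning(n_fft)
               for i in range(0, len(frame) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(windows, axis=1)).T  # shape: (freq bins, time steps)

def classify_scene(spec, cnn_forward):
    """Scene decision: run the (externally trained) CNN and pick the most
    probable audio type. `cnn_forward` is a stand-in for the real model."""
    logits = cnn_forward(spec)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return AUDIO_TYPES[int(np.argmax(probs))], probs

# Stub standing in for the trained CNN: always favors the last class.
stub_cnn = lambda spec: np.array([0.1, 0.2, 2.0])

frame = np.random.default_rng(0).standard_normal(1024)
label, probs = classify_scene(spectrogram(frame), stub_cnn)
print(label)
```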
- the artificial intelligence model trained to identify the type of audio may also be included in the storage of the electronic apparatus 100, to be used by the scene analyzer 123, like the artificial intelligence model trained to extract a speech signal, and/or may be included in a server existing outside the electronic apparatus 100 and may be used by the scene analyzer 123 through communication between the server and the electronic apparatus 100.
- while the scene analyzer 123 may directly analyze or process the audio signal to obtain scene information, it is understood that one or more other embodiments are not limited thereto.
- the scene analyzer 123 may receive scene information corresponding to the received audio signal from an external server that generates and manages scene information about the audio content.
- the speech intelligibility renderer 124 may control the speech intelligibility included in the audio signal by correcting at least one of the speech signal and remaining signals, by utilizing the speech intelligibility identified by the speech intelligibility analyzer 122 and the scene information obtained by the scene analyzer 123.
- the speech intelligibility renderer 124 may control the gain of the speech signal to control the speech intelligibility.
- the degree to which intelligibility is controlled can be identified through the speech intelligibility information received from the speech intelligibility analyzer 122 and the scene information received from the scene analyzer 123.
- the speech intelligibility index should be about 0.6. If the currently identified speech intelligibility index is 0.4, the index must be raised by 0.2 to obtain the desired level of speech intelligibility.
- How much the gain value should be adjusted to raise the intelligibility index by 0.2 can be predicted or determined by experimentally measuring the numerical change in the speech intelligibility index according to a change in the gain of the speech signal. For example, whenever the gain of the speech signal is increased by 1 dB for various audio signals, the change in the value of the speech intelligibility index may be observed, α and β of Equation 1 described above may be back-calculated from those observations, and the gain adjustment value of the speech signal needed to raise the speech intelligibility index by 0.2 can then be calculated through Equation 1.
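The exact form of Equation 1 is not reproduced in this excerpt. One plausible reading consistent with the description (α and β fitted from 1 dB gain sweeps) is an affine map from the intelligibility gap to a gain change in dB; the α and β values below are purely illustrative:

```python
def gain_adjustment_db(sii_target, sii_measured, alpha, beta):
    """One possible reading of Equation 1 (not reproduced in this excerpt):
    an affine map from the intelligibility gap to a speech-gain change in dB,
    with alpha and beta fitted experimentally by sweeping the speech gain in
    1 dB steps and recording the resulting SII."""
    return alpha * (sii_target - sii_measured) + beta

# Hypothetical fit: each 1 dB of speech gain raised the SII by about 0.05,
# so alpha ~= 20 dB per unit of SII, with no offset (beta = 0).
alpha, beta = 20.0, 0.0
print(gain_adjustment_db(0.6, 0.4, alpha, beta))  # 4 dB to close a 0.2 gap
```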
- the target percentage of speech that a user should be able to recognize is identified from the scene information.
- speech intelligibility can be adjusted by reflecting a manufacturing intention of an audio content manufacturer.
- the speech intelligibility renderer 124 may adjust the gain of the speech signal by a gain adjustment value calculated via Equation 1 to raise the user’s speech recognition accuracy up to 90%.
- the shouting sound during sports has a large impact on a sense of realness that a viewer may feel.
- the target intelligibility may be set to an appropriate number through experimentation.
- the appropriate value obtained through the experimentation can be smaller than the target intelligibility used when the type of audio is a sound effect, although it is understood that one or more other embodiments are not limited thereto.
- if the target intelligibility is set to a speech intelligibility index of 0.6, the speech of a commentator and an announcer can be clear, but the remaining signals, in which the shouting exists, can be relatively small, and thus a viewer may not sufficiently enjoy the sense of realness.
- the target intelligibility can therefore be set to around the speech intelligibility index of 0.5 to maintain both the appropriate intelligibility and the sense of realness.
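The per-scene targets discussed above can be organized as a simple lookup. The 0.6 and 0.5 entries come from the description; the sound effect and music entries are assumptions for the sketch:

```python
# Illustrative targets only: 0.6 (~90% recognition per FIG. 4) and 0.5 (shout)
# come from the description above; the other entries are assumptions.
TARGET_SII = {
    "default": 0.6,       # dialogue-focused content
    "shout": 0.5,         # trade some clarity for the sense of realness
    "sound_effect": 0.6,  # assumed: at or above the default
    "music": None,        # assumed: leave the producer's mix untouched
}

def target_intelligibility(scene_type):
    """Pick the per-scene target SII, falling back to the default."""
    return TARGET_SII.get(scene_type, TARGET_SII["default"])

print(target_intelligibility("shout"))       # lower target during shouting scenes
print(target_intelligibility("commentary"))  # unknown types fall back to the default
```

`None` here marks "do not force an adjustment", matching the idea that music content is corrected minimally or not at all.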
- even if the speech intelligibility value measured by the speech intelligibility analyzer 122 is low, the gain of the speech signal can be limited to a maximum adjustment of 3 dB, or the gain can be left unadjusted, so that the intention of the music content producer is reflected as much as possible.
- the speech intelligibility renderer 124 controls the gain of the speech signal to control the speech intelligibility, although it is understood that one or more other embodiments are not limited thereto.
- the speech intelligibility renderer 124 may control the gain of the non-speech signal, or may control the intelligibility of the speech by utilizing a technique such as a dynamic range compressor, a linear prediction coefficient (LPC) filter, a harmonic enhancer, or the like.
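As one of the alternative techniques named above, a dynamic range compressor reduces the level variation of a signal. A minimal static (sample-by-sample) sketch, without the attack/release smoothing a production compressor would add:

```python
import numpy as np

def compress(signal, threshold_db=-20.0, ratio=4.0):
    """Static dynamic range compressor: samples whose level exceeds the
    threshold are attenuated so that overshoot above the threshold is
    reduced by a factor of `ratio`."""
    x = np.asarray(signal, dtype=np.float64)
    level_db = 20.0 * np.log10(np.maximum(np.abs(x), 1e-12))
    over = np.maximum(level_db - threshold_db, 0.0)
    gain_db = -over * (1.0 - 1.0 / ratio)  # cut the overshoot by 1 - 1/ratio
    return x * 10.0 ** (gain_db / 20.0)

loud = np.array([1.0, 0.05])  # a 0 dBFS sample and a ~-26 dBFS sample
out = compress(loud)
# The loud sample is pulled down; the quiet one is below threshold and unchanged.
print(out)
```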
- the speech intelligibility renderer 124 may adjust the speech intelligibility included in the received audio signal and may generate the audio signal having the adjusted speech intelligibility as an output signal.
- the generated output signal may be output through at least one speaker.
- the processor 120 may include a central processing unit (CPU), a micro controller unit (MCU), a micom (micro-processor), an electronic control unit (ECU), an application processor (AP), and/or other electronic units (hereinafter, "CPU, etc.") capable of processing various calculations and generating a control signal, to control operations of the speech/non-speech separator 121, the speech intelligibility analyzer 122, the scene analyzer 123, and the speech intelligibility renderer 124.
- the CPU may be provided in a form integrated with at least one or a part of the speech/non-speech separator 121, the speech intelligibility analyzer 122, the scene analyzer 123, or the speech intelligibility renderer 124.
- the speech/non-speech separator 121, the speech intelligibility analyzer 122, the scene analyzer 123, and the speech intelligibility renderer 124 can be integrated into one or more functional modules and form the processor 120.
- the speech intelligibility analyzer 122, the scene analyzer 123, and the speech intelligibility renderer 124 may be integrated into a single signal processing module, or the speech/non-speech separator 121, the speech intelligibility analyzer 122, the scene analyzer 123, and the speech intelligibility renderer 124 may be integrated into a single signal processing module.
- the signal processing module may be, but is not limited to, a digital signal processor (DSP).
- FIG. 5 is a detailed block diagram of an electronic device 100 according to an embodiment.
- the electronic apparatus 100 may include a processor 120, a memory 130, a display 140, a user inputter 150, a communicator 180, and an audio outputter 170.
- some components of the electronic apparatus 100 shown in FIG. 5 may be omitted and/or other components may be included.
- the audio outputter 170 is configured to output an audio signal as an output signal.
- the audio outputter 170 may output an audio signal adjusted by the processor 120 as described above.
- the audio outputter 170 may include at least one speaker and/or a terminal or interface for outputting an audio signal to an external speaker or audio output device.
- the communicator 180 is configured to communicate with an external device.
- the communicator 180 may include a wireless communicator 181, a wired communicator 182, and an input interface 183.
- the wireless communicator 181 may communicate with the external broadcast transmitting station 1, a satellite 2, a content providing server 3, and other terminal devices using a wireless communication technology and/or a mobile communication technology.
- the wireless communication technologies include, for example, Bluetooth, Bluetooth low energy, CAN communication, Wi-Fi, Wi-Fi Direct, ultra-wide band (UWB), Zigbee, infrared data association (IrDA), near field communication (NFC), or the like, and the mobile communication technology may include 3GPP, Wi-Max, long term evolution (LTE), 5th generation (5G), or the like.
- the wireless communicator 181 may receive audio content from another terminal device or a server, and may transmit the received audio content to the processor 120.
- the wireless communicator 181 may be implemented using an antenna, a communication chip, a substrate, etc., which can transmit electromagnetic waves to the outside or receive electromagnetic waves transmitted from the outside.
- the wired communicator 182 can communicate with the external broadcast transmission station 1, the satellite 2, the content providing server 3, and other terminal devices on the basis of a wired communication network.
- the wired communication network may be implemented using a physical cable such as, for example, a pair cable, a coaxial cable, an optical fiber cable, or an Ethernet cable.
- the wired communicator 182 may receive audio content from another terminal device or a server and transmit the received audio content to the processor 120.
- the audio output device 100 may include only the wireless communicator 181 or only the wired communicator 182.
- the audio output device 100 may include an integrated communicator that supports both a wireless connection by the wireless communicator 181 and a wired connection by the wired communicator 182.
- the input interface 183 may be connected to another device, e.g., an external storage device, provided separately from the audio output device 100, and may receive audio content from another device and transmit or provide the received audio content to the processor 120.
- the input interface 183 may be a universal serial bus (USB) terminal, or may include at least one of various interface terminals such as a high definition multimedia interface (HDMI) terminal, a Thunderbolt terminal, or the like.
- the audio output unit 170 including at least one speaker is directly connected to the processor 120 of the electronic apparatus 100 (specifically, the speech intelligibility renderer 124 included in the processor 120) and embedded in the electronic apparatus 100, but it is understood that one or more other embodiments are not limited thereto.
- the output signal generated by the processor 120 may be output through a separate speaker installed or provided outside the electronic apparatus 100.
- a separate speaker installed outside the electronic apparatus 100 can be connected to the electronic apparatus 100 through the communicator 180, and the output signal generated by the processor 120 can be output to the separate speaker installed outside the electronic apparatus 100 through the communicator 180.
- the communicator 180 may communicate with an external server generating and managing scene information for audio content, an external server generating and managing an artificial intelligence model trained to extract a speech signal, and/or an external server generating and managing a trained artificial intelligence model to identify the type of audio included in the audio signal, and may receive scene information or various artificial intelligence models from an external server.
- the memory 130 may temporarily or non-temporarily store the audio content and may forward the audio content to the processor 120 in the form of an audio signal in accordance with the call of the processor 120.
- the memory 130 may store various information necessary for operation, processing, or control operations of the processor 120 in an electronic format.
- the memory 130 may store all or a portion of various data, applications, filters, algorithms, or the like, necessary for operation of the processor 120, and may provide the same to the processor 120 as needed.
- the application may be obtained through an electronic software distribution network accessible through the wireless communicator 181 or the wired communicator 182.
- the memory 130 may include, for example, at least one of a main memory device and an auxiliary memory device.
- the main memory may be implemented using semiconductor storage media such as read only memory (ROM) and/or random access memory (RAM).
- the ROM may include, for example, a conventional ROM, EPROM, EEPROM, and/or mask-ROM.
- the RAM may include, for example, DRAM and/or SRAM.
- the auxiliary memory device may be implemented using at least one storage medium capable of permanently or semi-permanently storing data, such as a flash memory device, a secure digital (SD) memory card, a solid state drive (SSD), a hard disk drive (HDD), a magnetic drum, a compact disc (CD), optical media such as a digital video disc (DVD) or a laser disc, a magnetic tape, a magneto-optical disk, and/or a floppy disk.
- the inputter 110 is configured to receive an audio signal and provide the same to the processor 120.
- an audio signal may be provided to the processor 120 through the communicator 180 or the memory 130.
- the communicator 180 and the memory 130 may correspond to the inputter 110 as described in FIG. 2.
- the display 140 displays various images.
- the processor 120 can reproduce the video through the display 140.
- the display 140 may include various types of display panels, such as, but not limited to, a liquid crystal display (LCD) panel, an organic light emitting diode (OLED) panel, a plasma display panel (PDP) panel, an inorganic LED panel, a micro LED panel, and the like.
- the display 140 may constitute a touch screen with a touch panel.
- the user inputter 150 is configured to receive various user inputs.
- the user inputter 150 may include various buttons or touch panels, but is not limited thereto.
- the processor 120 controls overall operations of the audio output device 100.
- the processor 120 may perform operations of the electronic apparatus 100, the processor 120, or the functional blocks of the processor 120 as described above with reference to FIGS. 1 to 4.
- the processor 120 may decode the audio content and convert the content into an uncompressed format.
- decoding refers to a process of restoring an audio signal compressed by an audio compression format such as an MPEG layer-3 (MP3), advanced audio coding (AAC), audio codec-3 (AC-3), digital theater system (DTS), free lossless audio codec (FLAC), windows media audio (WMA), or the like, into an uncompressed audio signal. If the audio content is not compressed, this decoding process may be omitted.
- the restored audio signal may include one or more channels.
- FIG. 6 is a flowchart illustrating a method of controlling an electronic apparatus 100 according to an embodiment.
- the electronic apparatus 100 may receive an audio signal in operation S610 and identify the speech intelligibility included in the audio signal based on the received audio signal in operation S620.
- the electronic apparatus 100 may calculate the speech intelligibility based on a speech signal and a non-speech signal other than the speech signal, included in the received audio signal.
- the electronic apparatus 100 may extract a speech signal included in the audio signal using an artificial intelligence model trained to extract a speech signal included in the audio signal, and may extract a remaining signal except the extracted speech signal from the audio signal as a non-speech signal.
- the electronic apparatus 100 can adjust the audio signal so that the identified speech intelligibility becomes the target intelligibility in operation S630.
- the target intelligibility is a value set based on scene information related to the type of audio included in the audio signal, and the type of audio included in the scene information can include at least one of a sound effect, shout, and music.
- the electronic apparatus 100 may obtain at least one audio feature for the audio signal and obtain scene information based on the obtained at least one audio feature. Alternatively, the electronic apparatus 100 may obtain scene information using an artificial intelligence model trained to identify the type of audio included in the audio signal.
- the target intelligibility can be set differently for each type of audio.
- the target intelligibility can be set relatively higher if the type of audio is a sound effect, but is not limited thereto.
- the intelligibility described above may be any one of a signal-to-noise ratio (SNR) of the speech signal and the non-speech signal included in the audio signal, and a speech intelligibility index (SII) based on the speech signal and the non-speech signal.
- the electronic apparatus 100 can adjust the gain of the speech signal by the difference between the target intelligibility and the identified speech intelligibility to correct the audio signal.
- the electronic apparatus 100 may calculate a gain adjustment value based on Equation 1 below, and adjust the gain of the speech signal by the calculated gain adjustment value to correct the audio signal.
- the SII target is the target intelligibility
- the SII measurement is the identified speech intelligibility
- α and β are constant values experimentally calculated from the change in the value of the speech intelligibility index according to a gain change of the speech signal.
- speech intelligibility can be more accurately controlled.
- optimal speech intelligibility can be adjusted by reflecting the producing intention of the audio content producer.
- the user can be provided with an optimal sound experience.
- Various embodiments may be implemented in software including instructions stored in a machine-readable storage media readable by a machine (e.g., a computer).
- the apparatus is a device that calls stored instructions from a storage medium and operates according to the called instructions, and can include the electronic apparatuses 100, 100-1 to 100-4 according to embodiments.
- the processor may perform a function corresponding to the instructions directly or by using other components under the control of the processor.
- the instructions may include a code generated by a compiler or a code executable by an interpreter.
- a machine-readable storage medium may be provided in the form of a non-transitory storage medium.
- non-transitory only denotes that a storage medium is not limited to a signal but is tangible, and does not distinguish the case in which data is semi-permanently stored in a storage medium from the case in which data is temporarily stored in a storage medium.
- the computer program product may be traded as a product between a seller and a consumer.
- the computer program product may be distributed in the form of a machine-readable storage medium (e.g., a compact disc read only memory (CD-ROM)), through an application store (e.g., PLAY STORETM and APP STORETM), or online (e.g., downloaded or uploaded) directly between two user devices (e.g., smartphones).
- at least a portion of the computer program product may be at least temporarily stored or temporarily generated in a server of the manufacturer, a server of the application store, or a machine-readable storage medium such as memory of a relay server.
- the respective elements (e.g., a module or a program) mentioned above may include a single entity or a plurality of entities. Furthermore, at least one element or operation from among the corresponding elements mentioned above may be omitted, or at least one other element or operation may be added. Alternatively or additionally, a plurality of components (e.g., modules or programs) may be combined to form a single entity. In such a case, the integrated entity may perform the functions of each of the plurality of elements in the same or a similar manner as the corresponding element did before integration.
- operations executed by the module, a program module, or other elements may be executed consecutively, in parallel, repeatedly, or heuristically; at least some operations may be executed in a different order or omitted, or another operation may be added thereto.
Abstract
An electronic apparatus and a control method thereof are disclosed. The electronic apparatus includes an inputter and a processor configured to obtain, based on receiving an audio signal via the inputter, a speech intelligibility for the audio signal, and to modify the audio signal so that the speech intelligibility becomes a target intelligibility that is set based on scene information regarding a type of audio content included in the audio signal, the type of audio content including at least one of a sound effect, a shout, music, and speech.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| KR1020190162644A KR102845224B1 (ko) | 2019-12-09 | 2019-12-09 | 전자 장치 및 이의 제어 방법 |
| KR10-2019-0162644 | 2019-12-09 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021118106A1 true WO2021118106A1 (fr) | 2021-06-17 |
Family
ID=73747876
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/KR2020/016638 Ceased WO2021118106A1 (fr) | 2019-12-09 | 2020-11-24 | Appareil électronique et procédé de commande associé |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US12051437B2 (fr) |
| EP (1) | EP3836140B1 (fr) |
| KR (1) | KR102845224B1 (fr) |
| CN (1) | CN113038344B (fr) |
| WO (1) | WO2021118106A1 (fr) |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113993036B (zh) * | 2021-10-19 | 2023-05-05 | 广州番禺巨大汽车音响设备有限公司 | 基于高清模式的具备hdmi arc功能音响的控制方法、装置 |
| CN118175377A (zh) * | 2022-01-27 | 2024-06-11 | 海信视像科技股份有限公司 | 显示设备及音频处理方法 |
| EP4571741A4 (fr) | 2022-08-18 | 2025-07-30 | Samsung Electronics Co Ltd | Procédé pour séparer une source sonore cible d'une source sonore mélangée et son dispositif électronique |
| WO2025014685A1 (fr) * | 2023-07-07 | 2025-01-16 | Dolby Laboratories Licensing Corporation | Compensation de bruit environnemental dans une téléconférence |
| US20250078859A1 (en) * | 2023-08-29 | 2025-03-06 | Bose Corporation | Source separation based speech enhancement |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6889186B1 (en) * | 2000-06-01 | 2005-05-03 | Avaya Technology Corp. | Method and apparatus for improving the intelligibility of digitally compressed speech |
| US20080255829A1 (en) * | 2005-09-20 | 2008-10-16 | Jun Cheng | Method and Test Signal for Measuring Speech Intelligibility |
| KR20120064105A (ko) * | 2009-09-14 | 2012-06-18 | 에스알에스 랩스, 인크. | 적응적 음성 가해성 처리 시스템 |
| US9536536B2 (en) * | 2012-05-04 | 2017-01-03 | 2236008 Ontario Inc. | Adaptive equalization system |
| US9578436B2 (en) * | 2014-02-20 | 2017-02-21 | Bose Corporation | Content-aware audio modes |
Family Cites Families (32)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| GB9714001D0 (en) * | 1997-07-02 | 1997-09-10 | Simoco Europ Limited | Method and apparatus for speech enhancement in a speech communication system |
| US7953219B2 (en) * | 2001-07-19 | 2011-05-31 | Nice Systems, Ltd. | Method apparatus and system for capturing and analyzing interaction based content |
| AU2007210334B2 (en) * | 2006-01-31 | 2010-08-05 | Telefonaktiebolaget Lm Ericsson (Publ). | Non-intrusive signal quality assessment |
| EP2118885B1 (fr) | 2007-02-26 | 2012-07-11 | Dolby Laboratories Licensing Corporation | Enrichissement vocal en audio de loisir |
| KR20110062594A (ko) | 2009-12-03 | 2011-06-10 | (주)웰리브솔루션 | 음성 명료도 향상방법 및 그 장치, 음성통신 단말기 |
| US20140278392A1 (en) | 2013-03-12 | 2014-09-18 | Motorola Mobility Llc | Method and Apparatus for Pre-Processing Audio Signals |
| CN107093991B (zh) | 2013-03-26 | 2020-10-09 | 杜比实验室特许公司 | 基于目标响度的响度归一化方法和设备 |
| CN104078050A (zh) | 2013-03-26 | 2014-10-01 | 杜比实验室特许公司 | 用于音频分类和音频处理的设备和方法 |
| US9031838B1 (en) * | 2013-07-15 | 2015-05-12 | Vail Systems, Inc. | Method and apparatus for voice clarity and speech intelligibility detection and correction |
| EP2884766B1 (fr) | 2013-12-13 | 2018-02-14 | GN Hearing A/S | Prothèse auditive d'apprentissage de localisation |
| US9648430B2 (en) | 2013-12-13 | 2017-05-09 | Gn Hearing A/S | Learning hearing aid |
| WO2015097826A1 (fr) | 2013-12-26 | 2015-07-02 | 株式会社東芝 | Dispositif électronique, procédé de commande, et programme |
| CN105336341A (zh) * | 2014-05-26 | 2016-02-17 | 杜比实验室特许公司 | 增强音频信号中的语音内容的可理解性 |
| WO2016126767A1 (fr) * | 2015-02-03 | 2016-08-11 | Dolby Laboratories Licensing Corporation | Segmentation de conférence selon une dynamique de conversation |
| CN105280195B (zh) | 2015-11-04 | 2018-12-28 | 腾讯科技(深圳)有限公司 | 语音信号的处理方法及装置 |
| US9812149B2 (en) | 2016-01-28 | 2017-11-07 | Knowles Electronics, Llc | Methods and systems for providing consistency in noise reduction during speech and non-speech periods |
| KR102417047B1 (ko) | 2016-06-24 | 2022-07-06 | 삼성전자주식회사 | 잡음 환경에 적응적인 신호 처리방법 및 장치와 이를 채용하는 단말장치 |
| KR101886775B1 (ko) | 2016-10-31 | 2018-08-08 | 광운대학교 산학협력단 | Ptt 기반 음성 명료성 향상 장치 및 방법 |
| KR102410820B1 (ko) * | 2017-08-14 | 2022-06-20 | 삼성전자주식회사 | 뉴럴 네트워크를 이용한 인식 방법 및 장치 및 상기 뉴럴 네트워크를 트레이닝하는 방법 및 장치 |
| CN107564538A (zh) | 2017-09-18 | 2018-01-09 | 武汉大学 | 一种实时语音通信的清晰度增强方法及系统 |
| EP3471440B1 (fr) * | 2017-10-10 | 2024-08-14 | Oticon A/s | Dispositif auditif comprenant un estimateur d'intelligibilité de la parole pour influencer un algorithme de traitement |
| WO2019084214A1 (fr) * | 2017-10-24 | 2019-05-02 | Whisper.Ai, Inc. | Séparation et recombinaison audio pour l'intelligibilité et le confort |
| KR101986905B1 (ko) | 2017-10-31 | 2019-06-07 | 전자부품연구원 | 신호 분석 및 딥 러닝 기반의 오디오 음량 제어 방법 및 시스템 |
| KR101958664B1 (ko) | 2017-12-11 | 2019-03-18 | (주)휴맥스 | 멀티미디어 콘텐츠 재생 시스템에서 다양한 오디오 환경을 제공하기 위한 장치 및 방법 |
| US10991379B2 (en) * | 2018-06-22 | 2021-04-27 | Babblelabs Llc | Data driven audio enhancement |
| US10923141B2 (en) | 2018-08-06 | 2021-02-16 | Spotify Ab | Singing voice separation with deep u-net convolutional networks |
| CN109065067B (zh) | 2018-08-16 | 2022-12-06 | 福建星网智慧科技有限公司 | 一种基于神经网络模型的会议终端语音降噪方法 |
| CN109448755A (zh) | 2018-10-30 | 2019-03-08 | 上海力声特医学科技有限公司 | 人工耳蜗听觉场景识别方法 |
| KR102691543B1 (ko) | 2018-11-16 | 2024-08-02 | 삼성전자주식회사 | 오디오 장면을 인식하는 전자 장치 및 그 방법 |
| US11017778B1 (en) * | 2018-12-04 | 2021-05-25 | Sorenson Ip Holdings, Llc | Switching between speech recognition systems |
| CN109859768A (zh) * | 2019-03-12 | 2019-06-07 | 上海力声特医学科技有限公司 | 人工耳蜗语音增强方法 |
| KR102712458B1 (ko) * | 2019-12-09 | 2024-10-04 | 삼성전자주식회사 | 오디오 출력 장치 및 오디오 출력 장치의 제어 방법 |
- 2019
  - 2019-12-09 KR KR1020190162644 patent/KR102845224B1/ko active Active
- 2020
  - 2020-11-24 WO PCT/KR2020/016638 patent/WO2021118106A1/fr not_active Ceased
  - 2020-12-02 US US17/109,753 patent/US12051437B2/en active Active
  - 2020-12-07 EP EP20212328.7A patent/EP3836140B1/fr active Active
  - 2020-12-09 CN CN202011426799.2A patent/CN113038344B/zh active Active
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6889186B1 (en) * | 2000-06-01 | 2005-05-03 | Avaya Technology Corp. | Method and apparatus for improving the intelligibility of digitally compressed speech |
| US20080255829A1 (en) * | 2005-09-20 | 2008-10-16 | Jun Cheng | Method and Test Signal for Measuring Speech Intelligibility |
| KR20120064105 (ko) * | 2009-09-14 | 2012-06-18 | SRS Labs, Inc. | Adaptive voice intelligibility processing system |
| US9536536B2 (en) * | 2012-05-04 | 2017-01-03 | 2236008 Ontario Inc. | Adaptive equalization system |
| US9578436B2 (en) * | 2014-02-20 | 2017-02-21 | Bose Corporation | Content-aware audio modes |
Also Published As
| Publication number | Publication date |
|---|---|
| CN113038344B (zh) | 2025-01-07 |
| EP3836140B1 (fr) | 2025-10-15 |
| KR102845224B1 (ko) | 2025-08-12 |
| CN113038344A (zh) | 2021-06-25 |
| US20210174821A1 (en) | 2021-06-10 |
| US12051437B2 (en) | 2024-07-30 |
| EP3836140A1 (fr) | 2021-06-16 |
| KR20210072384A (ko) | 2021-06-17 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2021118106A1 (fr) | | Electronic apparatus and control method therefor |
| WO2021118107A1 (fr) | | Audio output apparatus and method for controlling same |
| CN1973536A (zh) | | Video-audio synchronization |
| WO2018056780A1 (fr) | | Method and apparatus for processing a binaural audio signal |
| WO2018056624A1 (fr) | | Electronic device and control method therefor |
| WO2019107868A1 (fr) | | Apparatus and method for outputting an audio signal, and display apparatus using the same |
| WO2020050509A1 (fr) | | Speech synthesis device |
| WO2019139301A1 (fr) | | Electronic device and subtitle expression method thereof |
| WO2018012727A1 (fr) | | Display apparatus and recording medium |
| WO2022154641A1 (fr) | | Method and device for synchronizing an audio signal and a video signal of multimedia content |
| CN103716568A (zh) | | Television volume adjustment method and system |
| WO2013187688A1 (fr) | | Audio signal processing method and audio signal processing apparatus adopting the same |
| WO2018066731A1 (fr) | | Terminal device and method for performing a call function |
| CN114845212B (zh) | | Volume optimization method and apparatus, electronic device, and readable storage medium |
| EP3607760A1 (fr) | | Electronic apparatus and control method thereof |
| WO2020122554A1 (fr) | | Display apparatus and control method therefor |
| WO2021125784A1 (fr) | | Electronic device and control method thereof |
| CN117133296A (zh) | | Display device and mixing method for multi-channel speech signals |
| WO2024167167A1 (fr) | | Method and system for signal normalization using loudness metadata for audio processing |
| CN106293607A (zh) | | Method and system for automatically switching audio output mode |
| WO2021172713A1 (fr) | | Electronic device and control method thereof |
| WO2024080723A1 (fr) | | Electronic device and audio signal processing method |
| WO2025095410A1 (fr) | | Audio output apparatus and method therefor |
| CN111050261A (zh) | | Hearing compensation method and device, and computer-readable storage medium |
| WO2022225180A1 (fr) | | Electronic device for processing audio signals and control method thereof |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20899748; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 20899748; Country of ref document: EP; Kind code of ref document: A1 |