
EP4167229A1 - Audio masking of speakers (Masquage audio de locuteurs) - Google Patents

Audio masking of speakers (Masquage audio de locuteurs)

Info

Publication number
EP4167229A1
Authority
EP
European Patent Office
Prior art keywords
signal
speech
masking
spectral
speech signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
EP22201974.7A
Other languages
German (de)
English (en)
Other versions
EP4167229C0 (fr)
EP4167229B1 (fr)
Inventor
Thomas Stottan
Thomas Hatheier
Alois Sontacchi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Audio Mobil Elektronik GmbH
Original Assignee
Audio Mobil Elektronik GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Audio Mobil Elektronik GmbH filed Critical Audio Mobil Elektronik GmbH
Priority to CN202280070252.4A (CN118140266A)
Priority to US18/702,209 (US20250239248A1)
Priority to KR1020247014966A (KR20240089343A)
Priority to JP2024524500A (JP2024542967A)
Priority to PCT/EP2022/078926 (WO2023066908A1)
Priority to EP22803245.4A (EP4420115A1)
Publication of EP4167229A1
Application granted
Publication of EP4167229C0
Publication of EP4167229B1
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K: SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K11/00: Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/16: Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/175: using interference effects; Masking sound
    • G10K11/1752: Masking
    • G10K11/1754: Speech masking

Definitions

  • the present disclosure relates to the generation of a masking signal for speech in a zone-based audio system.
  • Modern means of communication and their ever-increasing coverage enable communication to take place almost anywhere, for example in the form of telephone calls.
  • In public spaces, other people can often overhear such conversations and understand their content. This is a problem in particular for confidential private or business calls.
  • Such a scenario exists in public transportation, such as trains or planes, but also in private vehicles, such as taxis or rented limousines.
  • In these settings, other people are present at fixed places, for example in assigned seats.
  • Such seats often have an associated audio system or at least components thereof.
  • speakers for individual playback of audio content can be provided in these seats, for example integrated into headrests, which is also referred to as a zone-based audio system.
  • This document deals with the technical task of generating a masking signal in a zone-based audio system that reduces unwanted eavesdropping on a conversation and at the same time does not cause any unpleasant interference.
  • a method for masking a speech signal in a zone-based audio system includes capturing a speech signal to be masked in an audio zone, for example by means of one or more conveniently placed microphones, which can be arranged, for example, in a neckrest of a seat.
  • the speech signal can originate from the local speaker of a telephone conversation or it can belong to a conversation between people present.
  • the captured speech signal is then transformed into spectral bands, which can be done using an FFT and mel filters, for example.
  • the method also includes the swapping of spectral values from at least two spectral bands, as a result of which the spectral structure of the speech signal is changed without its overall energy content being changed.
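The two steps above can be sketched as follows. This is only an illustrative sketch: the band count, FFT size, sample rate and the particular permutation are assumptions for the example, not values fixed by this disclosure.

```python
# Sketch: pool a block's power spectrum into mel-spaced bands, then
# permute band values among nearby bands (all parameters assumed).
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_energies(frame, fs, n_bands=24, n_fft=512):
    """Pool the power spectrum of one block into mel-spaced bands."""
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_bands + 1))
    return np.array([spec[(freqs >= edges[b]) & (freqs < edges[b + 1])].sum()
                     for b in range(n_bands)])

fs = 16000
frame = np.random.default_rng(0).standard_normal(1024)
bands = mel_band_energies(frame, fs)

# Swap values among nearby bands, e.g. band 2 -> 4 -> 5 -> 3 -> 2
# (a hypothetical cycle); swapped[i] = bands[perm[i]].
perm = np.arange(24)
perm[[2, 3, 4, 5]] = [3, 5, 2, 4]
swapped = bands[perm]
```

Because the swap is a pure permutation, the total energy of the block is unchanged while its spectral structure is.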
  • a noise signal is generated based on the swapped spectral values.
  • Although the generated noise signal shows a certain similarity to the spectrum of the speech signal, it does not match it completely, since the spectral structure of the speech signal is no longer fully preserved after the bands have been swapped.
  • Such a noise signal with a similar but not the same spectrum as the speech signal is well suited as a masking signal for the speech signal.
  • any number of bands can be interchanged (e.g. all of them), with more variation in the noise spectrum being produced as more bands are interchanged.
  • the noise signal is output as a masking signal with the lowest possible energy input in another audio zone in order to make it more difficult for a person at the listening location to overhear the conversation by reducing speech intelligibility for this person.
  • Generating a noise signal based on the swapped spectral values may include generating a broadband noise signal, e.g. by a noise generator, and transforming the generated noise signal into the frequency domain. Furthermore, the frequency representation of the noise signal can be multiplied by a frequency representation of the speech signal, taking into account the swapped spectral values. The multiplication in the frequency domain generates a noise spectrum that essentially corresponds to that of the speech signal after the spectral bands have been swapped, i.e. it is similar to the speech spectrum but not identical. A similar effect can also be achieved by convolution in the time domain.
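A minimal sketch of the frequency-domain multiplication step, assuming the swapped speech envelope is already available on the FFT grid (the envelope here is a random placeholder, and the block length is an assumption):

```python
# Sketch: shape broadband noise with a (swapped) speech spectral
# envelope by point-by-point multiplication in the frequency domain.
import numpy as np

rng = np.random.default_rng(1)
n_fft = 512

noise = rng.standard_normal(n_fft)                    # broadband noise block
envelope = np.abs(rng.standard_normal(n_fft // 2 + 1)) + 0.1  # placeholder envelope

noise_spec = np.fft.rfft(noise)
shaped_spec = noise_spec * envelope                   # point-by-point multiplication
shaped = np.fft.irfft(shaped_spec, n_fft)             # back to the time domain
```

The equivalent time-domain operation would be a convolution of the noise block with the impulse response corresponding to the envelope.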
  • the frequency representation of the speech signal can be generated by interpolating the spectral values of the bands (for example in the mel range in the present case) after swapping the spectral values.
  • the interpolation generates the necessary values at the frequency base values for multiplication with the noise spectrum from the (relatively few) spectral values of the bands.
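The interpolation step can be sketched with `numpy.interp`, which maps the few per-band values onto the dense FFT frequency grid (band count, band center placement, sample rate and FFT size are assumptions for the example):

```python
# Sketch: interpolate 24 mel-band values to the FFT frequency grid
# before multiplying with the noise spectrum (parameters assumed).
import numpy as np

fs, n_fft, n_bands = 16000, 512, 24

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# hypothetical band centers, equidistant on the mel scale
centers_hz = mel_to_hz(np.linspace(hz_to_mel(50.0), hz_to_mel(fs / 2.0), n_bands))
band_values = np.random.default_rng(2).random(n_bands)  # e.g. swapped loudness values

fft_freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
dense = np.interp(fft_freqs, centers_hz, band_values)   # linear interpolation
```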
  • the method can further include estimating a background noise spectrum (preferably at the listening location) and comparing spectral values of the speech signal with the background noise spectrum.
  • the spectral values are preferably (but not necessarily) compared in the range of the spectral bands (e.g. mel bands), which means that the background noise spectrum must also be represented in the spectral bands.
  • only spectral values of the speech signal that are greater than the corresponding spectral values of the background noise spectrum (or that are in a predetermined relationship thereto) need be taken into account for the further procedure (e.g. the above-mentioned interpolation).
  • Spectral parts of the speech signal that are already hidden by the background noise do not need to be taken into account for generating the masking signal and can be masked out (e.g. by being set to zero).
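The comparison and zeroing can be sketched as follows; the band values and the margin parameter are illustrative, not taken from the disclosure:

```python
# Sketch: zero out speech bands already covered by background noise.
import numpy as np

def gate_by_background(speech_bands, background_bands, margin=1.0):
    """Keep only speech bands exceeding margin * background energy."""
    out = speech_bands.copy()
    out[speech_bands <= margin * background_bands] = 0.0
    return out

speech = np.array([4.0, 0.5, 3.0, 0.2])       # hypothetical per-band energies
background = np.array([1.0, 1.0, 1.0, 1.0])   # hypothetical background spectrum
gated = gate_by_background(speech, background)
```

Only the bands that survive this gate contribute to the masking signal, which keeps the injected energy as low as possible.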
  • the background noise can be taken into account either before or after the swapping of spectral values.
  • if it is taken into account before the swapping, the spectral bands being compared still match exactly and the background noise is accounted for correctly.
  • if it is taken into account after the swapping, additional variation is introduced into the noise spectrum, which can lead to increased masking. Either way, this enables a masking signal adapted to the background or the environment, which can be output with the lowest possible energy input in the audio zone of the listener.
  • the transformation of the recorded speech signal into spectral bands can take place for blocks of the speech signal and by means of a mel filter bank.
  • a temporal smoothing of the spectral values for the mel bands can additionally be applied, e.g. in the form of a moving average.
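The moving-average smoothing over successive blocks can be sketched as below; the window length and array sizes are assumptions:

```python
# Sketch: moving-average smoothing of per-band values over blocks.
import numpy as np

def smooth_bands(band_track, win=5):
    """band_track: blocks x bands; moving average along the block axis."""
    kernel = np.ones(win) / win
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, band_track)

track = np.tile(np.arange(4.0), (10, 1))   # 10 blocks, 4 constant bands
smoothed = smooth_bands(track)
```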
  • the noise signal can be represented spatially during output by means of a multi-channel (i.e. at least 2-channel) reproduction.
  • a multi-channel representation of the masking signal which enables a spatial reproduction of the masking signal, can be generated.
  • this can preferably be done by multiplying with binaural spectra of an acoustic transfer function.
  • the spatial reproduction increases the effect of the masking signal in masking the speech at the listening location, especially if the noise signal in the other audio zone is emitted spatially in such a way that it appears to come from the direction of the speaker of the speech signal to be masked.
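The multiplication with binaural spectra can be sketched as below. The "binaural spectra" here are crude placeholder gains and delays (a simple interaural level and time difference), not measured transfer functions; sizes and values are assumptions:

```python
# Sketch: spatialize the noise signal by multiplying its spectrum
# with left/right binaural transfer-function spectra (placeholders).
import numpy as np

n_fft = 512
fs = 16000
noise_spec = np.fft.rfft(np.random.default_rng(3).standard_normal(n_fft))
freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)

itd = 0.0004   # assumed interaural time difference in seconds (source left)
h_left = 1.0 * np.exp(-2j * np.pi * freqs * 0.0)    # near ear: full level
h_right = 0.6 * np.exp(-2j * np.pi * freqs * itd)   # far ear: attenuated, delayed

out_left = np.fft.irfft(noise_spec * h_left, n_fft)
out_right = np.fft.irfft(noise_spec * h_right, n_fft)
```

With measured binaural spectra, the perceived direction could be set to that of the speaker to be masked, as described above.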
  • the method can include determining a point in time in the speech signal that is relevant for speech intelligibility (e.g. the presence of consonants in the speech signal) and generating a suitable deflection signal for the specific point in time.
  • the deflection signal can then be output at the specific point in time as a further masking signal in the other audio zone, as a result of which the conversation content is selectively additionally concealed (masking) in the event of speech onsets. Since the distraction signal is only emitted at certain relevant points in time, it does not significantly increase the overall sound level and does not lead to any significant impairment.
  • the point in time relevant for speech intelligibility can be determined using extreme values (e.g. local maxima, onsets) of a spectral function of the speech signal, with the spectral function being determined based on an addition of spectral values over the frequency axis.
  • the spectral values can be smoothed beforehand in the direction of time and/or in the direction of frequency. After the addition of the spectral values via the frequency axis, the sum values can optionally be logarithmized.
  • the (optionally logarithmic) total values can be time-differentiated.
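The detection chain described in the last three bullets can be sketched as follows; the spectrogram layout, threshold and peak-picking rule are assumptions for the example:

```python
# Sketch: find intelligibility-relevant time points by summing
# spectral magnitudes over frequency, taking the log, differentiating
# over time and picking positive local maxima of the derivative.
import numpy as np

def onset_times(spec_frames, eps=1e-8):
    """spec_frames: blocks x bins magnitude spectrogram; returns block indices."""
    env = np.log(spec_frames.sum(axis=1) + eps)   # log of per-block sum
    d = np.diff(env)                              # time differentiation
    thr = d.mean()                                # simple assumed threshold
    return [i + 1 for i in range(1, len(d) - 1)
            if d[i] > thr and d[i] >= d[i - 1] and d[i] >= d[i + 1]]

frames = np.full((20, 8), 0.01)
frames[10:] = 1.0                                 # energy jump at block 10
```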
  • the points in time relevant for speech intelligibility can be verified using parameters of the speech signal, such as the zero-crossing rate, short-term energy and/or spectral centroid. It is also possible to take time restrictions for extreme values into account, so that these must have a predetermined minimum time interval, for example.
  • the deflection signal for a specific point in time can then be selected at random from a set of predefined deflection signals. These can be kept in a memory for selection. It has turned out to be advantageous if the deflection signal is adapted to the speech signal with regard to its spectral characteristics and/or its energy. In this way, the spectral focus of the deflection signal can be adapted to the spectral focus of the corresponding speech section at the specific point in time, e.g. by means of single-sideband modulation. A speech section with a high spectral focus can thus be masked with a deflection signal with a likewise high spectral focus (possibly even with the same spectral focus), which leads to greater effectiveness of the masking. Also, the energy of the deflection signal can be matched to the energy of the speech section so as not to generate a masking signal that is too loud and excessively disruptive.
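The random selection and energy matching can be sketched as below; the signal bank, lengths and names are illustrative, and the spectral-centroid shifting via single-sideband modulation mentioned above is omitted from this sketch:

```python
# Sketch: randomly pick a predefined deflection signal and match its
# energy to the speech section to be masked (bank contents assumed).
import numpy as np

rng = np.random.default_rng(4)
bank = [rng.standard_normal(256) for _ in range(3)]   # predefined signals

def adapt_deflection(speech_section, bank, rng):
    sig = bank[rng.integers(len(bank))].copy()        # random selection
    target = np.sum(speech_section ** 2)              # speech-section energy
    sig *= np.sqrt(target / np.sum(sig ** 2))         # energy matching
    return sig

speech = 0.1 * rng.standard_normal(256)
defl = adapt_deflection(speech, bank, rng)
```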
  • the deflection signal can be rendered, when it is output, by means of a multi-channel spatial reproduction, preferably by multiplication with binaural spectra of an acoustic transfer function, whereby a multi-channel (at least 2-channel) representation of the deflection signal is generated which enables a spatial reproduction of the deflection signal.
  • the spatial reproduction increases the effect of the distraction signal in disguising speech at the listening location, especially if the distraction signal is emitted spatially in the other audio zone in such a way that it appears to come from a random direction and/or from near the listener's head. This spatialization reduces the ability to distinguish between the speech signal and the distraction signal, making it more difficult to overhear the speech signal, and the energy of the distraction signal can thus be reduced.
  • the processing of the speech signal presented above and the generation of a masking signal are preferably carried out in the digital domain. This requires steps that are not described in detail, such as an analog-to-digital conversion and a digital-to-analog conversion, which, however, are self-evident to those skilled in the art after studying the present disclosure. Furthermore, the above method can be implemented in whole or in part by means of a programmable device which, in particular, has a digital signal processor and the required analog/digital converter.
  • a device for generating a masking signal in a zone-based audio system which receives a speech signal to be masked and generates the masking signal based on the speech signal.
  • the device comprises means for transforming the detected speech signal into spectral bands; means for interchanging spectral values of at least two spectral bands; and means for generating a noise signal as a masking signal based on the swapped spectral values.
  • the device can further have: means for determining a point in time in the speech signal that is relevant for speech intelligibility; means for generating a deflection signal for the point in time of interest; and means for adding the noise signal and the deflection signal and for outputting the sum signal as a masking signal.
  • the device also includes means for generating a multi-channel representation of the masking signal, which enables spatial reproduction of the masking signal.
  • a zone-based audio system having a plurality of audio zones, wherein at least one audio zone has a microphone for detecting a voice signal and another audio zone has at least one loudspeaker.
  • Microphone and speakers may be located in neckrests of seats for occupants of a vehicle. It is also possible for both audio zones to have a microphone and loudspeaker.
  • the audio system has a device for generating a masking signal, as illustrated above, which receives a speech signal from a microphone in one audio zone and sends the masking signal to the speaker or speakers in the other audio zone.
  • a further aspect of the present disclosure relates to the above-described generation of a deflection signal as a masking signal independently of the noise signal mentioned.
  • a corresponding method for masking a speech signal in a zone-based audio system comprises: detecting a speech signal to be masked in an audio zone; determining a point in time in the speech signal that is relevant for speech intelligibility; generating a deflection signal for the specific point in time, it being possible for the deflection signal to be adapted to the speech signal with regard to a spectral characteristic and/or its energy; and outputting the deflection signal at the specified time as a masking signal in the other audio zone.
  • the possible configurations of the method correspond to the configurations presented above in combination with the noise signal generated.
  • a corresponding device for generating a distraction signal as a masking signal in a zone-based audio system which receives a speech signal to be masked and generates the masking signal based on the speech signal, is also disclosed.
  • This has means for determining a point in time in the speech signal that is relevant for speech intelligibility; means for generating a deflection signal for the relevant point in time, wherein the deflection signal can be adapted to the speech signal with regard to a spectral characteristic and/or its energy; and means for outputting the deflection signal as a masking signal.
  • means for generating a multi-channel representation of the masking signal, which enables spatial reproduction of the masking signal, can also be provided.
  • the following exemplary embodiments allow vehicle occupants in any seating position to conduct undisturbed private conversations, such as telephone calls with other people outside the vehicle.
  • an audio masking signal is generated and played to other vehicle occupants so that their perception of the conversation is disturbed in order to make it more difficult for them to understand the private conversation and, at best, to make it impossible.
  • the conversation can be, for example, a telephone conversation or a conversation between vehicle occupants. In the latter case, there are two speakers who alternately emit speech signals that other occupants should not understand if possible, with the intelligibility of speech between the two participants in the conversation obviously not being impaired.
  • Similar scenarios are generally present when people are in acoustic zones or acoustic surroundings of a room, which are each exposed to sound from separate acoustic playback devices.
  • acoustic zones can exist, for example, in means of transport, such as vehicles, trains, buses, planes, ferries, etc., in which passengers are seated at seats that are each provided with acoustic reproduction means.
  • the proposed approach to creating private acoustic zones is not limited to these examples. It can be applied in general to situations in which people are at different locations in a room (e.g. in theater or cinema seats), can be exposed to sound through individual acoustic reproduction means, and in which the speech of a given speaker should not be understood by other people.
  • a zone-based audio system for generating private acoustic zones at each passenger seat in a vehicle or, more generally, in an acoustic environment.
  • the individual components of the audio system are networked with each other and can exchange information/signals interactively.
  • figure 1 schematically shows an example of such a zone-based audio system 1.
  • a user or passenger is at a seat 2 with a neckrest 3, which has two loudspeakers 4 and two microphones 5.
  • Such a zone-based audio system has one, preferably at least two, loudspeakers 4 for the active acoustic reproduction of personal and individual audio signals, which should not be heard or should only be heard to a small extent by the neighboring zones.
  • the loudspeaker(s) 4 can be mounted in the neckrest 3, in the seat 2 itself or in the headliner of the vehicle.
  • the loudspeakers have an adequate acoustic design and can be controlled via appropriate signal processing in order to keep the acoustic influence on neighboring zones as small as possible.
  • an audio zone has the option of recording the voice of the occupant of the primary acoustic zone, independently of the neighboring zones and the signals actively reproduced therein.
  • one or more microphones 5 can be integrated in the seat 2 or the neckrest 3 or mounted in the direct acoustic environment of the zone and the occupant, as shown schematically in figure 2.
  • the microphones 5 are preferably arranged in such a way that they enable the voice of the occupant making the call to be recorded as well as possible.
  • If a microphone can be placed in close proximity to the speaker's mouth (like the center microphone in figure 2), a single microphone is generally sufficient to capture the speaker's audio signals with sufficient quality.
  • the microphone of a telephone headset can be used to pick up the voice signals. Otherwise, two or more microphones for capturing the speech are advantageous in order to record them better and, above all, in a more targeted manner using digital signal processing, as will be explained below.
  • the audio zone of the speaker can have appropriate signal processing in order to record the speech signals of the primary occupant with as little interference as possible and unaffected by the neighboring zones and the disturbances prevailing in the environment (wind, rolling noise, ventilation, etc.).
  • the speech signal of the vehicle occupant making a call is thus captured at the seating position (either directly by an appropriately arranged microphone or indirectly by means of one or more remote microphones with appropriate signal processing) and separated from any interference signals, such as background noise.
  • a masking signal can be generated from this speech signal for a listening passenger.
  • a broadband masking signal adapted to the speech to be veiled is generated for this passenger.
  • distraction signals can also be generated at the individual speech onsets within the speech of the primary speaker. This is to be understood as meaning short interference signals which are output at specific parts of speech which are important for speech intelligibility and which can likewise be adapted to the speech to be veiled. These distraction signals are output with a temporal overlap with the speech sections relevant for speech intelligibility in order to reduce the information content for the listener and to impair the intelligibility of the speech or its interpretation (informational masking) without significantly increasing the overall sound level.
  • these veiling signals can be played back in a spatial manner (multi-channel), so that a spatial perception of the veiling signals arises. In this way, eavesdropping at the seating positions of the eavesdropping persons can be avoided as best as possible.
  • the proposed approach ensures that the overall sound pressure level at the seating positions of the listening passengers increases only minimally and the annoyance or impairment (annoyance) of the passengers is not increased or the local listening comfort is preserved as best as possible, in contrast to an approach in which simply a loud background noise is emitted to mask the speech (energetic masking).
  • figure 3 shows the functionality and the basic system structure of an embodiment for two audio zones as an example.
  • the speech signals of the occupant of the primary acoustic zone I are recorded using the microphones 5 of this zone, arranged in the neckrest 3 of the speaker, and subjected to a first digital signal processing A in order to extract the speech signals of the primary occupant with as little interference as possible, unaffected by the neighboring zones and the ambient disturbances (wind, rolling noise, ventilation, etc.).
  • the microphone or microphones 5 can also be arranged in front of the speaker, as shown in figure 2, for example in the rear part of the front occupant's neckrest or in the headliner, steering wheel or dashboard.
  • in this example the eavesdropper is in the seat directly in front of the speaker, but this need not be the case, and the eavesdropper can be anywhere within the vehicle.
  • the speech signals processed in this way are then fed to a second signal processing unit B, which generates appropriate speech-scrambling signals so that the intelligibility of the speech for the listening occupant is reduced.
  • the scrambling signals are then output by means of the loudspeakers 4' in the second acoustic zone II. These are arranged, for example, in the neckrest 3' of the occupant who is listening in, in order to achieve the most direct and undisturbed reproduction possible of the speech scrambling signals.
  • a speech scrambling signal can have a broadband masking signal adapted to the speech signal of the primary occupant and/or a distraction signal starting at individual speech onsets. In this way, acoustic zones can be made private in such a way that unwanted overhearing across the boundary of an acoustic zone is made significantly more difficult.
  • in an alternative, active-cancellation approach, the estimated speech signals at the respective listening or microphone location are reduced by actively feeding in adaptive canceling signals.
  • since the listening position is slightly variable in practice, and since the listening and microphone locations are a few centimeters apart, only speech signal components up to about 1.5 kHz can be actively reduced in this way.
  • since speech intelligibility is primarily dominated by consonants, and thus by signal components with frequencies above 2 kHz, this approach alone is insufficient or even critical: in the event of insufficient tuning (e.g. incorrect adjustment to the head position), the cancellation signals can carry exactly the relevant private information and even amplify it, so that speech intelligibility is increased rather than reduced.
  • the proposed approach is less sensitive to the exact head positions of the speaker and the listener and enables a reduction in speech intelligibility even of higher-frequency parts of speech such as consonants.
  • figure 4 shows such a multi-zone approach based on a multi-row vehicle schematically, in which 6 acoustic zones are provided.
  • speakers and microphones are integrated into the passengers' neckrests, although the microphones may be placed in other positions in front of the respective speakers to provide a convenient location for capturing the speech signals. Similar to figure 3, it is assumed in this example that the speaker is sitting behind the unwanted listener (here the driver).
  • the speech signals of the speaking occupant can equally be used to generate masking signals for occupants other than the driver and also for several unwanted eavesdroppers.
  • the speaker can also be in a different location in the vehicle than in the example shown in figure 4.
  • the approach disclosed here can be applied very generally to all scenarios in which the speech of a speaker is detected and generated speech scrambling signals can be output in a targeted manner to the unwanted listener or listeners.
  • the speech signals can belong to a telephone conversation that the speaker has with an external person outside of the room in which the acoustic zones are located.
  • the conversation can also be conducted between people in the room, for example between the speaker shown in figure 4 and the occupant to his right.
  • the same signal processing as for the speaker shown is also to be provided for the second speaker in the zone-based audio system, so that his speech is also recorded and processed in order to generate suitable concealment signals for the listener or listeners. If the two speakers speak alternately, only the current speaker needs to be determined and the masking signals associated with this speaker need to be output. If both speakers speak at the same time, both masking signals can also be output at the same time.
  • in one scenario, a vehicle passenger seated "rear left", as the internal speaker, has a telephone conversation with a person outside the vehicle.
  • the speech of the external speaker (the far-end speaker signal) can also be detected as speech to be veiled. It is then concealed for the listening position "front left", i.e. for the listening vehicle driver.
  • this is only one possible scenario and the proposed methods can be used in general for all possible configurations of speaker position and listening position.
  • the signal sig_est estimated by means of the digital signal processing A for the speech signal to be concealed provides the base quantity for the subsequent generation of the masking or concealment signal.
  • the speech signal to be veiled can be the active internal speaker in the vehicle compartment and/or the external speaker outside.
  • the concealment signal can be a broadband masking signal and/or deflection signals. These generated signals ( send to: out LS-Left & LS-Right ) are played back via the active neckrest at the listening position.
  • both scrambling signals are generated, added and played back together in order to have an increased effect on the listener and to impair his intelligibility.
  • the combination of the two concealment signals results in a synergetic effect of these signals in reducing speech intelligibility.
  • the continuous broadband masking signal generates background noise, and the volume (energy) of the signal can be reduced compared to outputting only a noise signal, so that a less disturbing effect is achieved.
  • figure 5 shows a schematic block diagram for the generation of a broadband speech-signal-dependent masking.
  • the input signal is the speech signal sig est to be concealed.
  • the resulting two-channel output signals (out LS-Left & LS-Right) are sent to the active neckrest at the listening position, overlaid with deflection signals if necessary, and output to the listening person via loudspeakers attached to/in the neckrest.
  • the speech signal sig est is transformed into the frequency domain and smoothed both in terms of time and frequency.
  • the filter bank can consist of overlapping bands with a triangular frequency response. The center frequencies of the bands are divided equidistantly across the mel scale. The bottom frequency band of the filter bank starts at 0 Hz and the top frequency band ends at half the sample rate (fs).
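The filter bank just described can be sketched as follows. The 24-band count matches the description below; the FFT size and sample rate are assumptions, and the simple edge layout (first band starting at 0 Hz, last ending at fs/2, centers equidistant on the mel scale) follows the text:

```python
# Sketch: triangular, overlapping mel filter bank with centers
# equidistant on the mel scale, spanning 0 Hz to fs/2.
import numpy as np

def mel_filterbank(fs, n_fft, n_bands=24):
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # n_bands + 2 edges: each band spans two adjacent edge intervals
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_bands + 2))
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    fb = np.zeros((n_bands, freqs.size))
    for b in range(n_bands):
        lo, c, hi = edges[b], edges[b + 1], edges[b + 2]
        rising = (freqs - lo) / (c - lo)
        falling = (hi - freqs) / (hi - c)
        fb[b] = np.clip(np.minimum(rising, falling), 0.0, 1.0)
    return fb

fb = mel_filterbank(16000, 512)
```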
  • these dynamic loudness curves are interchanged in the immediate frequency environment (scrambling).
  • the loudness values of the bands are swapped according to the table below, whereby the assignment of the band "in" results from the corresponding position in the line "out" below it.
  • the loudness value of band number 2 is assigned to band number 4, the value of band 4 is assigned to band 5, whose value in turn is assigned to band 3, and so on.
  • the difference between a mel band and a swapped band is a maximum of two mel bands.
  • the table shown is only one possible example of the swapping of bands, and other implementations are possible.
  • the loudness values are thus "scrambled", so that a certain "disorder" arises in the distribution of the loudness values for an associated speech section, whereby the description of its spectral energy or loudness distribution is changed without the total energy or loudness of the speech section being changed. For example, a particularly pronounced energy content in one band is shifted to another band, or a low level of energy (loudness) in one band is moved to an adjacent band. It has been shown that by redistributing the energy into adjacent bands, a particularly effective broadband noise signal can be generated, which reduces the intelligibility of the associated speech section more than without band swapping.
  • in this way, the transmission of speech information in the noise signal is avoided. If the speech energy were recorded in frequency bands (e.g. mel bands as described above) and the amplitude of these temporal energy curves were modulated directly onto a noise signal divided into the same frequency bands, then the speech content would be audible, and all the more understandable the narrower the frequency bands used. This effect is significantly reduced by the band swapping of the loudness values.
  • the possibly interchanged dynamic loudness curves can be adjusted using the current background spectra (including all background noise) in section 130 of the block diagram in order to evaluate background noise and the environmental situation.
  • the background noise is recorded at the listening position, for example, and similar to the speech signal, the background spectra are determined by means of frequency transformation and time and frequency averaging.
  • a microphone arranged at the monitoring position is preferably used for this purpose.
  • alternatively, microphones located elsewhere can be used to capture the background noise at the listening position. Only those bands of the speech signal that lie above the background spectrum need to be considered when generating the masking signal.
  • Speech bands whose energy is below the energy of the corresponding background noise band can be neglected, since they play no role in speech intelligibility or are already covered by the background noise. This can be done, for example, by setting the loudness value of such speech bands to zero. In other words, if a frequency band is already masked by strong background noise, no additional masking signal is generated in this frequency band. In this way, a situational decision is made as to which signal components of the broadband masking noise are fed in to disguise the speech.
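This gating step can be sketched minimally as follows, assuming per-band loudness values for speech and background on the same mel scale:

```python
import numpy as np

def gate_by_background(speech_bands, background_bands):
    """Zero out speech bands that are already masked by the background noise."""
    speech = np.asarray(speech_bands, dtype=float)
    background = np.asarray(background_bands, dtype=float)
    # Bands at or below the background level play no role for intelligibility,
    # so no additional masking energy is generated there.
    return np.where(speech > background, speech, 0.0)
```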
  • the resulting masking thresholds (frequency axis sampled at the 24 frequencies corresponding to the 24 center frequencies of the mel filter bank) are interpolated to all frequency support points of the Fourier transformation.
  • the frequency values thus generated are multiplied point by point at the frequency support points with a noise spectrum (corresponding to a convolution in the time domain).
  • the noise spectrum is provided by a noise generator (not shown), whose noise signal passes through block segmentation 145 and Fourier transformation 150 with the same dimensions as the speech signal sig_est.
  • in this way, a broadband noise signal is generated as a masking signal with a frequency characteristic similar to that of the speech signal (apart from the interchanging and zeroing in sections 125 and 130).
  • alternatively, the masking signal can also be generated in the time domain by convolving the noise signal with the spectral values of the speech signal, processed as described above (see sections 100 to 135) and transformed back into the time domain.
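The interpolation and noise shaping can be sketched as below. The mel centre frequencies and sample rate are placeholders, not the values of the concrete implementation:

```python
import numpy as np

NFFT = 1024                                   # FFT length mentioned in the text
SR = 48000                                    # assumed sample rate
MEL_CENTERS = np.linspace(100.0, 8000.0, 24)  # placeholder centre frequencies
FFT_FREQS = np.fft.rfftfreq(NFFT, 1.0 / SR)

def shape_noise(mask_thresholds, noise_spectrum):
    """Interpolate the 24 masking thresholds onto all FFT support points
    and multiply them point by point with a noise spectrum."""
    gains = np.interp(FFT_FREQS, MEL_CENTERS, mask_thresholds)
    return gains * noise_spectrum
```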
  • section 160 is followed by spatial processing by point-by-point multiplication of the frequency support points (or convolution in the time domain, see above) with binaural spectra of an acoustic transfer function that corresponds, from the point of view of the person listening in, to the source direction of the speaker (or the dominant direction of the energy focus of the speech signal to be masked).
  • the source direction of the speaker is known from the spatial arrangement of the acoustic zones. In the example shown in figure 4, the source direction of the speaker is directly behind the person listening in.
  • for spatial reproduction, multi-channel playback (e.g. using two loudspeakers) is required; otherwise, single-channel reproduction is sufficient, which preferably also takes place by means of two loudspeakers arranged in the neckrest of the person listening in.
  • the broadband masking signal can be reproduced spatially, adjusted to the target direction of incidence of the direct signal or the prominently perceived direction of the speaker. Due to binaural loudness addition, significantly improved masking is achieved at lower excess levels of the masking noise.
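In the frequency domain, the binaural weighting amounts to a point-by-point multiplication per channel. The BRTF spectra below are stand-ins for measured transfer functions; a real implementation would select the pair matching the speaker's source direction:

```python
import numpy as np

def binaural_weight(block_spectrum, brtf_left, brtf_right):
    """Apply direction-dependent binaural spectra to one signal block
    (equivalent to convolving with the impulse responses in the time domain)."""
    return brtf_left * block_spectrum, brtf_right * block_spectrum
```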
  • in section 165, the two resulting spectra (in the case of spatial reproduction) are transformed back (IFFT) into the time domain per block, and the blocks are superimposed using the overlap-add method (see section 170). It is noted that for spatial reproduction a multi-channel signal is created, which can be played back, for example, as stereo. If the previous steps have already been carried out in the time domain, the inverse transformation and the superimposition of the blocks are of course unnecessary.
  • the resulting time signals are sent to the respective active neckrest of the person listening in.
  • the masking signals can be summed with the deflection signals before being output via the loudspeakers of the neckrest.
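The overlap-add reconstruction mentioned above can be sketched as follows (the 50% overlap in the test is chosen for illustration only):

```python
import numpy as np

def overlap_add(blocks, hop):
    """Reassemble overlapping time blocks into one continuous signal."""
    blocks = [np.asarray(b, dtype=float) for b in blocks]
    block_len = len(blocks[0])
    out = np.zeros(hop * (len(blocks) - 1) + block_len)
    for i, block in enumerate(blocks):
        # Each block is added at its hop-spaced position; overlapping
        # regions sum, which reconstructs the signal for a suitable window.
        out[i * hop : i * hop + block_len] += block
    return out
```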
  • some of the signal processing can be performed in the frequency domain or in the time domain, although it is also possible to perform all of the processing in the frequency domain.
  • the specific values mentioned above are only examples of a possible configuration and can be changed in many ways.
  • a frequency resolution of the FFT transformation with less than 1024 points or a division of the mel filter with more or less than 24 filters is possible.
  • the frequency transformation of the noise signal may take place with a different block size and/or FFT configuration than that of the speech signal. In this case, the interpolation in section 135 would have to be adjusted accordingly to produce suitable frequency values.
  • the masking noises calculated in blocks are first transformed back into the time domain after interpolation and then brought back into the frequency domain in order to take into account the spatialization there - possibly with a different spectral resolution.
  • deflection signals with a short duration are used, which are adapted in terms of time and/or frequency to sections in the speech signal that are particularly relevant for intelligibility.
  • FIG 6 shows schematically an example of a block diagram for generating speech-signal-dependent distraction signals. The listener is distracted at signal-dependent defined times.
  • the critical points in time (t_i,distract) are determined using three information parameters in the speech signal: spectral centroid "SC" (roughly corresponds to the pitch), short-term energy "RMS" (roughly corresponds to the volume) and number of zero crossings "ZCR" (for differentiating speech signal from background noise).
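These three parameters can be computed per signal block roughly as follows; this is a sketch with common textbook definitions, with no claim to match the exact definitions of the implementation:

```python
import numpy as np

def block_features(x, sr):
    """Spectral centroid SC, short-term energy RMS and zero-crossing rate ZCR."""
    x = np.asarray(x, dtype=float)
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    sc = np.sum(freqs * spectrum) / max(np.sum(spectrum), 1e-12)  # pitch proxy
    rms = np.sqrt(np.mean(x ** 2))                                # volume proxy
    zcr = np.mean(np.abs(np.diff(np.sign(x)))) / 2.0              # crossings per sample
    return sc, rms, zcr
```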
  • Suitable distraction signals preferably have the following properties: on the one hand, they are natural signals that the listener knows from other situations in everyday life and that are therefore not associated with the signal and context to be concealed. Furthermore, they are acoustically distinctive signals of short duration with the broadest possible spectrum. Further examples of such signals are dripping water, water waves or brief gusts of wind. Usually, the distraction signals are longer than the relevant speech sections (e.g. consonants) and cover them completely. It is also possible to store deflection signals of different lengths and to select one matching the duration of the current critical point in time.
  • a deflection signal is selected and adjusted in time and frequency to the current segment of speech.
  • the adjusted deflection signal can then be played back to the listener from a virtual spatial position.
  • spatialization is performed using binaural room transfer functions (BRTFs) with short impulse responses (e.g. 256 points).
  • multi-channel (e.g. stereo) playback is required for spatial reproduction.
  • the curves of the short-term energy RMS and the zero-crossing rate ZCR can also be filtered using signal-dependent threshold values, and areas that do not meet these threshold values can be masked out (e.g. set to zero).
  • the thresholds can be chosen such that a certain percentage of the signal values are above or below.
  • an onset detection function is first determined in section 235 .
  • the spectrally and temporally averaged spectra are summed over the frequency axis.
  • the resulting signal is logarithmized and differentiated over time, with negative values set to zero.
  • a regularization (e.g. adding a small number to all frequency support points) can be applied before the logarithm to avoid taking the logarithm of zero.
  • This onset detection function is examined for local maxima, which must be at least a predetermined number of blocks apart.
  • the maxima found in this way can be further filtered by means of a signal-dependent threshold value, so that only particularly pronounced maxima remain.
  • Local maxima of the onset detection function determined in this way are candidates for perception-relevant sections of the speech signal that are to be selectively disturbed by means of a deflection signal.
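The onset detection described above can be sketched as follows, using the half-wave-rectified difference of the log-summed loudness and a minimum peak distance (the parameter values are hypothetical):

```python
import numpy as np

def onset_candidates(band_loudness, min_dist=4, eps=1e-6):
    """band_loudness: blocks x bands array. Returns the onset detection
    function and candidate peak indices at least min_dist blocks apart."""
    env = np.log(np.sum(band_loudness, axis=1) + eps)  # sum over frequency axis
    odf = np.maximum(np.diff(env), 0.0)                # negative values set to zero
    peaks = []
    for i in range(1, len(odf) - 1):
        is_local_max = odf[i] > odf[i - 1] and odf[i] >= odf[i + 1]
        if is_local_max and odf[i] > 0 and (not peaks or i - peaks[-1] >= min_dist):
            peaks.append(i)
    return odf, peaks
```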
  • the maxima of the onset detection function determined in this way are checked for plausibility in section 240 via a logic unit using the parameters ZCR, RMS and SC. Only when these values are within a defined range are these maxima defined as relevant, critical points in time t_i,distract. This can be done, for example, by requiring the values of RMS, SC and/or ZCR to meet certain logical conditions at the times of determined maxima of the onset detection function (e.g. RMS > X1; X2 < SC < X3; ZCR > X4, with X1 to X4 specified threshold values).
  • only maxima are taken into account, for example, which lie in time segments that satisfy the above-mentioned filter conditions for RMS and ZCR (i.e. that do not lie in masked-out areas).
  • the condition that ZCR and RMS must simultaneously meet certain threshold conditions can also be used to filter the course of SC: the values of SC are retained where the threshold conditions are met, and values in between are interpolated or extrapolated, resulting in the function SC_int.
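The plausibility logic can be expressed directly as threshold conditions; the thresholds X1 to X4 below are arbitrary placeholders, not values from the implementation:

```python
def is_relevant_onset(rms, sc, zcr, x1=0.05, x2=300.0, x3=4000.0, x4=0.1):
    """Keep an onset candidate only if RMS > X1, X2 < SC < X3 and ZCR > X4."""
    return rms > x1 and x2 < sc < x3 and zcr > x4
```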
  • the parameters of this frequency transformation can be different and independent of the above embodiment for the speech signal to be masked.
  • the frequency representation of a deflection signal could also be stored directly in the frequency domain.
  • the resulting spectra can be adjusted in section 265, depending on the signal sig_est at the respective point in time t_i,distract, in their frequency position using the SC parameter ratios (e.g. by single-sideband modulation) and/or in their amplification using the RMS parameter ratios.
  • for example, the ratio of the spectral centroid SC of the respective speech signal section at an onset time t_i,distract to that of the associated distraction signal is formed, and the frequency position of the distraction signal is adjusted so that it matches that of the speech signal as closely as possible.
  • the energy (RMS) of the distraction signal is also adapted to the energy of the speech signal portion, so that a predetermined energy ratio of the distraction signal to the speech signal is achieved. Due to their high effectiveness in reducing speech intelligibility, the distraction signals can be reproduced at a low volume, so that the overall sound pressure level at the seating positions of the passengers listening in increases only minimally, the annoyance or impairment of the passengers is not increased, and the local hearing comfort is preserved as well as possible.
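Both adjustments can be sketched with NumPy alone. The single-sideband shift below uses an FFT-based analytic signal, and the target energy ratio is a placeholder:

```python
import numpy as np

def ssb_shift(x, delta_hz, sr):
    """Shift all frequencies of x by delta_hz via the analytic signal
    (single-sideband modulation)."""
    n = len(x)
    spectrum = np.fft.fft(x)
    h = np.zeros(n)                 # Hilbert weighting: keep positive frequencies
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    analytic = np.fft.ifft(spectrum * h)
    t = np.arange(n) / sr
    return np.real(analytic * np.exp(2j * np.pi * delta_hz * t))

def match_level(distraction, speech_rms, target_ratio=0.5):
    """Scale the distraction signal to a predetermined energy ratio
    relative to the speech signal portion."""
    d_rms = np.sqrt(np.mean(distraction ** 2))
    if d_rms < 1e-12:
        return distraction
    return distraction * (target_ratio * speech_rms / d_rms)
```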
  • the resulting modified spectra of the distraction signals are spatialized in section 270 by a binaural room transfer function (BRTF) by means of point-by-point multiplication of the frequency support points (or convolution in the time domain) with the corresponding spectra, depending on a random selection of direction per point in time t_i,distract.
  • a direction is randomly selected in section 275 for a deflection signal.
  • Binaural spatial transfer functions (BRTF) matching the possible directions are located in memory 280 .
  • spatialization can be performed in the frequency or time domain.
  • a convolution with the impulse response of a selected outer ear transfer function is carried out in the time domain.
  • the spatialization of the distraction signals is preferably carried out in such a way that the distraction signals are localized by the listener as close and present to the head as possible, so that they achieve a strong distraction effect.
  • a multi-channel (eg in stereo) reproduction is required for the spatial reproduction, otherwise a single-channel reproduction would be sufficient, which, however, preferably also takes place by means of two loudspeakers integrated in the neck support.
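A time-domain sketch of the random-direction spatialization; the impulse responses here are random placeholders for measured BRTFs (such as those held in memory 280):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bank of binaural impulse-response pairs (256 points each),
# one left/right pair per selectable direction.
BRTF_BANK = {
    direction: (rng.standard_normal(256) * 0.01, rng.standard_normal(256) * 0.01)
    for direction in ("front-left", "front-right", "rear-left", "rear-right")
}

def spatialize_distraction(signal):
    """Pick a random direction and convolve the distraction signal with
    its binaural impulse responses to obtain a two-channel output."""
    direction = str(rng.choice(list(BRTF_BANK)))
    h_left, h_right = BRTF_BANK[direction]
    return direction, np.convolve(signal, h_left), np.convolve(signal, h_right)
```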
  • the convolution results are transformed back into the time domain by an inverse Fourier transform (IFFT) with N_FFT,2 points.
  • the inverse-transformed time blocks are put together in section 290 with the aid of the overlap-add method in the correct order and value to form a time signal. If the previous steps have already been carried out in the time domain, the inverse transformation and the superimposition of the blocks are of course unnecessary.
  • the resulting time signals are sent to the respective active neckrest of the person listening in.
  • the masking signals can be summed with the deflection signals prior to being output via the neckrest speakers.
  • the speech-signal-adapted deflection signal generates randomly spatially distributed stimulus/trigger information and improves the concealment of the speech target signal without significant permanent signal levels.
  • some of the signal processing can be performed in the frequency domain or in the time domain.
  • the specific values mentioned above are only examples of a possible configuration of the frequency transform and can be changed in many ways.
  • the energy and frequency-adjusted spectra are first transformed back into the time domain and then brought back into the frequency domain in order to take into account the spatialization there - possibly with a different spectral resolution.
  • Those skilled in the art of digital signal processing will recognize such variations of the inventive approach to generating speech signal dependent deflection signals after reading the present disclosure.
  • both concealment signals are summed before output and reproduced together.
  • the masking noise, which is primarily perceived from the direction of the speaker, is a broadband noise signal adapted to the spectral properties of the respective speech segment, on which short distraction signals (matched in time and frequency) are superimposed at particularly relevant points. These distraction signals are perceived spatially close to the head and lead to a particularly effective reduction in speech intelligibility, even if they are played back at low volume or energy. Due to the combination with the broadband masking noise, however, the brief switching on and off of the deflection signals is perceived as less of a nuisance or impairment.
  • the overall sound pressure level at the seating positions of the listening passengers increases only minimally, the annoyance or impairment of the passengers is not increased, and the local hearing comfort is maintained as well as possible.

Landscapes

  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)
  • Stereophonic System (AREA)
EP22201974.7A 2021-10-18 2022-10-17 Masquage audio de locuteurs Active EP4167229B1 (fr)

Priority Applications (6)

Application Number Priority Date Filing Date Title
US18/702,209 US20250239248A1 (en) 2021-10-18 2022-10-18 Audio masking of language
KR1020247014966A KR20240089343A (ko) 2021-10-18 2022-10-18 음성의 오디오 마스킹
JP2024524500A JP2024542967A (ja) 2021-10-18 2022-10-18 発話の音声マスキング
PCT/EP2022/078926 WO2023066908A1 (fr) 2021-10-18 2022-10-18 Masquage audio de la voix
CN202280070252.4A CN118140266A (zh) 2021-10-18 2022-10-18 语音的音频掩蔽
EP22803245.4A EP4420115A1 (fr) 2021-10-18 2022-10-18 Masquage audio de la voix

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
EP21203247.8A EP4167228B1 (fr) 2021-10-18 2021-10-18 Masquage audio des haut-parleurs

Publications (3)

Publication Number Publication Date
EP4167229A1 true EP4167229A1 (fr) 2023-04-19
EP4167229C0 EP4167229C0 (fr) 2025-01-01
EP4167229B1 EP4167229B1 (fr) 2025-01-01

Family

ID=78500398

Family Applications (2)

Application Number Title Priority Date Filing Date
EP21203247.8A Active EP4167228B1 (fr) 2021-10-18 2021-10-18 Masquage audio des haut-parleurs
EP22201974.7A Active EP4167229B1 (fr) 2021-10-18 2022-10-17 Masquage audio de locuteurs

Family Applications Before (1)

Application Number Title Priority Date Filing Date
EP21203247.8A Active EP4167228B1 (fr) 2021-10-18 2021-10-18 Masquage audio des haut-parleurs

Country Status (2)

Country Link
EP (2) EP4167228B1 (fr)
ES (1) ES3013982T3 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE69130687T2 (de) * 1990-05-28 1999-09-09 Matsushita Electric Industrial Co. Sprachsignalverarbeitungsvorrichtung zum Herausschneiden von einem Sprachsignal aus einem verrauschten Sprachsignal
US20120016665A1 (en) * 2007-03-22 2012-01-19 Yamaha Corporation Sound masking system and masking sound generation method
EP2877991A2 (fr) * 2012-07-24 2015-06-03 Koninklijke Philips N.V. Masquage de son directionnel
DE102014214052A1 (de) * 2014-07-18 2016-01-21 Bayerische Motoren Werke Aktiengesellschaft Virtuelle Verdeckungsmethoden


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ALVARSSON JESPER J ET AL: "Aircraft noise and speech intelligibility in an outdoor living space", THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, AMERICAN INSTITUTE OF PHYSICS, 2 HUNTINGTON QUADRANGLE, MELVILLE, NY 11747, vol. 135, no. 6, 6 June 2014 (2014-06-06), pages 3455 - 3462, XP012186211, ISSN: 0001-4966, [retrieved on 19010101], DOI: 10.1121/1.4874625 *

Also Published As

Publication number Publication date
ES3013982T3 (en) 2025-04-16
EP4167228A1 (fr) 2023-04-19
EP4167228B1 (fr) 2025-12-10
EP4167229C0 (fr) 2025-01-01
EP4167229B1 (fr) 2025-01-01

Similar Documents

Publication Publication Date Title
DE10308414B4 (de) Verfahren zur Steuerung eines Akustiksystems im Fahrzeug
DE102014214052A1 (de) Virtuelle Verdeckungsmethoden
DE102014210105A1 (de) Zonenbasierte Tonwiedergabe in einem Fahrzeug
EP3375204B1 (fr) Traitement de signal audio dans un véhicule
WO2023066908A1 (fr) Masquage audio de la voix
DE112018001454T5 (de) Vorrichtung und verfahren zur verbesserung der privatsphäre
DE112017004568B4 (de) Fahrzeuginternes Privatsphärensystem, Verfahren zum Maskieren von Sprache und Fahrzeug umfassend ein fahrzeuginternes Privatsphärensystem
DE102017203630B3 (de) Verfahren zur Frequenzverzerrung eines Audiosignals und nach diesem Verfahren arbeitende Hörvorrichtung
DE102014210760B4 (de) Betrieb einer Kommunikationsanlage
EP4167229B1 (fr) Masquage audio de locuteurs
DE102014214053A1 (de) Autogenerative Maskierungssignale
WO2020035198A1 (fr) Procédé et dispositif pour l'adaptation d'une sortie audio à l'utilisateur d'un véhicule
DE102015014916A1 (de) Verfahren zur Ausgabe von Audiosignalen
DE102013221127A1 (de) Betrieb einer Kommunikationsanlage in einem Kraftfahrzeug
CN118140266A (zh) 语音的音频掩蔽
DE102022202390A1 (de) Verfahren zum Betreiben eines Audiosystems in einem Fahrzeug und zugehörige Vorrichtung
DE102018207530A1 (de) Vorrichtung und Verfahren für Verbesserung der Privatsphäre
EP4460959B1 (fr) Procédé de commande d'un relais de communication entre un dispositif mains libres dans un véhicule automobile et un utilisateur, et dispositif mains libres
DE102016107799B3 (de) Verfahren zur Verarbeitung eines FM-Stereosignals
DE102014013524B4 (de) Kommunikationsanlage für Kraftfahrzeuge
DE10327053A1 (de) Audiosystem zum parallelen Hören unterschiedlicher Audioquellen
DE19823007A1 (de) Verfahren und Einrichtung zum Betrieb einer Telefonanlage, insbesondere in Kraftfahrzeugen
DE102022134954A1 (de) Verbesserungssystem für fahrzeugaudio
DE3737873A1 (de) Verfahren und vorrichtung zur verbesserung der sprachverstaendlichkeit bei kommunikationseinrichtungen
WO2007009505A1 (fr) Protection de la sphere privee lors d'emissions acoustiques par haut-parleurs

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC ME MK MT NL NO PL PT RO RS SE SI SK SM TR

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20231019

RBV Designated contracting states (corrected)

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC ME MK MT NL NO PL PT RO RS SE SI SK SM TR

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: GRANT OF PATENT IS INTENDED

INTG Intention to grant announced

Effective date: 20240717

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE PATENT HAS BEEN GRANTED

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC ME MK MT NL NO PL PT RO RS SE SI SK SM TR

REG Reference to a national code

Ref country code: GB

Ref legal event code: FG4D

Free format text: NOT ENGLISH

REG Reference to a national code

Ref country code: CH

Ref legal event code: EP

REG Reference to a national code

Ref country code: DE

Ref legal event code: R096

Ref document number: 502022002565

Country of ref document: DE

REG Reference to a national code

Ref country code: IE

Ref legal event code: FG4D

Free format text: LANGUAGE OF EP DOCUMENT: GERMAN

REG Reference to a national code

Ref country code: NL

Ref legal event code: FP

U01 Request for unitary effect filed

Effective date: 20250127

U07 Unitary effect registered

Designated state(s): AT BE BG DE DK EE FI FR IT LT LU LV MT NL PT RO SE SI

Effective date: 20250131

REG Reference to a national code

Ref country code: ES

Ref legal event code: FG2A

Ref document number: 3013982

Country of ref document: ES

Kind code of ref document: T3

Effective date: 20250416

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: PL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20250101

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: IS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20250501

Ref country code: NO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20250401

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: HR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20250101

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: GR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20250402

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: CZ

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20250101

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SM

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20250101

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20250101

PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

26N No opposition filed

Effective date: 20251002

U20 Renewal fee for the european patent with unitary effect paid

Year of fee payment: 4

Effective date: 20251031