Disclosure of Invention
To solve the technical problem in the prior art that, in an environment with multiple sound sources, noise other than the sound of a target object cannot be reduced or eliminated, the invention provides a method and an apparatus for recording sound.
According to a first aspect of the present invention, there is provided a method of recording sound, the method comprising:
determining the direction of the target object;
adjusting the direction of the recorded sound to the determined direction;
and recording the sound signal from the determined direction in real time, and denoising the recorded sound signal.
Preferably, the denoising process includes at least one of: echo cancellation processing, beamforming processing, noise suppression processing, and dereverberation processing.
As an embodiment, determining the direction of the target object comprises the following steps:
S1: acquiring an image of one area of a current recording environment, wherein the current recording environment comprises a plurality of areas;
S2: comparing an image of the target object with the acquired image of the area, and outputting a comparison result, wherein the image of the target object comprises a face image of the target object;
S3: judging, according to the comparison result, whether the image of the area contains the image of the target object,
when the comparison result indicates that the image of the area contains the image of the target object, executing step S4; otherwise, updating the area to be acquired and returning to step S1;
S4: determining the direction of the target object in the world coordinate system according to the image of the area and a camera calibration algorithm.
As another embodiment, determining the direction of the target object includes:
when the target object emits sound, determining the position of the target object in a world coordinate system based on the time difference and the intensity difference of the sound emitted by the target object reaching two sound recording devices;
and determining the direction of the target object in the world coordinate system based on the position of the target object in the world coordinate system.
Preferably, adjusting the direction of recording the sound to the determined direction includes:
adjusting the recording direction of the recording equipment according to the direction of the target object in the world coordinate system so that the adjusted recording direction of the recording equipment is consistent with the direction of the target object in the world coordinate system,
wherein the recording device includes a directional microphone.
Preferably, the echo cancellation process includes:
converting the recorded sound signal to the frequency domain by a conventional fast Fourier transform algorithm or a modulated complex lapped transform algorithm;
filtering out echo portions of the sound signal in the frequency domain by a plurality of adaptive acoustic echo cancellation filters;
and the beamforming process includes:
recording the sound signals through a recording device array consisting of a plurality of recording devices;
performing phase delay compensation on the sound signal recorded by each recording device in the recording device array;
and superimposing (summing) the sound signals recorded by the plurality of recording devices after the phase delay compensation, so that the amplitude of the sound signal from the direction of the target object is increased after the superposition.
Preferably, the noise suppression processing includes:
determining a frequency band in which a noise signal in the sound signal is located by using spectral subtraction;
filtering out the noise signal in that frequency band;
and the dereverberation process includes:
estimating a frequency spectrum of a reverberation part of the sound signal using a delayed frequency spectrum of the sound signal and a parameter indicative of the decay of the reverberation part over time;
filtering out the frequency spectrum of the reverberation part with a filter.
Preferably, the method further comprises:
inputting or pre-storing a historical sound signal of a target object;
comparing the historical sound signal of the target object with the sound signal after denoising by utilizing voiceprint recognition to obtain the frequency band of the sound signal of the target object in the sound signal,
and filtering out the sound signals of the frequency bands except the frequency band of the sound of the target object in the sound signals subjected to the denoising processing by a filter.
Preferably, the method is applied to an apparatus for recording audio or video.
According to a second aspect of the present invention, there is provided an apparatus for recording sound, comprising:
a sound capture device;
a processor; and
a memory storing executable code which, when executed by the processor, determines the direction of a target object, sends instructions to the sound capture device to control the sound capture device to adjust the direction of the recorded sound to the determined direction and to record the sound signal from the determined direction in real time, and denoises the recorded sound signal.
Compared with the prior art, one or more embodiments in the above scheme can have the following advantages or beneficial effects:
the method and the apparatus for recording sound are applicable to a noisy recording environment with multiple sound sources. By adjusting the recording direction, the recording device collects or records only the sound from the direction of the target object and does not record sound from other directions, so that most of the noise is effectively excluded during recording; noise that is still recorded, such as echo, is then filtered out by the denoising processing, which ensures the recording quality.
Further, in order to enable the recorded sound to be more accurate, the embodiment of the invention performs fine screening on the sound signal subjected to denoising processing through voiceprint recognition, and achieves the effect of completely filtering out other sound signals except the sound signal of the target object.
Further, the embodiment of the invention can also process recorded audio or video that contains the sound signals of a plurality of sound sources (target objects) together with other noise, and obtain the noise-free sound signal of each individual sound source. In addition, the embodiment of the invention can label the sound signal of each individual sound source to identify the target object to which it belongs, which facilitates quick editing at a later stage.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Detailed Description
The following detailed description of the embodiments of the present invention will be provided with reference to the drawings and examples, so that how to apply the technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented. It should be noted that, as long as there is no conflict, the embodiments and the features of the embodiments of the present invention may be combined with each other, and the technical solutions formed are within the scope of the present invention.
To solve the technical problem in the prior art that, in an environment with multiple sound sources, noise other than the sound of a target object cannot be reduced or eliminated, the invention provides a method and an apparatus for recording sound.
Fig. 1 is a flowchart of a method of recording sound according to an embodiment of the present invention. As shown in fig. 1, the method includes:
step S11: determining the direction of the target object;
step S12: adjusting the direction of the recorded sound to the determined direction;
step S13: and recording the sound signal from the determined direction in real time, and denoising the recorded sound signal.
First, it should be noted that the direction of the target object refers to the direction of the target object relative to the center point of the sound recording device, or of the plurality of sound recording devices; for example, the target object is located 45 degrees to the left of that center point.
As an embodiment, in step S11, for example, a face image of the target object may be input or stored in advance, and the stored face image of the target object may be compared with an image of the recording environment acquired by the image acquisition device through image recognition, so as to find the target object.
Considering that the range of the current recording environment may be large in practice, and the image acquisition device may only acquire images of a part of the area of the current recording environment at a time, the embodiment of the present invention preferably determines the direction of the target object in the world coordinate system by using a loop iteration method.
Specifically, step S11 includes the steps of:
S1: acquiring an image of one area of a current recording environment, wherein the current recording environment comprises a plurality of areas;
S2: comparing an image of the target object with the acquired image of the area, and outputting a comparison result, wherein the image of the target object comprises a face image of the target object;
S3: judging, according to the comparison result, whether the image of the area contains the image of the target object,
when the comparison result indicates that the image of the area contains the image of the target object, executing step S4; otherwise, updating the area to be acquired and returning to step S1;
S4: determining the direction of the target object in the world coordinate system according to the image of the area and a camera calibration algorithm.
Taking a video camera equipped with a recording device as an example, in step S1 the camera captures an image of only one area of the current recording environment at a time. In step S2, the stored face image of the target object is compared with the image of that area acquired in step S1, for example by an image recognition technique executed by a processor of the camera, and the comparison result is output. In step S3, it is determined from the comparison result whether the image of the area contains the image of the target object, that is, whether the target object is in the area. When the comparison result indicates that it does, step S4 is executed; otherwise, the area to be acquired is updated and the process returns to step S1, and this is repeated until the target object is found. In step S4, based on the image of the area containing the target object, the embodiment of the present invention preferably determines the direction of the target object by using a camera calibration algorithm, for example by converting between the image coordinate system and the world coordinate system to obtain the direction of the target object in the world coordinate system.
For example, the current recording environment may include three areas: a first area, a second area, and a third area. Updating the acquired area in step S3 then means, for example, moving on to the second area or the third area when the image of the first area does not contain the target object.
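As a hedged illustration of steps S1 to S4, the search loop can be sketched in Python. The sketch assumes that each area is reached by panning the camera to a known angle, that a face matcher returns the pixel column of the matched face (or nothing), and that calibration has already produced the focal length fx and the principal point cx in pixels. The names capture, locate_face, area_pan_deg, fx and cx are hypothetical and do not come from the embodiment, and only the horizontal bearing is computed, using a simple pinhole camera model.

```python
import math
from typing import Callable, Optional, Sequence


def pixel_to_azimuth_deg(u: float, cx: float, fx: float) -> float:
    """Horizontal angle of pixel column u relative to the optical axis (pinhole model)."""
    return math.degrees(math.atan2(u - cx, fx))


def find_target_direction(
    area_pan_deg: Sequence[float],
    capture: Callable[[float], object],
    locate_face: Callable[[object], Optional[float]],
    cx: float,
    fx: float,
) -> Optional[float]:
    """Scan the areas in order and return the target bearing in degrees, or None."""
    for pan in area_pan_deg:
        image = capture(pan)        # S1: acquire the image of one area
        u = locate_face(image)      # S2: compare with the stored face image
        if u is None:               # S3: target not in this area, try the next one
            continue
        # S4: bearing = pan angle of the area + angle of the face within the image
        return pan + pixel_to_azimuth_deg(u, cx, fx)
    return None
```

With an assumed 1280-pixel-wide image (cx = 640) and fx = 640, a face found at the right edge of the area panned to 30 degrees gives a bearing of about 75 degrees, with positive angles taken to the right.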
As another implementation, in step S11, the embodiment of the present invention preferably determines the direction of the target object in the world coordinate system according to the time difference and the intensity difference with which the sound emitted by the target object reaches the two sound recording devices.
Specifically, the recording device may employ two or more microphones having a focusing function. When the target object emits sound, the position of the target object (the specified sound source) is determined using the binaural effect. The binaural effect is a spatial localization principle. In humans, for example, the ears are symmetrically placed on the two sides of the head, and the auricles and the head effectively shadow incoming sound. Because the direct sound and the reflected sound of a source reach the two ears with different timing and different frequency-dependent intensity, the same source produces a noticeable time difference and intensity difference between the ears, and these differences allow the position of the source to be judged clearly and accurately. This phenomenon is the "binaural effect".
The embodiment of the invention preferably determines the position of the target object under the world coordinate system through a binaural effect algorithm. After the position of the target object is determined, the direction of the target object relative to the central points of the two sound recording devices in the world coordinate system can be determined according to the positions of the central points of the two sound recording devices and the position of the target object.
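The time-difference part of this idea can be illustrated with a minimal sketch. It is a simplification, not the embodiment's binaural algorithm: it assumes a far-field source (plane wave), two microphones a known distance apart, and a speed of sound of 343 m/s, and it ignores the intensity difference entirely. The function name and the sign convention (positive delta_t_s means the sound reached the right microphone first) are assumptions made for illustration.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly room temperature


def bearing_from_time_difference(
    delta_t_s: float,
    mic_spacing_m: float,
    speed: float = SPEED_OF_SOUND_M_S,
) -> float:
    """Far-field bearing in degrees from the broadside direction of a two-microphone pair.

    delta_t_s is the arrival time at the left microphone minus the arrival time
    at the right microphone, so a positive value places the source to the right.
    """
    if mic_spacing_m <= 0:
        raise ValueError("mic_spacing_m must be positive")
    # Path-length difference divided by spacing equals sin(bearing) for a plane wave.
    ratio = speed * delta_t_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp measurement noise into asin's domain
    return math.degrees(math.asin(ratio))
```

A zero time difference corresponds to a source straight ahead of the center point of the pair; the largest physically possible time difference corresponds to a source on the line through both microphones.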
Returning to fig. 1, in step S12, the recording apparatus is controlled to adjust the recording direction to coincide with the direction of the target object in the world coordinate system.
Specifically, the sound recording device includes, for example, a directional microphone, such as a cardioid microphone or a hypercardioid microphone. In step S12, the recording angle or direction of the directional microphone is adjusted, according to the direction of the target object in the world coordinate system, to be consistent with the direction of the target object. Taking a camera with a directional microphone as an example, a processor sends a rotation instruction to a rotation mechanism of the camera to control it to rotate, so that the recording direction of the directional microphone on the camera body is consistent with the direction of the target object. During recording, the directional microphone therefore collects or records only the sound from the direction of the target object and does not record sound from other directions, so that most of the noise is effectively excluded.
In step S13, the sound signal from the determined direction is recorded in real time, and the recorded sound signal is subjected to a denoising process.
In order to filter out noise in the recorded sound, the embodiment of the present invention preferably performs a denoising process on the recorded sound using one or more of an echo cancellation process, a beamforming process, a noise suppression process, and a dereverberation process in step S13.
The echo cancellation process, the beamforming process, the noise suppression process, and the dereverberation process employed in the embodiment of the present invention will be described one by one below.
1) Echo cancellation processing
While a microphone records sound, the sound it captures includes both the sound emitted directly by the sound source (target object) and echoes of the sound emitted by the sound source and/or a loudspeaker after one or more reflections. Many high-noise environments, such as noisy conference rooms or lounges and hands-free telephones in automobiles, require effective echo cancellation.
In the embodiment of the present invention, the recorded sound signal is converted into the frequency domain by, for example, a conventional fast Fourier transform (FFT) algorithm or a modulated complex lapped transform (MCLT) algorithm. The sound signal in the frequency domain is then processed by a plurality of adaptive acoustic echo cancellation filters to cancel the echo in the frequency domain.
Specifically, the recorded sound signal is first converted into a frequency-domain sound signal by the FFT or MCLT algorithm. For each frequency in the frequency-domain sound signal, the outputs of a plurality of acoustic echo cancellation filters, e.g., N filters, are computed, each filter using different parameters or a different adaptation technique. For each frequency, a linear combination of the outputs of the N filters is then calculated. Finally, the linear combinations for all frequencies are assembled and converted back to the time domain.
In an embodiment of the present invention, the adaptation of the N acoustic echo cancellation filters is preferably provided by the momentum normalized least mean square (MNLMS) algorithm and the normalized least mean square (NLMS) algorithm. The MNLMS algorithm adapts smoothly and is suited to a static environment, such as a room in which nothing moves much. The NLMS algorithm is suited to a dynamic environment, for example one in which people often move.
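To make the NLMS adaptation concrete, the following is a minimal sketch of a single NLMS echo canceller. It is deliberately simpler than the scheme described above: it runs in the time domain rather than per frequency, uses one filter rather than a linear combination of N filters, and omits the momentum variant. It also assumes that a reference signal for the echo (for example, the loudspeaker signal) is available. The function name, the tap count, the step size mu and the regularizer eps are illustrative assumptions.

```python
from typing import List, Sequence


def nlms_echo_cancel(
    reference: Sequence[float],
    mic: Sequence[float],
    taps: int = 4,
    mu: float = 0.5,
    eps: float = 1e-8,
) -> List[float]:
    """Subtract an adaptively estimated echo of `reference` from `mic`."""
    w = [0.0] * taps   # adaptive filter coefficients (echo path estimate)
    x = [0.0] * taps   # most recent reference samples, newest first
    out: List[float] = []
    for ref, d in zip(reference, mic):
        x = [ref] + x[:-1]
        y = sum(wi * xi for wi, xi in zip(w, x))   # estimated echo sample
        e = d - y                                   # echo-cancelled output sample
        power = sum(xi * xi for xi in x) + eps      # normalization term of NLMS
        w = [wi + mu * e * xi / power for wi, xi in zip(w, x)]
        out.append(e)
    return out
```

If the microphone signal is only a delayed and attenuated copy of the reference, the output should decay toward zero as the coefficients converge; any sound that is uncorrelated with the reference passes through largely unchanged.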
2) Beamforming processing
Since the sound emitted from the sound source (target object) gradually weakens as the distance increases, the sound may already be weak when it reaches the microphone, resulting in a less than ideal recording effect. In addition, even if the recording direction of the microphone is adjusted to the direction in which the target object is located, it is impossible to completely eliminate interference from noise in other directions or other areas with the sound emitted from the target object.
Therefore, in order to compensate for the loss along the sound propagation path and to reduce interference from noise in directions other than the intended direction, the present invention preferably employs beamforming in recording the sound.
Specifically, for example, a microphone array composed of a plurality of microphones is used, and the gain of the relative phase and amplitude of the sound signal received by each microphone is concentrated in one direction (i.e., the direction in which the target object is located) based on the principle of mutual interference of waves. For each microphone, a specific phase delay is added to compensate for the phase of the sound signal received by that microphone. After phase compensation, the effective signals in the sound signals received by each microphone can be aligned in phase, so that the effective signals received by different microphones become large in amplitude after being added. On the other hand, when the interference signals propagating in other directions reach the microphone array, the delay corresponding to each microphone does not coincide with the time difference of arrival of the signals at the microphones, and thus the amplitude does not become large after the summation. In this way, during the recording process, the microphone array can increase the strength of the effective signal in the direction of the target object by matching a plurality of microphones with a specific delay, and simultaneously weaken the interference signals in other directions, thereby effectively blocking the interference signals from other directions except the intended direction.
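The delay compensation and summation described above correspond to what is commonly called delay-and-sum beamforming, which can be sketched as follows. The sketch assumes that the per-microphone delays toward the target direction are already known and are whole numbers of samples; the function name and the averaging (rather than plain summing) are illustrative choices, not details taken from the embodiment.

```python
from typing import List, Sequence


def delay_and_sum(
    channels: Sequence[Sequence[float]],
    delays: Sequence[int],
) -> List[float]:
    """Align each channel by its delay (in samples) and average the aligned channels.

    delays[i] is the number of samples by which channel i lags the reference
    channel for sound arriving from the target direction (non-negative).
    """
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]
```

When the delays match the target direction, the target's samples line up and keep their full amplitude after averaging, whereas a signal whose inter-microphone delay differs is added out of step and is attenuated.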
3) Noise suppression processing
Noise suppression, as the name implies, is the removal of noise components from an audio signal. Specifically, the frequency band of the noise is determined, then the noise in the noise frequency band is filtered out, and the effective signal is reserved. In the embodiment of the present invention, it is preferable to determine the frequency band where the noise exists by using spectral subtraction and filter the noise in the noise frequency band.
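A minimal sketch of magnitude spectral subtraction on a single frame is given below. It assumes that a noise magnitude spectrum has been estimated beforehand (for example, from a frame that contains noise only) and reuses the phase of the noisy frame. A naive discrete Fourier transform is used so that the example has no dependencies; a practical implementation would use an FFT with overlapping windowed frames. All function names and the floor parameter are illustrative assumptions.

```python
import cmath
import math
from typing import List, Sequence


def dft(x: Sequence[float]) -> List[complex]:
    """Naive discrete Fourier transform (O(n^2)), adequate for short frames."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: Sequence[complex]) -> List[float]:
    """Inverse of dft, returning the real part of each time-domain sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def spectral_subtract(
    frame: Sequence[float],
    noise_mag: Sequence[float],
    floor: float = 0.0,
) -> List[float]:
    """Subtract the estimated noise magnitude in every bin, keeping the noisy phase."""
    cleaned = []
    for value, noise in zip(dft(frame), noise_mag):
        mag = max(abs(value) - noise, floor * abs(value))  # never go below the floor
        cleaned.append(cmath.rect(mag, cmath.phase(value)))
    return idft(cleaned)
```

Bins in which the noise estimate is large are attenuated strongly, which is how the frequency band occupied by the noise is effectively identified and suppressed, while bins in which the estimate is near zero are left almost untouched.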
4) Dereverberation processing
As described above, during the recording of sound by a microphone, the sound signal captured by the microphone may include reverberation or echoes from different surfaces. For example, in a room, sound signals (such as speech or music) are reflected by walls, ceiling and floor. Thus, the sound signal captured by the microphone is an acoustic signal that is a combination of the desired signal (received directly from the sound source) and the interfering signal (reflected via the reflecting surface). This interfering signal is referred to as the reverberation part of the sound signal.
In the embodiment of the invention, dereverberation proceeds as follows: first, a frequency spectrum of the reverberation part of the received signal is estimated using a delayed frequency spectrum of the received signal and a parameter indicating the decay of the reverberation part over time; then, the frequency spectrum of the reverberation part is filtered out with a filter, while a frequency spectrum of the effective part is estimated from the frequency spectrum of the reverberation part by spectral subtraction; finally, the effective signal is reconstructed from the estimated frequency spectrum of the effective part.
In an embodiment of the invention, the parameter indicative of the decay of the reverberation part over time is preferably:
where a denotes the parameter indicating the decay of the reverberation part over time, e is the base of the natural logarithm (about 2.718), fs is the sampling frequency, ln is the natural logarithm (3 ln 10 is about 6.9), T60 is the reverberation time, i.e. the length of time after which the signal level has dropped by 60 dB relative to the original signal level, and k denotes the number of samples contained in each frame; for example, the recorded sound signal is split into n frames, and each frame is divided into k samples at a preset frequency.
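The symbols listed above are consistent with an exponential decay model, which the following sketch uses purely as an assumption. A 60 dB drop in level is a factor of 10^-3 in amplitude, so an envelope exp(-delta * t) that reaches that factor at t = T60 has delta = 3 ln 10 / T60, and over one frame of k samples (k / fs seconds) the amplitude decays by exp(-3 ln 10 * k / (T60 * fs)); the corresponding power factor is the square of this. Whether the embodiment's parameter a applies to amplitude or to power, and exactly how the delayed spectrum is weighted, are not stated here, so the function names, the power flag, and the choice of a raised to the number of delayed frames are all hypothetical.

```python
import math
from typing import List, Sequence


def decay_per_frame(t60_s: float, fs_hz: float, k: int, power: bool = True) -> float:
    """Decay factor over one frame of k samples under an assumed exponential decay."""
    exponent = -3.0 * math.log(10.0) * k / (t60_s * fs_hz)  # amplitude-domain exponent
    return math.exp(2.0 * exponent if power else exponent)


def suppress_reverb(
    power_frames: Sequence[Sequence[float]],
    a: float,
    delay_frames: int = 1,
    floor: float = 0.0,
) -> List[List[float]]:
    """Estimate the reverberant power as a**delay_frames times the delayed spectrum and subtract it."""
    scale = a ** delay_frames
    cleaned: List[List[float]] = []
    for i, frame in enumerate(power_frames):
        if i < delay_frames:
            cleaned.append(list(frame))  # no delayed frame available yet
            continue
        delayed = power_frames[i - delay_frames]
        cleaned.append([max(p - scale * q, floor * p) for p, q in zip(frame, delayed)])
    return cleaned
```

As a sanity check on the assumed model, a frame that lasts exactly T60 (k = T60 * fs) gives an amplitude factor of 10^-3 and a power factor of 10^-6.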
Through the steps S11 to S13, the embodiment of the invention can weaken and substantially filter other noises except for the sound of the target object, and only retain and amplify the sound of the target object in the noisy environment.
Further, in order to make the recorded sound more accurate, the embodiment of the present invention preferably performs fine screening on the sound signal subjected to the denoising processing by using voiceprint recognition, so as to completely filter out other sound signals except the sound signal of the target object.
For example, a historical sound signal of the target object is input or stored in advance. The stored historical sound signal is then compared with the denoised sound signal by voiceprint recognition, for example on the basis of one or more of the frequency, loudness, pitch, and timbre of the target object's voice, to determine the frequency band in which the target object's sound signal lies within the denoised sound signal. A filter then removes the sound signals of non-target objects in the frequency bands outside that band.
Preferably, a prototype spectral model of the target object is established in advance, and the denoised sound signal is then compared and matched against the prototype spectral model, for example by spectral comparison and spectral analysis; sound signals that do not fit the prototype spectral model are filtered out according to the results of the comparison and analysis.
Voiceprint recognition therefore allows the sound signal of the target object to be identified more accurately and clearly, and noise and other sounds apart from the sound signal of the target object to be removed.
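The band-selection step can be illustrated with a deliberately crude sketch that stands in for real voiceprint recognition. It assumes the prototype spectral model is simply a list of per-band energies measured from the target's historical sound signal, treats a band as belonging to the target when its prototype energy is at least a chosen fraction of the strongest band, and uses cosine similarity as a simple match score. The function names and the relative threshold are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity of two band-energy profiles (1.0 means identical shape)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def target_bands(prototype: Sequence[float], rel_threshold: float = 0.1) -> List[int]:
    """Indices of the bands in which the prototype carries significant energy."""
    peak = max(prototype)
    return [i for i, e in enumerate(prototype) if e >= rel_threshold * peak]


def keep_target_bands(
    band_energy: Sequence[float],
    prototype: Sequence[float],
    rel_threshold: float = 0.1,
) -> List[float]:
    """Zero every band of the recording that lies outside the target's bands."""
    keep = set(target_bands(prototype, rel_threshold))
    return [e if i in keep else 0.0 for i, e in enumerate(band_energy)]
```

After the out-of-band energy is removed, the recording's profile should match the prototype more closely than before, which gives a simple way to check that the filtering moved the signal toward the target's voiceprint.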
As a more preferable implementation, the embodiment of the present invention may further process recorded audio or video that contains the sound signals of multiple sound sources (target objects) together with other noise, and obtain the noise-free sound signal of each individual sound source.
Specifically, for example, the frequency band of the sound signal of each individual sound source in the audio or video is determined by voiceprint recognition, and the sound signal in that band is extracted to obtain the sound signal of each individual sound source. The sound signal of each individual sound source is then denoised using one or more of echo cancellation processing, beamforming processing, noise suppression processing, and dereverberation processing. Alternatively, the audio or video may be denoised first and the sound signal of each individual sound source then extracted by voiceprint recognition, or the denoising may be omitted altogether; the choice is made flexibly according to the actual situation and is not limited by the present invention.
Preferably, the extracted sound signal of each individual sound source can be labeled to indicate which target object it belongs to, which facilitates quick editing at a later stage.
Fig. 2 schematically shows an apparatus for recording sound according to an embodiment of the present invention. As shown in fig. 2, the apparatus includes:
a sound capture device 201;
a processor 202; and
a memory 203, on which executable code is stored, which when executed by the processor enables determining the direction in which the target object is located, and issuing instructions to the sound capturing device 201 to control the sound capturing device to adjust the direction of the recorded sound to the determined direction and record the sound signal from the determined direction in real time, and also enables denoising the recorded sound signal.
The sound capturing device 201 includes a microphone, specifically, for example, a directional microphone, and the directional microphone includes, for example, a cardioid microphone and a hypercardioid microphone.
The device further includes a filter for receiving and executing instructions from the processor 202 to filter signals in a specified frequency band (e.g., a frequency band in which noise is present).
For details of the operation of the sound capturing device 201, the processor 202 and the memory 203, refer to the above description of the method with reference to fig. 1; they are not repeated here.
In summary, embodiments of the present invention provide a method and an apparatus for recording sound that are applicable to a noisy recording environment with multiple sound sources. By adjusting the recording direction, the recording device collects or records only the sound from the direction of the target object and does not record sound from other directions, so that most of the noise is effectively excluded during recording; noise that is still recorded, such as echo, is then filtered out by the denoising processing, which ensures the recording quality.
Further, in order to enable the recorded sound to be more accurate, the embodiment of the invention performs fine screening on the sound signal subjected to denoising processing through voiceprint recognition, and achieves the effect of completely filtering out other sound signals except the sound signal of the target object.
Further, the embodiment of the invention can also process recorded audio or video that contains the sound signals of a plurality of sound sources (target objects) together with other noise, and obtain the noise-free sound signal of each individual sound source. In addition, the embodiment of the invention can label the sound signal of each individual sound source to identify the target object to which it belongs, which facilitates quick editing at a later stage.
Those skilled in the art will appreciate that the modules or steps of the invention described above can be implemented in a general purpose computing device, centralized on a single computing device or distributed across a network of computing devices, and optionally implemented in program code that is executable by a computing device, such that the modules or steps are stored in a memory device and executed by a computing device, fabricated separately into integrated circuit modules, or fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
Although the embodiments of the present invention have been described above, the above description is only for the convenience of understanding the present invention, and is not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.