CN116564329A - Real-time conversation voiceprint noise reduction method, electronic equipment and storage medium - Google Patents
Real-time conversation voiceprint noise reduction method, electronic equipment and storage medium
- Publication number
- CN116564329A (application number CN202310462873.3A)
- Authority
- CN
- China
- Prior art keywords
- audio
- noise reduction
- voiceprint
- feature vector
- voiceprint feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Telephonic Communication Services (AREA)
Abstract
A real-time call voiceprint noise reduction method, an electronic device, and a storage medium. The real-time call voiceprint noise reduction method includes: obtaining real-time call audio, a first voiceprint feature vector of the current speaker, and registration audio of the current speaker; inputting the registration audio into a voiceprint feature extraction network trained synchronously with a pre-trained noise reduction network to obtain a second voiceprint feature vector; inputting the real-time call audio into the encoding part of the pre-trained noise reduction network to obtain a third voiceprint feature vector, where the pre-trained noise reduction network includes an encoding part and other parts; and concatenating the first voiceprint feature vector, the second voiceprint feature vector, and the third voiceprint feature vector and inputting the result into the other parts of the pre-trained noise reduction network. The output audio thus better preserves the speaker's voice and effectively suppresses interference from other voices.
Description
Technical Field
The embodiments of the present application relate to the technical field of speech processing, and in particular to a real-time call voiceprint noise reduction method, an electronic device, and a storage medium.
Background Art
Voiceprint call noise reduction, as the name suggests, adds voiceprint information to call noise reduction technology: even in a rather noisy environment, or in a complex scene with several interfering speakers, the main speaker's voice can be extracted clearly while the speech of other speakers and the background noise are filtered out. This technology has a very wide range of applications in real production and everyday life.
In the prior art, a voiceprint noise reduction solution includes a registration stage and a test stage.
Registration stage: the main speaker first records a 20-30 s segment of audio in a quiet environment, following the prompts on the user interface (UI); this segment is used to extract the speaker's voiceprint information. In real use, the user may not speak clearly, the background noise may be loud, the registered speaker may speak too fast, or the recording may be too short, all of which degrade the collected speaker information and in turn the normal operation of the downstream algorithm. For this reason, the quality of the audio is checked during the registration stage. Specifically, for speech quality, voice activity detection (VAD) is applied to the registration audio and its signal-to-noise ratio is computed from the speech and background segments; for character accuracy, the registration audio is passed through speech recognition and the recognized text is checked against the reference text; finally, the audio length remaining after VAD must reach a certain minimum. Only when these conditions are met can the voiceprint information be registered correctly.
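The quality gate described above can be illustrated with a short sketch. The simple energy-based VAD and the thresholds below (15 dB SNR, 10 s of voiced audio after VAD, 90% character accuracy) are illustrative assumptions, not values taken from this application; the recognized text is assumed to come from an external speech recognition engine.

```python
# Sketch of the registration-stage quality gate: energy-based VAD, SNR estimate,
# post-VAD speech length, and character accuracy against the reference text.
import difflib

import numpy as np


def frame_energies(audio: np.ndarray, sample_rate: int, frame_ms: float = 25.0) -> np.ndarray:
    """Mean power of non-overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    return (frames ** 2).mean(axis=1)


def registration_ok(audio: np.ndarray, sample_rate: int,
                    recognized_text: str, reference_text: str,
                    min_snr_db: float = 15.0, min_speech_s: float = 10.0,
                    min_char_acc: float = 0.9) -> bool:
    energies = frame_energies(audio, sample_rate)
    threshold = 0.1 * energies.max()                 # crude energy-based VAD threshold
    speech = energies[energies >= threshold]
    background = energies[energies < threshold]
    if speech.size == 0 or background.size == 0:
        return False
    snr_db = 10.0 * np.log10(speech.mean() / (background.mean() + 1e-12))
    speech_seconds = speech.size * 0.025             # duration of the frames kept by the VAD
    char_acc = difflib.SequenceMatcher(None, recognized_text, reference_text).ratio()
    return snr_db >= min_snr_db and speech_seconds >= min_speech_s and char_acc >= min_char_acc
```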
Test stage: since most PC products currently on the market pick up sound with more than one microphone, a microphone array composed of multiple microphones can better perform personalized enhancement of the speech signal. Voiceprint noise reduction generally requires that the main speaker directly in front of the notebook be enhanced, that speech signals from other directions be suppressed, and that other interfering voices from the same direction also be suppressed. The following technologies are therefore used: echo cancellation, microphone array processing, voiceprint noise reduction, and automatic gain control.
At present, apart from notebooks and mobile phones, few kinds of products offer this capability, and their performance is poor. The main defects are, first, suppression of the registered speaker's voice: the main speaker's volume fluctuates and words are even dropped, mainly because the model confuses the main speaker's voice with the interfering speakers' voices and suppresses both; and second, incomplete suppression of interfering voices, leaving residual audio or even failing to remove the interfering voices at all. According to the results of our offline processing, however, our algorithm preserves the registered speaker's audio well and effectively suppresses interference from other speakers.
Like other noise reduction methods, voiceprint noise reduction also faces the trade-off between incomplete removal of interference and excessive suppression of the main speaker. In real use, however, we try to ensure that the main speaker's voice is preserved as much as possible, even if some residue of the interfering speakers remains.
Summary of the Invention
Embodiments of the present invention provide a real-time call voiceprint noise reduction method and apparatus, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a real-time call voiceprint noise reduction method, including: obtaining real-time call audio, a first voiceprint feature vector of the current speaker, and registration audio of the current speaker; inputting the registration audio into a voiceprint feature extraction network trained synchronously with a pre-trained noise reduction network to obtain a second voiceprint feature vector; inputting the real-time call audio into the encoding part of the pre-trained noise reduction network to obtain a third voiceprint feature vector, where the pre-trained noise reduction network includes an encoding part and other parts; and concatenating the first voiceprint feature vector, the second voiceprint feature vector, and the third voiceprint feature vector and inputting the result into the other parts of the pre-trained noise reduction network.
In a second aspect, an embodiment of the present invention provides an electronic device, including: at least one processor, and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform any of the above real-time call voiceprint noise reduction methods of the present invention.
In a third aspect, an embodiment of the present invention provides a storage medium storing one or more programs that include execution instructions, where the execution instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device) to perform any of the above real-time call voiceprint noise reduction methods of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer program product, including a computer program stored on a storage medium, where the computer program includes program instructions that, when executed by a computer, cause the computer to perform any of the above real-time call voiceprint noise reduction methods.
The method of the present application obtains the real-time call audio, the first voiceprint feature vector of the current speaker, and the registration audio of the current speaker; inputs the registration audio into the synchronously trained voiceprint feature extraction network to obtain the second voiceprint feature vector; inputs the real-time call audio into the encoder part of the pre-trained noise reduction network to obtain the third voiceprint feature vector; and finally concatenates the first, second, and third voiceprint feature vectors and feeds the result into the other parts of the pre-trained noise reduction network for processing. The output audio thus better preserves the speaker's voice and effectively suppresses interference from other voices.
Brief Description of the Drawings
Fig. 1 is a flowchart of a real-time call voiceprint noise reduction method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of another real-time call voiceprint noise reduction method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of excessive suppression of the main speaker in a specific example of the prior art provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of incompletely removed interfering speakers in a specific example of the prior art provided by an embodiment of the present invention;
Fig. 5 is a network framework diagram of real-time call voiceprint noise reduction in a specific example of a real-time call voiceprint noise reduction method provided by an embodiment of the present invention;
Fig. 6 is a flow framework diagram of the registration stage of real-time call voiceprint noise reduction in a specific example of a real-time call voiceprint noise reduction method provided by an embodiment of the present invention;
Fig. 7 is a flow framework diagram of the test stage of real-time call voiceprint noise reduction in a specific example of a real-time call voiceprint noise reduction method provided by an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
Detailed Description of the Embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Please refer to Fig. 1, which shows a flowchart of a real-time call voiceprint noise reduction method provided by an embodiment of the present invention.
As shown in Fig. 1, in step 101, real-time call audio, a first voiceprint feature vector of the current speaker, and registration audio of the current speaker are obtained;
In step 102, the registration audio is input into a voiceprint feature extraction network trained synchronously with a pre-trained noise reduction network to obtain a second voiceprint feature vector;
In step 103, the real-time call audio is input into the encoding part of the pre-trained noise reduction network to obtain a third voiceprint feature vector, where the pre-trained noise reduction network includes an encoding part and other parts;
In step 104, the first voiceprint feature vector, the second voiceprint feature vector, and the third voiceprint feature vector are concatenated and input into the other parts of the pre-trained noise reduction network.
In this embodiment, for step 101, the real-time call audio, the first voiceprint feature vector of the current speaker, and the registration audio of the current speaker are obtained. For example, the real-time call audio is captured by the device during the call, and the first voiceprint feature vector and registration audio prepared in advance are retrieved; the retrieved first voiceprint feature vector and registration audio belong to the same speaker.
Then, for step 102, the registration audio is input into the voiceprint feature extraction network trained synchronously with the pre-trained noise reduction network to obtain the second voiceprint feature vector. For example, the voiceprint feature extraction network trained synchronously with the pre-trained noise reduction network extracts the voiceprint information of the registration audio to obtain the second voiceprint feature vector.
Then, for step 103, the real-time call audio is input into the encoding part of the pre-trained noise reduction network to obtain the third voiceprint feature vector, where the pre-trained noise reduction network includes an encoding part and other parts. For example, the real-time call audio captured by the device is fed into the pre-trained noise reduction network and passed through its encoder to obtain the third voiceprint feature vector.
Finally, for step 104, the first voiceprint feature vector, the second voiceprint feature vector, and the third voiceprint feature vector are concatenated and input into the other parts of the pre-trained noise reduction network. For example, the first, second, and third voiceprint feature vectors are concatenated, and the concatenated feature is then processed by the other parts of the pre-trained noise reduction network.
The method of this embodiment obtains the real-time call audio, the first voiceprint feature vector of the current speaker, and the registration audio of the current speaker; inputs the registration audio into the synchronously trained voiceprint feature extraction network to obtain the second voiceprint feature vector; inputs the real-time call audio into the encoder part of the pre-trained noise reduction network to obtain the third voiceprint feature vector; and finally concatenates the first, second, and third voiceprint feature vectors and feeds the result into the other parts of the pre-trained noise reduction network for processing. The output audio thus better preserves the speaker's voice and effectively suppresses interference from other voices.
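Since this application does not disclose the concrete layers of the pre-trained noise reduction network, the following PyTorch sketch only illustrates the data flow of steps 101 to 104: a fixed speaker embedding (first vector), a synchronously trained voiceprint branch (second vector), the encoder output of the denoiser (third vector), and their concatenation before the other parts of the network. All module types, dimensions, and the masking output are assumptions made for illustration.

```python
# Data-flow sketch of steps 101-104; the concrete layers are placeholders.
import torch
import torch.nn as nn


class VoiceprintDenoiser(nn.Module):
    def __init__(self, feat_dim: int = 257, emb_dim: int = 128, hid: int = 256):
        super().__init__()
        self.encoder = nn.Conv1d(feat_dim, hid, kernel_size=3, padding=1)   # encoding part
        self.vp_branch = nn.GRU(feat_dim, emb_dim, batch_first=True)        # synchronously trained voiceprint net
        # "other parts": consume encoder output plus the two voiceprint vectors per frame
        self.rest = nn.Sequential(
            nn.Conv1d(hid + 2 * emb_dim, hid, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hid, feat_dim, kernel_size=1),
            nn.Sigmoid(),                                                    # mask over the noisy spectrum
        )

    def forward(self, noisy_spec, reg_spec, fixed_emb):
        # noisy_spec: (B, T, F) noisy magnitude spectrogram, reg_spec: (B, T_reg, F), fixed_emb: (B, emb_dim)
        enc = self.encoder(noisy_spec.transpose(1, 2))                       # third vector, (B, hid, T)
        _, h = self.vp_branch(reg_spec)                                      # second vector, (1, B, emb_dim)
        joint_emb = h[-1]
        frames = enc.shape[-1]
        cond = torch.cat([fixed_emb, joint_emb], dim=-1)                     # first + second vectors
        cond = cond.unsqueeze(-1).expand(-1, -1, frames)                     # broadcast over time
        mask = self.rest(torch.cat([enc, cond], dim=1))                      # (B, F, T)
        return noisy_spec * mask.transpose(1, 2)                             # masked spectrum


# Usage sketch: enhanced = model(noisy_spec, registration_spec, fixed_speaker_embedding)
```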
In some optional embodiments, the first voiceprint feature vector of the current speaker is obtained by inputting the registration audio of the current speaker into a fixed speaker feature extractor for feature extraction, so that the first voiceprint feature vector is available in advance to facilitate subsequent operations.
Please refer further to Fig. 2, which shows a flowchart of another real-time call voiceprint noise reduction method according to an embodiment of the present invention. This flowchart mainly further defines the training step of Fig. 1, namely the synchronous training of the pre-trained noise reduction network and the voiceprint feature extraction network.
As shown in Fig. 2, in step 201, noisy audio is input into the encoding part of the pre-trained noise reduction network to obtain an encoded result, where the noisy audio has corresponding clean audio and registration audio belonging to the same speaker as the noisy audio;
In step 202, at least the voiceprint extraction result obtained by passing the registration audio through the voiceprint feature extraction network is concatenated with the encoded result to obtain a concatenated result;
In step 203, the concatenated result is further input into the other parts of the pre-trained noise reduction network for processing to obtain the output of the pre-trained noise reduction network;
In step 204, the loss between the output of the pre-trained noise reduction network and the clean audio is calculated, and the pre-trained noise reduction network and the voiceprint extraction model are trained based on the loss.
In this embodiment, for step 201, noisy audio is input into the encoding part of the pre-trained noise reduction network to obtain the encoded result, where the noisy audio has corresponding clean audio and registration audio belonging to the same speaker as the noisy audio. For example, in the training stage, noisy audio prepared in advance is fed into the pre-trained noise reduction network and processed by its encoder to obtain the encoded result, and the noisy audio has corresponding clean audio as well as registration audio from the same speaker as the noisy audio.
Then, for step 202, at least the voiceprint extraction result obtained by passing the registration audio through the voiceprint feature extraction network is concatenated with the encoded result to obtain the concatenated result. For example, the registration audio at least needs to pass through the voiceprint network to obtain a voiceprint extraction result, which is then concatenated with the encoded result to obtain the concatenated result.
Then, for step 203, the concatenated result is further input into the other parts of the pre-trained noise reduction network for processing to obtain the output of the pre-trained noise reduction network. For example, after the voiceprint extraction result is concatenated with the encoded result, the concatenated feature is processed by the other parts of the pre-trained noise reduction network to produce the output audio.
Finally, for step 204, the loss between the output of the pre-trained noise reduction network and the clean audio is calculated, and the pre-trained noise reduction network and the voiceprint extraction model are trained based on the loss. For example, the audio output by the pre-trained noise reduction network is compared with the clean audio to compute the loss of the processed audio, and the pre-trained noise reduction network and the voiceprint extraction model are trained with this loss.
The method of this embodiment inputs noisy audio, which has corresponding clean audio and registration audio of the same speaker, into the encoder part of the pre-trained noise reduction network to obtain the encoded result; inputs the registration audio into the voiceprint extractor to obtain the extraction result; concatenates the voiceprint extraction result with the encoded result and feeds it into the other parts of the pre-trained noise reduction network to produce the output audio; and finally compares the output audio with the clean audio to compute the loss of the processed audio and trains the pre-trained noise reduction network and the voiceprint extraction model with this loss.
In some possible embodiments, the registration audio may first be input into a fixed speaker feature vector extractor to obtain a fixed extraction result, and the registration audio is also input into the voiceprint feature extraction network to obtain a voiceprint extraction result; the fixed extraction result, the extraction result of the voiceprint feature extraction network, and the encoded result are then concatenated to obtain the concatenated result, so that the concatenated features yield more accurate audio.
In some optional embodiments, the loss between the output of the pre-trained noise reduction network and the clean audio is computed with a scale-invariant signal-to-noise ratio (SI-SNR) loss function. The SI-SNR loss is used because no speaker-related loss function is added during the training stage of the model; in this way the pre-trained noise reduction network is dedicated only to removing interfering voices, and the scale-invariant signal-to-noise ratio makes full use of the low-dimensional voiceprint information, so that irrelevant voiceprint information is removed and only the registered speaker's voiceprint information is retained.
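The SI-SNR loss referred to above is commonly defined as follows in the speech enhancement literature; the sketch below uses that standard formulation (negated so it can be minimized) and is not code taken from this application.

```python
# Negative SI-SNR between an estimated and a reference time-domain signal.
import torch


def si_snr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # est, ref: (B, N) time-domain waveforms
    est = est - est.mean(dim=-1, keepdim=True)        # remove the mean so the measure is scale invariant
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # project the estimate onto the reference signal
    s_target = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target
    si_snr = 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)
    return -si_snr.mean()                              # negate: higher SI-SNR means lower loss
```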
In some optional embodiments, the noisy audio is obtained by mixing the clean audio of the current speaker with at least one interfering speaker's audio. The audio can be mixed directly, or the clean audio can first be convolved with different room impulse responses before mixing, and noise at different signal-to-noise ratios can subsequently be added for the speaker and the interfering speakers, so that the noisy audio better simulates a real call scene.
In some optional embodiments, before the real-time call audio is input into the encoding part of the pre-trained noise reduction network, acoustic echo cancellation (AEC) and beamforming (BF) are applied to the real-time call audio; the processed audio is then input into the pre-trained noise reduction network, and the network output is further processed with automatic gain control (AGC), so that the final output audio sounds more comfortable and stable.
In some optional embodiments, after the first voiceprint feature vector, the second voiceprint feature vector, and the third voiceprint feature vector are concatenated and processed by the other parts of the pre-trained noise reduction network, the processed audio can be sent to the far-end party of the real-time call, or input into a speech recognition engine.
Please refer to Fig. 3, which shows a schematic diagram of excessive suppression of the main speaker in a specific example of the prior art provided by an embodiment of the present invention.
As shown in Fig. 3, the Mix audio is the noisy audio and the Ref audio is the clean audio corresponding to the noisy audio. From the denoised audio it can be seen that the registered speaker's noisy audio is excessively suppressed after being processed by the noise reduction network, because the noise reduction network confuses the registered speaker's voice with the interfering speaker's voice.
Please refer to Fig. 4, which shows a schematic diagram of incompletely removed interfering speakers in a specific example of the prior art provided by an embodiment of the present invention.
As shown in Fig. 4, the Mix audio is the noisy audio and the Ref audio is the clean audio corresponding to the noisy audio. From the denoised audio it can be seen that after the registered speaker's noisy audio is processed by the noise reduction network, the interfering speaker's audio is not fully suppressed, leaving residual noise.
In the course of implementing the present application, the inventors tried the following technical solutions. One solution uses a fixed speaker vector and does not update this module during model training; the result is that the voices of other speakers are not removed thoroughly enough. Another solution trains the speaker model and the speech enhancement model jointly; this method is rather complicated during model training and fails to consider the real-time requirements of real-world use.
The technical solution of the present application is designed and optimized mainly from the following aspects:
The voiceprint extraction of the present solution consists of two parts. One part is an embedding module that extracts text-independent voiceprint recognition features; this module is already trained and does not need to be retrained during the training of the voiceprint noise reduction model.
The other part of the voiceprint feature extraction has network weights that change along with the training of the voiceprint noise reduction model, so that as much of the main speaker's information as possible is extracted, ensuring that the main speaker's words are not dropped in real use.
In the data preparation stage, the registration audio of each speaker is prepared first, with a length of 30 s, and the test audio of each speaker is concatenated into one long clip. In the model training stage, data are read in real time: each time, 10 s of quiet audio of one main speaker is selected, together with 0-3 interfering speakers. Noise at different signal-to-noise ratios is added for the main speaker and the interfering speakers. To better simulate real scenes, the clean audio can be convolved with the impulse responses of different rooms.
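One way to realize the mixing recipe described above (a 10 s target clip, 0-3 interferers, per-source signal-to-noise ratios, optional room impulse responses) is sketched below; the SNR ranges and the use of scipy's fftconvolve for the room impulse responses are assumptions for illustration only.

```python
# Sketch of constructing one noisy training mixture: the main speaker's 10 s clean
# clip, 0-3 interfering clips, an optional room impulse response, and background noise.
import numpy as np
from scipy.signal import fftconvolve


def scale_to_snr(reference: np.ndarray, interference: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `interference` so the reference-to-interference power ratio equals snr_db."""
    p_ref = np.mean(reference ** 2) + 1e-12
    p_int = np.mean(interference ** 2) + 1e-12
    return interference * np.sqrt(p_ref / (p_int * 10 ** (snr_db / 10)))


def make_mixture(target, interferers, noise, rirs=None, rng=None):
    rng = rng or np.random.default_rng()
    if rirs:                                              # optionally simulate a room
        rir = rirs[rng.integers(len(rirs))]
        target = fftconvolve(target, rir, mode="full")[: len(target)]
    mixture = target.copy()
    for clip in interferers:                              # 0-3 interfering speakers
        clip = np.resize(clip, len(target))
        mixture = mixture + scale_to_snr(target, clip, rng.uniform(0, 10))   # 0-10 dB vs. interferer
    noise = np.resize(noise, len(target))
    mixture = mixture + scale_to_snr(target, noise, rng.uniform(5, 20))      # 5-20 dB vs. noise
    return mixture, target                                # network input and clean training label
```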
Please refer to Fig. 5, which shows a network framework diagram of real-time call voiceprint noise reduction in a specific example of a real-time call voiceprint noise reduction method provided by an embodiment of the present invention.
As shown in Fig. 5, we combine the speech enhancement model and the speaker feature vector extraction model. Step 1: extract emb, where emb refers to the embedding feature mentioned above. The prepared data, namely the registration audio, the clean audio, and the mixed audio, are all converted into the frequency domain; here we use a Hanning window with a frame length of 512, a frame shift of 256, and an FFT length of 512;
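With the parameters stated above (Hanning window, frame length 512, frame shift 256, FFT length 512), the frequency-domain features and the inverse transform could be computed as in the following sketch, which uses torch.stft as one possible implementation.

```python
# STFT front-end with the stated parameters: Hann window, win=512, hop=256, n_fft=512.
import torch


def to_spectrum(wave: torch.Tensor) -> torch.Tensor:
    # wave: (B, N) time-domain audio -> (B, T, 257) complex spectrum
    window = torch.hann_window(512, device=wave.device)
    spec = torch.stft(wave, n_fft=512, hop_length=256, win_length=512,
                      window=window, return_complex=True)          # (B, 257, T)
    return spec.transpose(1, 2)


def to_wave(spec: torch.Tensor, length: int) -> torch.Tensor:
    # inverse transform: complex spectrum back to the time domain
    window = torch.hann_window(512, device=spec.device)
    return torch.istft(spec.transpose(1, 2), n_fft=512, hop_length=256,
                       win_length=512, window=window, length=length)
```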
Step 2: process the noisy audio according to the flow of the noise reduction network on the right side of the figure, and compute the SI-SNR loss between the output of the network and the clean audio;
Step 3: after feature extraction, the registration audio is input into the network and further processed along the time dimension, undergoing the same operations as the noisy audio, and the network output of the registration audio is concatenated after the noisy audio that has entered the network. Here, I-Feature refers to converting the speech from the frequency domain back to the time domain.
Please refer to Fig. 6, which shows a flow framework diagram of the registration stage of real-time call voiceprint noise reduction in a specific example of a real-time call voiceprint noise reduction method provided by an embodiment of the present invention.
As shown in Fig. 6, step 1: the user records about 20-30 s of registration audio in advance; step 2: the registration audio is quality-checked; step 3: voiceprint feature 1 of the speaker is extracted; step 4: voiceprint feature 2 of the speaker is extracted. Here, voiceprint feature 1 is a fixed feature extracted by the fixed speaker feature extractor, and voiceprint feature 2 is extracted by the voiceprint feature extraction network trained synchronously with the pre-trained noise reduction network.
Please refer to Fig. 7, which shows a flow framework diagram of the test stage of real-time call voiceprint noise reduction in a specific example of a real-time call voiceprint noise reduction method provided by an embodiment of the present invention.
As shown in Fig. 7, step 1: the user selects the voice separation mode; step 2: the microphone records the mixed audio; step 3: the mixed voice is obtained and its voiceprint features are extracted; step 4: the audio is processed with acoustic echo cancellation (AEC); step 5: the audio is processed with beamforming (BF); step 6: the audio is input into the voiceprint noise reduction network for processing;
Step 7: the processed audio is passed through automatic gain control (AGC) to adjust the gain, and the audio is then output.
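The test-stage chain of Fig. 7 (AEC, then BF, then the voiceprint noise reduction network, then AGC) can be summarized as a simple frame-by-frame pipeline. The sketch below treats AEC, BF, and the denoiser as placeholder callables, since their internals are outside the scope of this description, and includes only a very simple RMS-based AGC as an illustrative final stage.

```python
# Frame-by-frame sketch of the Fig. 7 test-stage chain. `aec`, `beamformer`, and
# `vp_denoiser` are placeholder callables standing in for the real modules.
from typing import Callable, Iterable, Iterator

import numpy as np

Stage = Callable[[np.ndarray], np.ndarray]


def simple_agc(target_rms: float = 0.05, max_gain: float = 10.0) -> Stage:
    """Step 7: scale each block toward a target RMS level."""
    def apply(frame: np.ndarray) -> np.ndarray:
        rms = float(np.sqrt(np.mean(frame ** 2))) + 1e-12
        return frame * min(max_gain, target_rms / rms)
    return apply


def run_pipeline(frames: Iterable[np.ndarray], aec: Stage, beamformer: Stage,
                 vp_denoiser: Stage, agc: Stage = simple_agc()) -> Iterator[np.ndarray]:
    """Apply the stages in the order of Fig. 7: AEC -> BF -> voiceprint denoising -> AGC."""
    for frame in frames:
        frame = aec(frame)            # step 4: remove far-end echo
        frame = beamformer(frame)     # step 5: enhance the direction of the main speaker
        frame = vp_denoiser(frame)    # step 6: voiceprint noise reduction network
        yield agc(frame)              # step 7: automatic gain control
```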
In other embodiments, an embodiment of the present invention further provides a non-volatile computer storage medium storing computer-executable instructions, where the computer-executable instructions can execute the real-time call voiceprint noise reduction method in any of the above method embodiments;
As an implementation, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
obtain real-time call audio, a first voiceprint feature vector of the current speaker, and registration audio of the current speaker;
input the registration audio into a voiceprint feature extraction network trained synchronously with a pre-trained noise reduction network to obtain a second voiceprint feature vector;
input the real-time call audio into the encoding part of the pre-trained noise reduction network to obtain a third voiceprint feature vector, where the pre-trained noise reduction network includes an encoding part and other parts;
concatenate the first voiceprint feature vector, the second voiceprint feature vector, and the third voiceprint feature vector and input the result into the other parts of the pre-trained noise reduction network.
The non-volatile computer-readable storage medium may include a program storage area and a data storage area, where the program storage area may store the operating system and the application programs required for at least one function, and the data storage area may store data created by use of the real-time call voiceprint noise reduction apparatus, and the like. In addition, the non-volatile computer-readable storage medium may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium may optionally include memory located remotely from the processor, and the remote memory may be connected to the real-time call voiceprint noise reduction apparatus through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
An embodiment of the present invention further provides a computer program product, including a computer program stored on a non-volatile computer-readable storage medium, where the computer program includes program instructions that, when executed by a computer, cause the computer to perform any of the above real-time call voiceprint noise reduction methods.
Fig. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. As shown in Fig. 8, the device includes one or more processors 810 and a memory 820; one processor 810 is taken as an example in Fig. 8. The device for the real-time call voiceprint noise reduction method may further include an input apparatus 830 and an output apparatus 840. The processor 810, the memory 820, the input apparatus 830, and the output apparatus 840 may be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 8. The memory 820 is the above-mentioned non-volatile computer-readable storage medium. By running the non-volatile software programs, instructions, and modules stored in the memory 820, the processor 810 executes the various functional applications and data processing of the server, that is, implements the real-time call voiceprint noise reduction method of the above method embodiments. The input apparatus 830 can receive input numeric or character information and generate key signal inputs related to user settings and function control of the real-time call voiceprint noise reduction apparatus of the embodiment. The output apparatus 840 may include a display device such as a display screen.
The above product can execute the method provided by the embodiments of the present invention, and has the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in this embodiment, refer to the method provided by the embodiments of the present invention.
As an implementation, the above electronic device is applied to a real-time call voiceprint noise reduction apparatus and includes: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can:
obtain real-time call audio, a first voiceprint feature vector of the current speaker, and registration audio of the current speaker;
input the registration audio into a voiceprint feature extraction network trained synchronously with a pre-trained noise reduction network to obtain a second voiceprint feature vector;
input the real-time call audio into the encoding part of the pre-trained noise reduction network to obtain a third voiceprint feature vector, where the pre-trained noise reduction network includes an encoding part and other parts;
concatenate the first voiceprint feature vector, the second voiceprint feature vector, and the third voiceprint feature vector and input the result into the other parts of the pre-trained noise reduction network.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: such devices are characterized by mobile communication functions, with voice and data communication as their main goal. Such terminals include smartphones (e.g. the iPhone), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: such devices belong to the category of personal computers, have computing and processing functions, and generally also support mobile Internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.
(3) Portable entertainment devices: such devices can display and play multimedia content. They include audio and video players (e.g. the iPod), handheld game consoles, e-book readers, smart toys, and portable in-car navigation devices.
(4) Servers: devices that provide computing services. A server consists of a processor, hard disk, memory, system bus, and so on. Its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, the requirements on processing capability, stability, reliability, security, scalability, and manageability are higher.
(5) Other electronic apparatuses with data interaction functions.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the above technical solution, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the various embodiments or some parts of the embodiments.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310462873.3A CN116564329A (en) | 2023-04-26 | 2023-04-26 | Real-time conversation voiceprint noise reduction method, electronic equipment and storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310462873.3A CN116564329A (en) | 2023-04-26 | 2023-04-26 | Real-time conversation voiceprint noise reduction method, electronic equipment and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN116564329A true CN116564329A (en) | 2023-08-08 |
Family
ID=87495754
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310462873.3A Pending CN116564329A (en) | 2023-04-26 | 2023-04-26 | Real-time conversation voiceprint noise reduction method, electronic equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116564329A (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114255782A (en) * | 2021-12-21 | 2022-03-29 | 思必驰科技股份有限公司 | Speaker voice enhancement method, electronic device and storage medium |
| CN114255782B (en) * | 2021-12-21 | 2024-08-23 | 思必驰科技股份有限公司 | Speaker voice enhancement method, electronic device, and storage medium |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11894014B2 (en) | Audio-visual speech separation | |
| CN103391347B (en) | A kind of method and device of automatic recording | |
| CN106486131B (en) | Method and device for voice denoising | |
| CN107910014B (en) | Echo cancellation test method, device and test equipment | |
| EP4394761A1 (en) | Audio signal processing method and apparatus, electronic device, and storage medium | |
| CN113763977A (en) | Method, apparatus, computing device and storage medium for eliminating echo signal | |
| CN114255782B (en) | Speaker voice enhancement method, electronic device, and storage medium | |
| US9390725B2 (en) | Systems and methods for noise reduction using speech recognition and speech synthesis | |
| CN110956957A (en) | Training method and system of speech enhancement model | |
| US20140278417A1 (en) | Speaker-identification-assisted speech processing systems and methods | |
| CN104427068B (en) | A kind of audio communication method and device | |
| CN115668366A (en) | A method and system for acoustic echo cancellation | |
| CN111710344A (en) | A signal processing method, apparatus, device and computer-readable storage medium | |
| JP2024532748A (en) | Combined acoustic echo cancellation, speech enhancement, and voice separation for automatic speech recognition. | |
| WO2025031102A9 (en) | Method and apparatus for training speech enhancement network, and storage medium, device and product | |
| CN118899005B (en) | Audio signal processing method, device, computer equipment and storage medium | |
| CN118985025A (en) | General automatic speech recognition for joint acoustic echo cancellation, speech enhancement and speech separation | |
| CN113299306B (en) | Echo cancellation method, apparatus, electronic device, and computer-readable storage medium | |
| CN112331187B (en) | Multi-task speech recognition model training method and multi-task speech recognition method | |
| CN114220451A (en) | Audio noise canceling method, electronic device and storage medium | |
| CN116564329A (en) | Real-time conversation voiceprint noise reduction method, electronic equipment and storage medium | |
| WO2024017110A1 (en) | Voice noise reduction method, model training method, apparatus, device, medium, and product | |
| CN114758668B (en) | Speech enhancement model training method and speech enhancement method | |
| CN119889310B (en) | Methods, systems, and electronic devices for generating real-time audio based on dialogue content | |
| CN118301518A (en) | Voiceprint noise reduction method, electronic device and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||