
CN111310836B - A defense method and defense device for an integrated model of voiceprint recognition based on a spectrogram - Google Patents


Info

Publication number
CN111310836B
CN111310836B (Application CN202010105807.7A)
Authority
CN
China
Prior art keywords
voiceprint recognition
spectrogram
sample
model
integrated model
Prior art date
Legal status
Active
Application number
CN202010105807.7A
Other languages
Chinese (zh)
Other versions
CN111310836A (en)
Inventor
陈晋音
叶林辉
王雪柯
郑喆
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202010105807.7A
Publication of CN111310836A
Application granted
Publication of CN111310836B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/004: Artificial life, i.e. computing arrangements simulating life
    • G06N3/006: Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/04: Training, enrolment or model building
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/06: Decision making techniques; Pattern matching strategies

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a defense method for a spectrogram-based integrated voiceprint recognition model, comprising: (1) collecting audio files and converting each audio file into a spectrogram, which serves as a benign sample; (2) training multiple voiceprint recognition models on the benign samples to obtain multiple trained voiceprint recognition models; (3) selecting the better-performing trained models and integrating them with a voting mechanism to form a voiceprint recognition ensemble, then retraining the ensemble on the benign samples; (4) attacking the individual voiceprint recognition models with the cuckoo search algorithm to generate adversarial samples; (5) retraining the ensemble obtained in step (3) with both the adversarial and the benign samples to obtain an ensemble that resists attacks; (6) using the ensemble obtained in step (5) to perform defensive recognition on the spectrograms corresponding to audio files.

Description

A defense method and defense device for an integrated model of voiceprint recognition based on a spectrogram

Technical Field

The invention belongs to the field of information security research, and in particular relates to a defense method and defense device for a spectrogram-based integrated voiceprint recognition model.

Background Art

Because each person's vocal organs (tongue, teeth, lungs, and so on) differ greatly in size and shape, every person's voice is distinct, and so is its spectrogram; in effect, each person's voice carries unique identity information. Voiceprint recognition exploits this property of the voice to identify the speaker. Voiceprint recognition is a form of biometric identification and is divided into text-dependent and text-independent recognition. Text-independent voiceprint recognition places no requirements on the spoken content, so the speaker may speak freely. Text-dependent voiceprint recognition refers to a speaker recognition system that requires the user to pronounce previously specified content; once the user's utterance deviates from the prescribed text, the identity cannot be recognized, so its range of application is narrow. The text-independent model imposes no constraints on the spoken content and is convenient to use, so its range of application is wide, but it is harder to implement.

A deep neural network can fully exploit the correlations among speech features by merging the features of consecutive frames for training, which greatly improves the recognition rate of a voiceprint recognition system. While DNN-based voiceprint recognition systems improve recognition accuracy and bring convenience, they also introduce corresponding risks. Deep neural networks are vulnerable to adversarial attacks in the form of subtle perturbations added to the input data: after obtaining the features of a target speaker, an attacker can add a carefully computed perturbation to some speaker's audio so that the resulting adversarial sample is misidentified by the voiceprint recognition model as the target speaker. This poses a serious security risk to voiceprint recognition systems and to personal property.

Existing voiceprint recognition attack methods fall mainly into white-box and black-box attacks. In a white-box attack, the attacker knows the model's internal parameters, computes the gradient of the model with respect to the noise via backpropagation, and iteratively optimizes the noise to be added in order to generate adversarial samples. In a black-box attack, the attacker does not know the model parameters and instead uses optimization algorithms such as genetic algorithms or particle swarm optimization to optimize the perturbation to be added, thereby generating adversarial samples. Both kinds of attack can cause a voiceprint recognition system to misidentify an adversarial sample as the target speaker.

Summary of the Invention

Aiming at the security problems of current voiceprint recognition systems, namely low accuracy, poor robustness, and vulnerability to adversarial-sample attacks, the present invention provides a defense method and defense device for a spectrogram-based integrated voiceprint recognition model. The method and device improve the accuracy and robustness of voiceprint recognition, resist adversarial-sample attacks, and improve the security of voiceprint recognition.

The technical scheme of the present invention is as follows:

A defense method for a spectrogram-based integrated voiceprint recognition model, comprising the following steps:

(1) Collect audio files and convert each audio file into a spectrogram; the spectrogram serves as a benign sample.

(2) Train multiple image recognition models with the benign samples so that the image recognition models achieve voiceprint recognition, thereby obtaining multiple trained image-based voiceprint recognition models.

(3) Integrate the multiple trained image-based voiceprint recognition models from step (2) with a voting mechanism to form a voiceprint recognition ensemble, and retrain the ensemble with the benign samples.

(4) Attack the individual voiceprint recognition models with the cuckoo search algorithm to generate adversarial samples, and convert the adversarial samples into spectrograms, which serve as malicious samples.

(5) Retrain the image-based voiceprint recognition ensemble obtained in step (3) with the malicious and benign samples to obtain an ensemble capable of resisting attacks.

(6) Use the ensemble obtained in step (5) to perform defensive recognition on the spectrograms corresponding to audio files.

Preferably, the specific steps of converting an audio file into a spectrogram are:

Divide the audio into frames, apply a window to each frame of the speech signal, and perform a short-time Fourier transform.

Compute the power spectrum of the short-time Fourier transform result and normalize it to obtain a spectrogram; the spectrogram and the corresponding speaker together form a benign sample.

Preferably, the image recognition model is VGG16 or VGG19.

Preferably, the specific process of training multiple voiceprint recognition models with benign samples is:

Preprocess the spectrograms, resizing each to 224×224×3 to obtain spectrogram samples.

Let y_ipre be the confidence output by a voiceprint recognition model for spectrogram sample x_i. Cross entropy is used as the loss function L(x_i), which is minimized to optimize the parameters of the voiceprint recognition model:

L(x_i) = -[y_i·log(y_ipre) + (1 - y_i)·log(1 - y_ipre)]

Test the accuracy of the trained voiceprint recognition models on the spectrograms in the test set; when the recognition accuracy does not meet the requirement, retrain the model until it does.

The specific process of step (3) is:

Integrate the multiple image-based voiceprint recognition models with a voting mechanism to obtain a voiceprint recognition ensemble.

Before voting, convert the prediction confidence returned by each voiceprint recognition model into a predicted class: the class label with the highest confidence is taken as that model's prediction.

After each voiceprint recognition model has produced its prediction for a spectrogram sample, if some predicted class receives the votes of more than half of the models, that class is the prediction of the ensemble.

Then train the ensemble again with the benign samples and test it with the test set to further improve the voiceprint recognition ensemble.

A defense device for a spectrogram-based integrated voiceprint recognition model, comprising a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor; when the computer processor executes the computer program, the above defense method for the spectrogram-based integrated voiceprint recognition model is implemented.

In the present invention, in view of the possible defects of the above voiceprint recognition systems and the limitations of existing attack methods, speech is converted into spectrograms and the spectrograms are used to train image recognition models, so that they achieve voiceprint recognition. Multiple trained image recognition models are then integrated; this improves model accuracy while enabling this special voiceprint recognition model to resist adversarial samples, and adversarial training further improves the model's defense capability, realizing defense against both white-box and black-box attacks.

Brief Description of the Drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of the defense method for the spectrogram-based integrated voiceprint recognition model provided by the embodiment.

Fig. 2 is a schematic diagram of the generation of adversarial samples provided by the embodiment.

Fig. 3 is a schematic diagram of retraining the integrated voiceprint recognition model provided by the embodiment.

Detailed Description of the Embodiments

In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit its scope of protection.

Referring to Figs. 1 to 3, the defense method for the spectrogram-based integrated voiceprint recognition model provided by the embodiment comprises the following steps:

1) Prepare a data set for training the voiceprint recognition models, using the train-clean-100 subset of the LibriSpeech speech corpus. Each folder of train-clean-100 holds the audio of a different speaker, so one folder corresponds to one speaker and the folder name is effectively the label.

2) Preprocess the audio files in each folder, convert them into spectrograms, and save them in the corresponding folders; the folder name is the class label of the spectrogram, i.e., the identity of the speaker. Divide the data into a training set and a test set in a certain ratio. The specific process is as follows:

Step 1: Divide each audio file x(n) in the train-clean-100 data set into frames of length 25 ms; within this interval the speech signal is regarded as stationary. Apply a window function to the framed audio signal to avoid leakage of the high-frequency part of the signal. After framing and windowing, perform a short-time Fourier transform on the speech signal:

X(n,k) = Σ_m x(m)·w(n-m)·e^(-j2πkm/N)  (1)

where k ∈ {0, 1, …, N-1}, N is the number of sampling points contained in one frame of the audio file, and w(n-m) is a window function sliding along the time axis.

Step 2: From X(n,k), obtain the power spectrum

P(n,k) = |X(n,k)|^2  (2)

Step 3: Because the silent segments of speech contain a large amount of non-zero noise, the spectrogram is processed with max-min normalization, which makes the brightness and the brightness distribution corresponding to the spectrogram's mean and variance more uniform. The normalization formula is as follows:

G(a,b) = (P(a,b) - min(P)) / (max(P) - min(P))  (3)

In G(a,b), a denotes the time and b the frequency at time a; the magnitude of G(a,b) represents the energy of the audio component of frequency b at time a. A spectrogram can be drawn from G(a,b), using shades of the same color to represent the energy contained in the different frequency components at each moment.

Step 4: Store the generated spectrograms in folders according to the corresponding speakers; the folder name is the class label, i.e., the corresponding speaker. Divide the generated spectrogram data set into a training set and a test set in a certain ratio.
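The framing, windowing, power-spectrum, and normalization steps above can be sketched with NumPy alone. The 10 ms hop length, the Hann window, and the 16 kHz sampling rate are illustrative assumptions; the patent fixes only the 25 ms frame length:

```python
import numpy as np

def audio_to_spectrogram(x, sr=16000, frame_ms=25, hop_ms=10):
    """Steps 1-3: framing, windowing, STFT, power spectrum, max-min normalization."""
    frame_len = int(sr * frame_ms / 1000)            # 25 ms frames (Step 1)
    hop = int(sr * hop_ms / 1000)                    # hop length: an assumption
    window = np.hanning(frame_len)                   # window limits spectral leakage
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])
    X = np.fft.rfft(frames * window, axis=1)         # short-time Fourier transform
    P = np.abs(X) ** 2                               # power spectrum, Eq. (2)
    G = (P - P.min()) / (P.max() - P.min() + 1e-12)  # max-min normalization (Step 3)
    return G.T                                       # frequency x time image (Step 4)

# toy usage: one second of a 440 Hz tone at 16 kHz
t = np.arange(16000) / 16000.0
spec = audio_to_spectrogram(np.sin(2 * np.pi * 440 * t))
```

The returned array can then be saved as an image in the speaker's folder; a 440 Hz tone concentrates its energy in the frequency bin at 440 Hz, as expected.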

3) Train the spectrogram-based voiceprint recognition model: use the generated spectrograms to train a VGG16 model, with the folder name as the class label, so that image recognition achieves voiceprint recognition. After training, test with the test set so that the recognition accuracy meets the requirement; if it does not, continue training the model until it does. The specific steps are as follows:

Step 1: Preprocess the images, resizing each spectrogram to 224×224×3.

Step 2: Build the VGG16 model, an image recognition model based on a CNN structure with 13 convolutional layers and 3 fully connected layers.

Step 3: Set the relevant parameters and train. Let y_ipre be the confidence output by the VGG16 model for spectrogram sample x_i, and use cross entropy as the loss function:

L(x_i) = -[y_i·log(y_ipre) + (1 - y_i)·log(1 - y_ipre)]  (4)

where y_i denotes the true label.

Step 4: Test the accuracy of the recognition model on the test data set to ensure that it reaches the preset recognition accuracy; otherwise, modify the structure and parameters of the model and retrain.
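The cross-entropy loss of Eq. (4) can be checked in a few lines of NumPy; the probability values below are made up for illustration:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy of Eq. (4): -[y*log(y_pred) + (1-y)*log(1-y_pred)]."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# a confident correct prediction is cheap, a confident wrong one is expensive
print(cross_entropy(1.0, 0.9))  # small loss
print(cross_entropy(1.0, 0.1))  # large loss
```

Minimizing this quantity over the training spectrograms is what drives the model's confidence toward the true speaker label.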

4) Change the model structure and repeat step 3) to train multiple spectrogram-based voiceprint recognition models with different structures. After training, test each image recognition model with the test set so that its recognition accuracy meets the requirement; if it does not, change the model parameters and continue training until the accuracy of every model meets the requirement. Multiple spectrogram-based voiceprint recognition models are thus obtained.

5) Integrate the multiple spectrogram-based voiceprint recognition models obtained above. The integrated model contains several models with different structures, and the outputs of the individual models are combined by voting. The ensemble is then trained again to further improve its recognition accuracy and robustness. The specific steps are:

Step 1: Integrate the multiple spectrogram-based voiceprint recognition models obtained above, and adopt a voting mechanism after integration.

Step 2: Before voting, convert the prediction confidence returned by each voiceprint recognition model into a predicted class: the class label with the highest confidence is taken as that model's prediction.

Step 3: After each model has produced its final prediction for an input sample x, if some predicted class receives the votes of more than half of the models, that is, if for a spectrogram sample more than half of the ensemble's outputs are speaker A, then the audio corresponding to that spectrogram sample is considered to belong to speaker A.

Step 4: Train the voiceprint recognition ensemble again with the train-clean-100 data set and test it with the test set, further improving the recognition accuracy and the defense capability of the model.
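The voting mechanism amounts to an argmax per model followed by a strict-majority count; a minimal NumPy sketch, with five invented 3-class confidence vectors standing in for the models' outputs:

```python
import numpy as np

def ensemble_predict(confidences):
    """Majority vote: each row of `confidences` is one model's per-class
    confidence vector; returns the winning class, or None when no class
    gets more than half of the votes."""
    votes = np.argmax(confidences, axis=1)  # confidence -> predicted class
    counts = np.bincount(votes)
    winner = int(np.argmax(counts))
    if counts[winner] > len(votes) / 2:     # strict majority required
        return winner
    return None

# four of five models vote for class 2 ("speaker A") here? no: three of five do
conf = np.array([[0.1, 0.2, 0.7],
                 [0.2, 0.1, 0.7],
                 [0.1, 0.6, 0.3],
                 [0.0, 0.1, 0.9],
                 [0.5, 0.3, 0.2]])
winner = ensemble_predict(conf)  # class 2 wins 3 of 5 votes
```

Returning None when no class reaches a majority mirrors the patent's "more than half" condition; how such ties are resolved in practice is not specified in the source.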

6) Attack the spectrogram-based voiceprint recognition models: use the cuckoo search algorithm to attack each of the spectrogram-based voiceprint recognition models obtained in step 4), iteratively optimizing to find the optimal perturbation, which is superimposed on the original audio to generate adversarial samples. The specific steps are as follows:

Step 1: Initialize the fitness function, defined as follows:

f = [y_ti·log(y_ipre) + (1 - y_ti)·log(1 - y_advipre)] + c·||x_advi - x_i,0||_2  (5)

where x_advi denotes the adversarial sample, x_i,0 the original audio, y_ti the label of the target speaker, and y_advipre the output for the adversarial sample. The L2 term measures the difference between the adversarial sample and the original sample, and the parameter c controls the weight of this difference.

Step 2: Initialize the nests. Set the number of nests to G, initialize random perturbations of the same size as the original audio, and superimpose them on the original audio to form the initial adversarial samples, i.e., set the initial nests to:

X = x_1, x_2, …, x_G  (6)

Step 3: Obtain new nests by Lévy flight, i.e., obtain new adversarial samples by Lévy flight, with the update:

x_i = x_i + α·S·n  (7)

where α is the step-size scaling factor and n is an array of random numbers drawn from the standard normal distribution, with the same dimension as x_i. S is the step size:

S = u / |v|^(1/β)  (8)

where u and v are two variables obeying Gaussian distributions, β is a constant, and σ² is computed by:

σ² = { Γ(1+β)·sin(πβ/2) / [ Γ((1+β)/2)·β·2^((β-1)/2) ] }^(2/β)  (9)

where Γ(·) is the gamma function.

Step 4: Compute the fitness of each individual, denoted F = f_1, f_2, …, f_G, and find the optimal individual in the population, i.e., the individual with the smallest fitness value, denoted F_global. If the number of iterations reaches the set maximum, or the generated adversarial sample can be classified as the target class, stop iterating and output the adversarial sample. If these conditions are not met, repeat Steps 1-3 and continue the iterative optimization of the population. In this way, adversarial samples generated against the different models are obtained.
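Steps 1-4 can be sketched as a generic cuckoo search in NumPy. The Lévy step uses Mantegna's method; the fitness below is a stand-in quadratic rather than the model-based loss of Eq. (5), the nest count, β = 1.5, and step factor α are illustrative choices the patent does not fix, and the abandonment of worst nests found in standard cuckoo search is omitted for brevity:

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def levy_step(dim, beta=1.5):
    """Mantegna's method for a Levy-distributed step S = u / |v|**(1/beta)."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, dim)
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / beta)

def cuckoo_search(fitness, x0, G=15, alpha=0.01, max_iter=200):
    """Minimize `fitness` over perturbations of x0: G nests start as randomly
    perturbed copies of x0 (Step 2) and move by the Levy-flight update of
    Eq. (7) (Step 3); the fittest nest is returned (Step 4)."""
    nests = x0 + rng.normal(0.0, 0.1, (G, x0.size))  # initial candidates
    f = np.array([fitness(n) for n in nests])
    for _ in range(max_iter):
        for i in range(G):
            cand = nests[i] + alpha * levy_step(x0.size) * rng.normal(size=x0.size)
            fc = fitness(cand)
            if fc < f[i]:                            # keep a nest only if it improves
                nests[i], f[i] = cand, fc
    best = int(np.argmin(f))
    return nests[best], f[best]

# stand-in fitness: squared distance to a fixed "target" vector
target = np.ones(4)
best, fbest = cuckoo_search(lambda x: float(np.sum((x - target) ** 2)), np.zeros(4))
```

In the patent's setting, `fitness` would instead evaluate Eq. (5) on the attacked voiceprint model, and each nest would be a full-length audio perturbation.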

7) Adversarial training of the spectrogram-based integrated voiceprint recognition model: convert the adversarial samples generated in step 6) into spectrograms, add them to the training data set, and retrain the spectrogram-based ensemble, improving its recognition accuracy and defense capability and thereby the security and stability of the voiceprint recognition model.
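The retraining step amounts to merging the labeled adversarial spectrograms into the benign training set before fitting the ensemble again; a minimal sketch of that data preparation, where the array shapes and labels are illustrative and each adversarial sample keeps the speaker label of its original audio:

```python
import numpy as np

def build_adversarial_training_set(benign_x, benign_y, adv_x, adv_y):
    """Merge benign and adversarial spectrograms into one shuffled training set;
    the ensemble is then retrained on the combined data (adversarial training)."""
    x = np.concatenate([benign_x, adv_x], axis=0)
    y = np.concatenate([benign_y, adv_y], axis=0)
    order = np.random.default_rng(0).permutation(len(x))  # shuffle so batches mix both kinds
    return x[order], y[order]

# illustrative shapes: 8 benign and 4 adversarial 224x224x3 spectrograms
bx = np.zeros((8, 224, 224, 3)); by = np.zeros(8, dtype=int)
ax = np.ones((4, 224, 224, 3));  ay = np.ones(4, dtype=int)
x_train, y_train = build_adversarial_training_set(bx, by, ax, ay)
```

The combined set is then fed to the same training loop as in step 3), so the ensemble sees both clean and attacked spectrograms.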

The embodiment also provides a defense device for the spectrogram-based integrated voiceprint recognition model, comprising a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor; when the computer processor executes the computer program, the above defense method for the spectrogram-based integrated voiceprint recognition model is implemented.

Since the computer program stored in the defense device's memory is mainly used to implement the above defense method for the spectrogram-based integrated voiceprint recognition model, its effect corresponds to that of the method and is not repeated here.

Against possible white-box or black-box attacks on a voiceprint recognition system, the present invention converts the speech signal into a spectrogram, uses image recognition models to achieve voiceprint recognition, and integrates multiple image recognition models; this improves voiceprint recognition accuracy while providing a defense capability against adversarial samples, realizing defense against white-box and black-box attacks.

The specific embodiments described above explain the technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are only the most preferred embodiments of the present invention and are not intended to limit it; any modification, supplement, or equivalent replacement made within the scope of the principles of the present invention shall be included in its scope of protection.

Claims (4)

1. A defense method for a spectrogram-based integrated voiceprint recognition model, characterized by comprising the following steps:
(1) Collecting an audio file, and converting the audio file into a spectrogram, wherein the spectrogram is used as a benign sample;
(2) Training a plurality of image recognition models by utilizing benign samples so that the image recognition models achieve the effect of voiceprint recognition, thereby obtaining a plurality of trained image-based voiceprint recognition models;
(3) Integrating the plurality of trained image-based voiceprint recognition models of step (2) by adopting a voting mechanism to form a voiceprint recognition integrated model, and retraining the voiceprint recognition integrated model by utilizing benign samples, specifically comprising: integrating the plurality of voiceprint recognition models by utilizing the voting mechanism to obtain the voiceprint recognition integrated model; before voting, converting the prediction confidence returned by each voiceprint recognition model into a predicted category, namely using the category label corresponding to the highest confidence as the prediction result of that voiceprint recognition model; after each voiceprint recognition model obtains its prediction result for a spectrogram sample, if a certain predicted category obtains the votes of more than half of the voiceprint recognition models, that predicted category is the prediction result of the voiceprint recognition integrated model; and training the voiceprint recognition integrated model by using benign samples, and testing by using a test set, to improve the voiceprint recognition integrated model;
(4) Attacking each of the voiceprint recognition models with the cuckoo search algorithm to generate adversarial samples, and converting the adversarial samples into spectrograms that serve as malicious samples, specifically comprising:
(4-1) Initializing a fitness function, defined as:
f = [y_t,i·log(y_i,pre) + (1 − y_t,i)·log(1 − y_adv_i,pre)] + c·‖x_adv,i − x_i,0‖₂
where x_adv,i denotes the adversarial sample, x_i,0 the original audio, y_t,i the label of the target speaker, y_adv_i,pre the output for the adversarial sample, and y_i,pre the confidence output by the voiceprint recognition model; the difference between the adversarial sample and the original audio is measured by the L2 norm, and the magnitude of this difference is controlled by the parameter c;
(4-2) Initializing the nests: setting the number of nests to G, initializing random perturbations of the same size as the original audio, and superimposing them on the original audio to form the initial adversarial samples, i.e., the initial nests
X = {x_1, x_2, …, x_G};
(4-3) Obtaining new nests, i.e., new adversarial samples, by Lévy flight, with the update
x_i = x_i + α·S·n
where α is the step-size scale factor, n is an array of standard normally distributed random numbers with the same dimension as x_i, and S is the step size
S = u / |v|^(1/β)
where u and v are two Gaussian-distributed variables, u ~ N(0, σ²) and v ~ N(0, 1), β is a constant, and σ² is calculated from
σ² = {Γ(1+β)·sin(πβ/2) / [Γ((1+β)/2)·β·2^((β−1)/2)]}^(2/β)
where Γ(·) is the gamma function;
(4-4) Calculating the fitness of each individual, denoted F = {f_1, f_2, …, f_G}, and finding the optimal individual in the population, i.e., the individual with the smallest fitness value, denoted F_global; if the number of iterations reaches the set maximum, or the generated adversarial sample is classified into the target class, stopping the iteration and outputting the adversarial sample; if neither condition is met, repeating steps (4-1)–(4-3) and continuing to iteratively optimize the population, thereby obtaining the adversarial samples generated under the different voiceprint recognition models;
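The Lévy-flight update of steps (4-2)–(4-3) can be sketched as follows; Mantegna's algorithm is used to draw the step S, and the step-size factor α, the exponent β = 1.5, and the random seed are illustrative assumptions:

```python
import math
import numpy as np

def levy_step(dim, beta=1.5, rng=None):
    """Draw a Levy-flight step S of length `dim` via Mantegna's
    algorithm: S = u / |v|**(1/beta), u ~ N(0, sigma^2), v ~ N(0, 1)."""
    rng = rng or np.random.default_rng()
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
             ) ** (1 / beta)
    u = rng.normal(0.0, sigma, dim)
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / beta)

def update_nests(nests, alpha=0.01, beta=1.5, rng=None):
    """Step (4-3): move every nest x_i to x_i + alpha * S * n,
    with n an array of standard-normal random numbers."""
    rng = rng or np.random.default_rng()
    new = []
    for x in nests:
        S = levy_step(x.size, beta, rng)
        n = rng.normal(size=x.size)
        new.append(x + alpha * S * n)
    return new
```

In the full attack loop, each updated nest would be decoded back to audio, scored with the fitness of step (4-1), and the best nest kept until the stopping condition of step (4-4) is met.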
(5) Retraining the image-based voiceprint recognition integrated model obtained in step (3) with the malicious samples and benign samples to obtain a voiceprint recognition integrated model capable of resisting attacks;
(6) Performing defended recognition on the spectrogram corresponding to an audio file by using the voiceprint recognition integrated model obtained in step (5).
2. The spectrogram-based defense method for a voiceprint recognition integrated model according to claim 1, wherein converting an audio file into a spectrogram specifically comprises:
framing the audio, windowing each frame of the speech signal, and performing a short-time Fourier transform;
calculating the power spectrum of the short-time Fourier transform result and normalizing it to obtain the spectrogram, the spectrogram and the corresponding speaker forming a benign sample.
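The conversion in claim 2 can be sketched with NumPy alone; the Hamming window, the 400-sample frames and 160-sample hop (25 ms / 10 ms at 16 kHz), the log scaling, and the min-max normalization are illustrative assumptions, since the claim fixes none of these parameters:

```python
import numpy as np

def to_spectrogram(audio, frame_len=400, hop=160):
    """Frame the audio, window each frame, take the short-time Fourier
    transform, compute the power spectrum, and normalize (claim 2)."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # power spectrum
    log_power = 10.0 * np.log10(power + 1e-10)         # avoid log(0)
    lo, hi = log_power.min(), log_power.max()
    return (log_power - lo) / (hi - lo)                # normalize to [0, 1]
```

For 1600 samples of audio this yields an 8 × 201 matrix (8 frames, 201 one-sided frequency bins), which can then be saved as an image and fed to the image recognition models of step (2).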
3. The spectrogram-based defense method for a voiceprint recognition integrated model according to claim 1, wherein the image recognition model adopts VGG16 or VGG19.
4. A spectrogram-based defense device for a voiceprint recognition integrated model, comprising a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor, wherein the computer processor, when executing the computer program, implements the spectrogram-based defense method for a voiceprint recognition integrated model according to any one of claims 1 to 3.
CN202010105807.7A 2020-02-20 2020-02-20 A defense method and defense device for an integrated model of voiceprint recognition based on a spectrogram Active CN111310836B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010105807.7A CN111310836B (en) 2020-02-20 2020-02-20 A defense method and defense device for an integrated model of voiceprint recognition based on a spectrogram


Publications (2)

Publication Number Publication Date
CN111310836A CN111310836A (en) 2020-06-19
CN111310836B true CN111310836B (en) 2023-08-18

Family

ID=71162113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010105807.7A Active CN111310836B (en) 2020-02-20 2020-02-20 A defense method and defense device for an integrated model of voiceprint recognition based on a spectrogram

Country Status (1)

Country Link
CN (1) CN111310836B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112420072B (en) * 2021-01-25 2021-04-27 北京远鉴信息技术有限公司 Method and device for generating spectrogram, electronic equipment and storage medium
CN115346532B (en) * 2021-05-11 2025-08-26 中国移动通信集团有限公司 Optimization method, terminal device and storage medium of voiceprint recognition system
CN114708871B (en) * 2022-03-11 2025-10-24 支付宝(杭州)信息技术有限公司 Voiceprint recognition model training method, system, and voiceprint recognition method and system

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014074732A (en) * 2012-10-02 2014-04-24 Nippon Hoso Kyokai <Nhk> Voice recognition device, error correction model learning method and program
CN107154258A (en) * 2017-04-10 2017-09-12 哈尔滨工程大学 Method for recognizing sound-groove based on negatively correlated incremental learning
US9824692B1 (en) * 2016-09-12 2017-11-21 Pindrop Security, Inc. End-to-end speaker recognition using deep neural network
CN108446765A (en) * 2018-02-11 2018-08-24 浙江工业大学 The multi-model composite defense method of sexual assault is fought towards deep learning
CN108711436A (en) * 2018-05-17 2018-10-26 哈尔滨工业大学 Speaker verification's system Replay Attack detection method based on high frequency and bottleneck characteristic
CN109801636A (en) * 2019-01-29 2019-05-24 北京猎户星空科技有限公司 Training method, device, electronic equipment and the storage medium of Application on Voiceprint Recognition model
CN110610708A (en) * 2019-08-31 2019-12-24 浙江工业大学 A voiceprint recognition attack defense method based on cuckoo search algorithm
CN110728993A (en) * 2019-10-29 2020-01-24 维沃移动通信有限公司 Voice-changing recognition method and electronic device
CN110767216A (en) * 2019-09-10 2020-02-07 浙江工业大学 Voice recognition attack defense method based on PSO algorithm
CN110808033A (en) * 2019-09-25 2020-02-18 武汉科技大学 An Audio Classification Method Based on Double Data Augmentation Strategy

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10657259B2 (en) * 2017-11-01 2020-05-19 International Business Machines Corporation Protecting cognitive systems from gradient based attacks through the use of deceiving gradients
CN108900725B (en) * 2018-05-29 2020-05-29 平安科技(深圳)有限公司 Voiceprint recognition method and device, terminal equipment and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jinyin Chen et al., "Can Adversarial Network Attack be Defended?", arXiv, 2019, full text. *

Also Published As

Publication number Publication date
CN111310836A (en) 2020-06-19

Similar Documents

Publication Publication Date Title
Chen et al. Towards understanding and mitigating audio adversarial examples for speaker recognition
CN110610708B (en) A voiceprint recognition attack defense method based on cuckoo search algorithm
CN113571067B (en) Voiceprint recognition countermeasure sample generation method based on boundary attack
CN113646833B (en) Speech adversarial sample detection method, device, equipment and computer-readable storage medium
Yu et al. Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features
US9368110B1 (en) Method for distinguishing components of an acoustic signal
CN111261147B (en) A Defense Method for Music Embedding Attacks for Speech Recognition Systems
WO2018107810A1 (en) Voiceprint recognition method and apparatus, and electronic device and medium
CN108319666A (en) A kind of electric service appraisal procedure based on multi-modal the analysis of public opinion
CN111310836B (en) A defense method and defense device for an integrated model of voiceprint recognition based on a spectrogram
Wang et al. Query-efficient adversarial attack with low perturbation against end-to-end speech recognition systems
CN114038469B (en) A speaker recognition method based on multi-class spectrogram feature attention fusion network
CN105261367A (en) Identification method of speaker
Xue et al. Cross-modal information fusion for voice spoofing detection
CN112735435A (en) Voiceprint open set identification method with unknown class internal division capability
Esmaeilpour et al. Class-conditional defense GAN against end-to-end speech attacks
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
CN119964600B (en) Voice recorder keyword sound recognition method, device and equipment
Hsu et al. Local wavelet acoustic pattern: A novel time–frequency descriptor for birdsong recognition
Zhang et al. A highly stealthy adaptive decay attack against speaker recognition
Jahangir et al. Spectrogram Features-Based Automatic Speaker Identification For Smart Services
Magazine et al. Fake speech detection using modulation spectrogram
Hassan et al. Enhancing speaker identification through reverberation modeling and cancelable techniques using ANNs
CN118230722B (en) Intelligent voice recognition method and system based on AI
Jiang et al. Research on voiceprint recognition of camouflage voice based on deep belief network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant