
CN116778910A - A voice detection method - Google Patents

A voice detection method

Info

Publication number
CN116778910A
CN116778910A (application CN202310505872.2A)
Authority
CN
China
Prior art keywords
features
feature
speech
sound source
characteristic
Prior art date
Legal status
Pending
Application number
CN202310505872.2A
Other languages
Chinese (zh)
Inventor
张鹏远
张震
陆镜泽
孙旭东
王文超
刘睿霖
王丽
杜金浩
陈树丽
计哲
Current Assignee
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Priority date
Filing date
Publication date
Application filed by Institute of Acoustics CAS and National Computer Network and Information Security Management Center
Priority: CN202310505872.2A
Publication: CN116778910A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/2135 Feature extraction based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

This application provides a speech detection method, comprising: acquiring a target speech and preprocessing it, the preprocessing including pre-emphasis, framing, and windowing; determining first vocal tract features, a first sound source wave feature, and a plurality of first related features of the preprocessed target speech; determining a first principal component feature based on the first vocal tract features, the first sound source wave feature, and the plurality of first related features; and inputting the first principal component feature into a trained classifier to output a classification result, the classification result being forged speech or natural speech. This application exploits the trace information that forged speech leaves at the fundamental frequency, and uses the differences between forged and natural speech in sound source and vocal tract characteristics to detect forged speech. Principal component analysis is used to screen the sound source and vocal tract features, selecting principal components with high correlation as features, which reduces feature dimensionality and redundancy and improves the generalization ability and efficiency of the model.

Description

A voice detection method

Technical Field

The present application relates to the field of speech detection, and in particular to a speech detection method.

Background Art

With continuous technological progress, speech technologies such as speech recognition and speech synthesis have been widely applied. With the rapid development of deep learning, artificial intelligence techniques have been introduced into many tasks in the speech field to improve performance. However, the development of speech technology has also introduced challenges. To counter the serious threat of voice spoofing attacks, the development of forged speech detection systems has attracted much attention in recent years. Although many forged speech detection methods have been proposed, only a few have been deployed. Existing forged speech detection systems are not designed around the characteristics of forged speech, which leads to low model generalization and makes it difficult for them to gain people's trust.

Current forged speech detection (anti-spoofing) systems rely too heavily on traditional hand-crafted features and on the classification performance of deep neural networks, and are not designed around the characteristics of forged speech. As a result, the models are tightly coupled to specific datasets, generalize poorly in real application scenarios, and are difficult to deploy in practice.

Summary of the Invention

In a first aspect, an embodiment of the present application provides a speech detection method. The method includes: acquiring a target speech and preprocessing it, the preprocessing including pre-emphasis, framing, and windowing; determining a plurality of first vocal tract features of the preprocessed target speech; determining a first sound source wave feature of the preprocessed target speech, the first sound source wave feature being extracted through an inverse filter; determining a first principal component feature based on the plurality of first vocal tract features and the first sound source wave feature; and inputting the first principal component feature into a trained classifier to output a classification result, the classification result being forged speech or natural speech.

Thus, the speech detection method proposed in the embodiments of the present application exploits the trace information that forged speech leaves at the fundamental frequency, and uses the differences in sound source and vocal tract characteristics between forged speech and natural speech, which arise from their different generation processes, to detect forged speech. At the same time, principal component analysis is used to screen the sound source and vocal tract features, selecting principal components with high correlation as features, which reduces feature dimensionality and redundancy and improves the generalization ability and efficiency of the model.

In some implementable embodiments, the plurality of first vocal tract features include the amplitude-frequency characteristic of the vocal tract filter and the amplitude-frequency characteristic of the inverse filter. Determining the plurality of first vocal tract features of the preprocessed target speech includes: using linear predictive coding to predict the filter parameters of each frame of the preprocessed target speech; and computing the amplitude-frequency characteristic of the vocal tract filter and the amplitude-frequency characteristic of the inverse filter from the filter parameters of each frame.

Thus, the embodiments of the present application introduce a variety of vocal-tract-related features to strengthen the generalization ability of the classifier.

In some implementable embodiments, the plurality of first vocal tract features further include a fundamental frequency feature, a fundamental frequency perturbation (jitter) feature, an amplitude perturbation (shimmer) feature, and Mel cepstral coefficients.

Thus, the embodiments of the present application introduce a variety of vocal-tract-related features to strengthen the generalization ability of the classifier.

In some implementable embodiments, determining the first sound source wave feature of the preprocessed target speech includes: obtaining the short-time Fourier features of the framed target speech; and inverse-filtering the short-time Fourier features with the inverse filter to obtain the first sound source wave feature of the preprocessed target speech.

Thus, the embodiments of the present application use inverse filtering to separate sound source features from vocal tract features, so that the classifier can make its decision using the distinctive features introduced by the different generation mechanisms of forged and natural speech. The available information is used more fully, while no excessive data preprocessing steps are introduced.

In some implementable embodiments, determining the first principal component feature based on the plurality of first vocal tract features and the first sound source wave feature includes: concatenating the plurality of first vocal tract features and the first sound source wave feature to obtain a first concatenated feature, the number of features in the first concatenated feature being n; de-centering each feature in the first concatenated feature; computing the covariance matrix of the de-centered first concatenated feature; performing eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues and corresponding eigenvectors; selecting, in descending order of the eigenvalues, the top k eigenvectors to form a transformation matrix, where k < n; and multiplying the first concatenated feature by the transformation matrix to obtain the first principal component feature.

Thus, the embodiments of the present application introduce principal component analysis during feature concatenation to achieve dimensionality reduction, selecting highly correlated features that contribute more to the task and reducing redundancy to improve efficiency.

In some implementable embodiments, the method further includes a step of training the classifier: obtaining labeled training speech from a training set and preprocessing it, the preprocessing including pre-emphasis, framing, and windowing; determining a plurality of second vocal tract features of the preprocessed training speech; determining a second sound source wave feature of the preprocessed training speech, the second sound source wave feature being extracted through the vocal tract inverse filter; determining a second principal component feature based on the plurality of second vocal tract features and the second sound source wave feature; and inputting the second principal component feature into the classifier for iterative training, obtaining the trained classifier when the loss function converges.

Thus, the classifier trained in the embodiments of the present application has a simple structure and good portability. Other beneficial effects are as described above and are not repeated here.

In some implementable embodiments, the plurality of second vocal tract features include the amplitude-frequency characteristic of the vocal tract filter and the amplitude-frequency characteristic of the inverse filter. Determining the plurality of second vocal tract features of the preprocessed training speech includes: using linear predictive coding to predict the filter parameters of each frame of the preprocessed training speech; and computing the amplitude-frequency characteristic of the vocal tract filter and the amplitude-frequency characteristic of the inverse filter from the filter parameters of each frame of the training speech.

In some implementable embodiments, the plurality of second vocal tract features further include a fundamental frequency feature, a fundamental frequency perturbation feature, an amplitude perturbation feature, and Mel cepstral coefficients.

In some implementable embodiments, determining the second sound source wave feature of the preprocessed training speech includes: obtaining the short-time Fourier features of the framed training speech; and inverse-filtering the short-time Fourier features with the inverse filter to obtain the second sound source wave feature of the preprocessed training speech.

In some implementable embodiments, determining the second principal component feature based on the plurality of second vocal tract features and the second sound source wave feature includes: concatenating the plurality of second vocal tract features and the second sound source wave feature to obtain a second concatenated feature, the number of features in the second concatenated feature being n; de-centering each feature in the second concatenated feature; computing the covariance matrix of the de-centered second concatenated feature; performing eigenvalue decomposition on the covariance matrix to obtain n eigenvalues and corresponding eigenvectors; selecting, in descending order of the eigenvalues, the top k eigenvectors to form a transformation matrix, where k < n; and multiplying the second concatenated feature by the transformation matrix to obtain the second principal component feature.

In a second aspect, an embodiment of the present application provides an electronic device, including: at least one memory for storing a program;

and at least one processor for executing the program stored in the memory, where, when the program stored in the memory is executed, the processor is configured to perform the method according to any one of the first aspect.

In a third aspect, an embodiment of the present application provides a computer storage medium storing instructions which, when run on a computer, cause the computer to perform the method provided in the first aspect.

Brief Description of the Drawings

In order to more clearly explain the technical solutions of the embodiments disclosed in this specification, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below correspond only to some embodiments disclosed in this specification; those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

The drawings required in the description of the embodiments or the prior art are briefly introduced below.

Figure 1 is a system architecture diagram of a speech detection method provided by an embodiment of the present application;

Figure 2 is a flow chart of a speech detection method provided by an embodiment of the present application;

Figure 3 is a training flow chart of a speech detection method provided by an embodiment of the present application.

Detailed Description

In order to make the purpose, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings.

In the description of the embodiments of the present application, words such as "exemplary", "for example", or "for instance" are used to indicate an example, illustration, or explanation. Any embodiment or design described as "exemplary", "for example", or "for instance" in the embodiments of the present application should not be construed as preferred or advantageous over other embodiments or designs. Rather, these words are intended to present the relevant concepts in a concrete manner.

In the description of the embodiments of the present application, the term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may mean: A exists alone, B exists alone, or both A and B exist. In addition, unless otherwise stated, the term "plurality" means two or more; for example, a plurality of systems refers to two or more systems, and a plurality of terminals refers to two or more terminals.

In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the indicated technical features. Therefore, a feature defined with "first" or "second" may explicitly or implicitly include one or more such features. The terms "include", "comprise", "have" and their variants all mean "including but not limited to", unless otherwise specifically emphasized.

The description of the embodiments of the present application refers to "some embodiments", which describes a subset of all possible embodiments; it should be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments and may be combined with each other without conflict.

In the description of the embodiments of the present application, the terms "first/second/third, etc." or module A, module B, module C, etc. are only used to distinguish similar objects and do not represent a specific ordering of the objects. It is understood that, where permitted, a specific order or sequence may be interchanged so that the embodiments described herein can be implemented in an order other than that illustrated or described herein.

In the description of the embodiments of the present application, the step labels, such as S110, S120, etc., do not necessarily mean that the steps must be executed in that order; where permitted, the order of the steps may be interchanged, or steps may be executed simultaneously.

Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field to which this application belongs. The terms used herein are only for the purpose of describing the embodiments of the present application and are not intended to limit the present application.

Figure 1 is a system architecture diagram of a speech detection method provided by an embodiment of the present application. As shown in Figure 1, the speech preprocessing module 11 first preprocesses the target speech, which may be acquired real or forged speech; the preprocessing includes pre-emphasis, framing, and windowing. The vocal tract feature extraction module 121 uses linear predictive coding (LPC) to extract the coefficients of the vocal tract filter for each frame of the speech signal in the target speech and, based on these coefficients, estimates the amplitude-frequency characteristic of the vocal tract filter of each frame and the amplitude-frequency characteristic of its inverse filter, thereby obtaining the vocal tract features. The sound source wave feature extraction module 122 computes the short-time Fourier transform (STFT) spectrum of each preprocessed frame and inverse-filters the STFT spectrum with the vocal tract inverse filter to obtain the sound source wave features. The sound source and vocal tract related feature extraction module 123 computes, for each preprocessed frame, related features such as the fundamental frequency feature, the fundamental frequency perturbation feature, and the Mel cepstral coefficients (MFCC). The principal component determination module 13 concatenates the vocal tract features, the sound source wave features, the fundamental frequency features, the fundamental frequency perturbation features, the Mel cepstral coefficients (MFCC), and other related features, and uses principal component analysis (PCA) to remove redundant features, obtaining principal component features with high correlation. The classification module 14 inputs the principal component features into a classifier for binary classification and outputs the result of natural speech or forged speech. The embodiments of the present application provide a speech detection method that, for speech whose authenticity is difficult to determine, makes full use of the differences at the sound source and the vocal tract to achieve forged speech recognition that generalizes in real usage scenarios, and that uses the feature selection ability of principal component analysis to reduce redundancy and improve efficiency.

Figure 2 is a schematic diagram of a speech detection method provided by an embodiment of the present application. As shown in Figure 2, the speech detection method includes: S11, acquiring a target speech and preprocessing it, the preprocessing including pre-emphasis, framing, and windowing; S12, determining the first vocal tract features, the first sound source wave feature, and a plurality of first related features of the preprocessed target speech; S14, determining a first principal component feature based on the first vocal tract features, the first sound source wave feature, and the plurality of first related features; S15, inputting the first principal component feature into a trained classifier and outputting a classification result, the classification result being forged speech or natural speech.

Each step of the speech detection method provided by this application is described in detail below with reference to the embodiments.

S11: acquiring a target speech and preprocessing it, the preprocessing including pre-emphasis, framing, and windowing.

Since the high-frequency part of speech usually has a smaller amplitude than the lower-frequency part, pre-emphasis balances the spectrum, avoids numerical problems during the Fourier transform, improves the signal-to-noise ratio (SNR), removes the effects of the vocal cords and lips during phonation, compensates for the high-frequency components of the speech signal that are suppressed by the articulatory system, and emphasizes the high-frequency formants.

In the embodiments of the present application, a high-pass filter may be used to pre-emphasize the target speech. The pre-emphasized speech signal y(n) is:

y(n) = x(n) - 0.79·x(n-1)    (1)

where x(n) denotes the n-th sample of the target speech signal and x(n-1) denotes the previous sample.
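
By way of illustration, a minimal Python sketch of this pre-emphasis step (not part of the original disclosure; the coefficient 0.79 follows equation (1), whereas many toolkits default to values such as 0.95-0.97):

```python
import numpy as np

def pre_emphasis(x: np.ndarray, alpha: float = 0.79) -> np.ndarray:
    """Apply the first-order high-pass filter y(n) = x(n) - alpha * x(n-1)."""
    y = np.empty_like(x, dtype=np.float64)
    y[0] = x[0]                      # first sample has no predecessor
    y[1:] = x[1:] - alpha * x[:-1]
    return y
```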

In most cases, the speech signal is non-stationary, and applying the Fourier transform to the entire signal is not meaningful. Therefore, the Fourier transform can be performed on short-time frames, and a good approximation of the time-frequency transform of the signal can be obtained by concatenating adjacent frames.

In the embodiments of the present application, after pre-emphasis, the target speech is divided into multiple short-time frames, each of which is a short-time stationary speech signal.

For example, the frame length may be set to 25 ms and the frame shift to 10 ms; the original speech is divided into multiple speech frames for computing short-time time-frequency features.

After the target speech is divided into multiple short-time frames, each frame is multiplied by a window function to increase the continuity between the left and right ends of the frame and to reduce spectral leakage. The window function may be a Hamming window, whose formula w(n) (the standard Hamming window) is:

w(n) = 0.54 - 0.46·cos(2πn/(N-1)),  0 ≤ n ≤ N-1

where N denotes the number of sample points in the window.
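
A minimal sketch of framing and windowing (illustrative only; it assumes the signal is at least one frame long, and uses the 25 ms / 10 ms values quoted above):

```python
import numpy as np

def frame_and_window(y: np.ndarray, sr: int,
                     frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Split a pre-emphasized signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(y) - frame_len) // hop_len        # assumes len(y) >= frame_len
    window = np.hamming(frame_len)                        # 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.stack([
        y[i * hop_len: i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    return frames                                         # shape: (n_frames, frame_len)
```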

S12: determining the first vocal tract features, the first sound source wave feature, and a plurality of first related features of the preprocessed target speech.

The first vocal tract features, the first sound source wave feature, and the various first related features of the target speech are introduced separately below.

S121: determining the first vocal tract features of the preprocessed target speech.

The first vocal tract features include the amplitude-frequency characteristic of the vocal tract filter and the amplitude-frequency characteristic of the inverse filter. Linear predictive coding (LPC) may be used to predict the vocal tract filter parameters of each frame of the preprocessed target speech; the amplitude-frequency characteristic of the vocal tract filter of each frame of the target speech, as well as the amplitude-frequency characteristic of the inverse filter, are then computed from the filter parameters of each frame.

In the embodiments of the present application, the zeros and poles of the vocal tract filter of each speech frame under the z-transform can be computed from the linear predictive coding (LPC) coefficients, and the amplitude-frequency characteristic of the vocal tract filter is determined from these zeros and poles. Further, for each frame of the target speech, the zeros and poles of the vocal tract filter are swapped to obtain the vocal tract inverse filter, and the amplitude-frequency characteristic of the inverse filter is computed from the swapped zeros and poles.
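
A rough Python sketch of this per-frame computation (the LPC order of 16 and the 257 frequency bins are illustrative assumptions, not values from the patent):

```python
import numpy as np
import librosa
from scipy.signal import freqz

def vocal_tract_features(frame: np.ndarray, order: int = 16, n_freq: int = 257):
    """Estimate per-frame LPC coefficients and the magnitude responses of the
    all-pole vocal-tract filter H(z) = 1/A(z) and of its inverse A(z)."""
    a = librosa.lpc(frame.astype(np.float64), order=order)   # A(z) coefficients, a[0] == 1
    # Vocal-tract filter: numerator 1, denominator A(z)
    _, h_tract = freqz(b=[1.0], a=a, worN=n_freq)
    # Inverse filter: poles and zeros swapped, i.e. numerator A(z), denominator 1
    _, h_inv = freqz(b=a, a=[1.0], worN=n_freq)
    return np.abs(h_tract), np.abs(h_inv)
```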

S122: determining the first sound source wave feature of the preprocessed target speech. The first sound source wave feature is the sound source wave feature obtained by multiplying the amplitude-frequency characteristic of the inverse filter with the original speech in the frequency domain, thereby filtering out the vocal tract characteristics.

In the embodiments of the present application, the short-time Fourier features of each frame of the target speech are obtained, and the short-time Fourier features are inverse-filtered through the inverse filter to obtain the first sound source wave feature.
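
A simple sketch of the frequency-domain inverse filtering described above (a 512-point FFT is an assumption; `inv_mag` is the per-frame inverse-filter magnitude response from the previous sketch, evaluated on the same frequency grid):

```python
import numpy as np

def source_wave_feature(frames: np.ndarray, inv_mag: np.ndarray,
                        n_fft: int = 512) -> np.ndarray:
    """Multiply each frame's magnitude spectrum by the inverse-filter magnitude
    response, removing the vocal-tract envelope and leaving the source wave.
    frames: (n_frames, frame_len); inv_mag: (n_frames, n_fft // 2 + 1)."""
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=-1))   # per-frame magnitude spectrum
    return spec * inv_mag                                   # vocal tract removed
```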

S123: determining a plurality of first related features of the preprocessed target speech, including the Mel cepstral coefficients, a fundamental frequency feature, a fundamental frequency perturbation feature, and an amplitude perturbation feature.

For the Mel cepstral coefficients, a Mel filter bank is used to filter the short-time Fourier spectrum of each frame of the target speech, yielding the Mel spectrum feature (fbank).

Specifically, each frame of the target speech is windowed and analyzed in time and frequency via the short-time Fourier transform, and the logarithmic amplitude spectrum is then taken to obtain the spectrogram. The short-time Fourier transform is defined (in its standard form) as:

STFT(t, f) = ∫ x(τ)·h(τ - t)·e^(-j2πfτ) dτ

where x(τ) is a single-frame speech signal, h(τ - t) is the analysis window function, and τ is the offset.

The spectrogram is filtered through an 80-dimensional Mel filter bank to obtain the Mel spectrum features.
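
A minimal librosa-based sketch of the fbank computation (illustrative; the 25 ms / 10 ms framing and the 80 Mel bands follow the text, everything else is an assumption):

```python
import numpy as np
import librosa

def fbank_features(y: np.ndarray, sr: int, n_mels: int = 80) -> np.ndarray:
    """Log-spectrogram filtered through an 80-band Mel filter bank (fbank)."""
    n_fft, hop = int(0.025 * sr), int(0.010 * sr)
    stft = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop, window="hamming"))
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return np.log(mel_fb @ stft + 1e-10)                 # shape: (n_mels, n_frames)
```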

The Mel filter bank mimics the human auditory perception system, whose sensitivity differs for signals of different frequencies. The Mel filters are not uniformly distributed along the frequency axis: they are densely distributed in the low-frequency region and sparsely distributed in the high-frequency region. The Mel filter bank can therefore simulate the non-linear perception of sound by the human ear, being more discriminative at lower frequencies and offering lower resolution at higher frequencies. The conversion formula between the Mel frequency f_mel and the linear frequency f (in its standard form) is:

f_mel = 2595·log10(1 + f/700)

From this formula the Mel filter bank can be computed; each filter in the filter bank is multiplied and accumulated with the short-time Fourier spectrum (STFT) in turn to obtain the Mel filter bank features. Performing cepstral analysis on the computed Mel filter bank features yields the final Mel cepstral (MFCC) features.

The purpose of cepstral analysis is to extract the envelope of the spectrum; in the time-frequency features of speech this corresponds to extracting the formant features, that is, the vocal tract characteristics. The discrete cosine transform (DCT) can be applied to the fbank features for cepstral analysis, yielding the Mel cepstral coefficients MFCC_i (standard DCT form):

MFCC_i = Σ_{j=1..N} log(S_j)·cos(i·(j - 0.5)·π / N)

where S_j is the output value of the j-th filter in the fbank, MFCC_i is the Mel cepstral coefficient for the i-th Mel filter, and N is the number of Mel filters.
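
A one-call sketch of the fbank-then-DCT pipeline using librosa (the choice of 20 cepstral coefficients is an assumption; the 80 Mel bands and 25 ms / 10 ms framing follow the text):

```python
import numpy as np
import librosa

def mfcc_features(y: np.ndarray, sr: int) -> np.ndarray:
    """MFCCs via an 80-band Mel filter bank followed by a DCT."""
    return librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=20, n_mels=80,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
        window="hamming",
    )                                                     # shape: (n_mfcc, n_frames)
```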

The fundamental frequency feature F0 is obtained by computing the pitch period of each frame of the signal.

The fundamental frequency perturbation (jitter) feature characterizes the aperiodic variation of the fundamental frequency. Jitter is a measure of the variation of the fundamental frequency between adjacent periods and is expressed as a percentage (%). In the formula for jitter, F0 is the fundamental frequency of each frame of speech and J is the number of pitch periods in each frame.

The amplitude perturbation (shimmer) feature characterizes the aperiodic variation of the amplitude. Shimmer is a measure of the variation in amplitude between adjacent periods of the speech signal and is expressed in decibels (dB). In the formula for shimmer, A is the peak-to-peak amplitude of each pitch period and L is the number of amplitude values.
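
A short sketch of commonly used jitter and shimmer definitions (the patent's exact formulas are not reproduced here, so these Praat-style variants are assumptions given the units stated above):

```python
import numpy as np

def jitter_percent(periods: np.ndarray) -> float:
    """Local jitter (%): mean absolute difference between consecutive pitch
    periods, normalised by the mean period."""
    return 100.0 * np.abs(np.diff(periods)).mean() / periods.mean()

def shimmer_db(amplitudes: np.ndarray) -> float:
    """Shimmer (dB): mean absolute dB ratio between consecutive peak-to-peak
    amplitudes of adjacent pitch periods."""
    ratios = 20.0 * np.log10(amplitudes[1:] / amplitudes[:-1])
    return np.abs(ratios).mean()
```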

The above first vocal tract features, first sound source feature, and various first related features all characterize the sound source and vocal tract information of the speech; this feature information reflects the differences between forged speech and real speech in their generation processes.

In the embodiments of the present application, a variety of related features are obtained during the speech feature extraction process. However, too many features contain redundant information, which affects the generalization and computation speed of the model. Therefore, the features need to be screened to obtain features with higher relevance to the speech anti-spoofing task.

S13: determining the first principal component feature based on the first vocal tract features, the first sound source wave feature, and the plurality of first related features.

The embodiments of the present application use principal component analysis (PCA) to screen the plurality of first vocal tract features and the first sound source wave feature, including the following steps:

S131: concatenating the first vocal tract features, the first sound source wave feature, and the plurality of first related features to obtain a first concatenated feature; the number of features in the concatenated feature is n.

In the embodiments of the present application, the n features of each frame of the target speech, including the amplitude-frequency characteristic of the vocal tract filter, the amplitude-frequency characteristic of the inverse filter, the fundamental frequency feature, the fundamental frequency perturbation feature, the amplitude perturbation feature, the Mel cepstral coefficients, and the first sound source wave feature, are concatenated to obtain the first concatenated feature.

S132: de-centering each feature in the concatenated feature.

In the embodiments of the present application, de-centering includes subtracting, from each feature in the first concatenated feature, the mean of that dimension.

S133: computing the covariance matrix based on the de-centered first concatenated feature.

S134: performing eigenvalue decomposition on the covariance matrix to obtain a plurality of first eigenvalues and corresponding first eigenvectors.

S135: selecting, in descending order of the eigenvalues, the top k eigenvectors to form a first transformation matrix, where k is the number of features after dimensionality reduction and k < n.

S136: multiplying the first concatenated feature by the first transformation matrix to obtain the first principal component feature.

Here, the first principal component feature is the high-correlation feature obtained after dimensionality reduction.

In the embodiments of the present application, the principal component analysis process projects high-dimensional features onto low-dimensional features while retaining, to the greatest extent possible, the features with high relevance to the task.
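
A compact NumPy sketch of steps S131-S136 (illustrative only; `features` is the per-frame concatenated feature matrix and k is chosen by the user):

```python
import numpy as np

def pca_transform(features: np.ndarray, k: int):
    """PCA as in S131-S136: mean-centre, covariance, eigendecomposition,
    keep the top-k eigenvectors, project. features: (n_samples, n)."""
    centred = features - features.mean(axis=0)          # S132: de-centre each feature
    cov = np.cov(centred, rowvar=False)                  # S133: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)               # S134: eigen-decomposition
    order = np.argsort(eigvals)[::-1]                    # S135: eigenvalues descending
    transform = eigvecs[:, order[:k]]                    # top-k eigenvectors (n, k)
    return centred @ transform, transform                # S136: projected features
```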

S14: inputting the first principal component feature into the trained classifier and outputting a classification result, the classification result being forged speech or natural speech.

The speech detection method provided by the embodiments of the present application further includes a step of training the classifier.

Figure 3 is a flow chart of training the classifier in the speech detection method provided by an embodiment of the present application. As shown in Figure 3, the training process of the classifier includes: S31, obtaining labeled training speech from the training set and preprocessing it, the preprocessing including pre-emphasis, framing, and windowing; S32, determining the second vocal tract features, the second sound source wave feature, and a plurality of second related features of the preprocessed training speech; S34, determining a second principal component feature based on the second vocal tract features, the second sound source wave feature, and the plurality of second related features of the training speech; S35, inputting the second principal component feature into the classifier for iterative training, and obtaining the trained classifier when the loss function converges.

Each step of training the classifier is described in detail below.

S31: obtaining labeled training speech from the training set and preprocessing it, the preprocessing including pre-emphasis, framing, and windowing.

For the implementation of pre-emphasis, framing, and windowing, reference may be made to step S11, which is not repeated here.

S32: determining the second vocal tract features, the second sound source wave feature, and a plurality of second related features of the preprocessed training speech.

The second vocal tract features, the second sound source wave feature, and the plurality of second related features of the training speech are introduced separately below.

S321: determining the second vocal tract features of the preprocessed training speech.

The second vocal tract features include the amplitude-frequency characteristic of the vocal tract filter of the training speech and the amplitude-frequency characteristic of the inverse filter. The linear predictive coding (LPC) method may be used to predict the filter parameters of each frame of the preprocessed training speech; the amplitude-frequency characteristic of the vocal tract filter of each frame of the training speech, as well as the amplitude-frequency characteristic of the inverse filter, are computed from the filter parameters of each frame of the training speech.

For the relationship between the amplitude-frequency characteristic of the inverse filter and that of the vocal tract filter for each frame of the training speech, and the way it is computed, reference may be made to step S121, which is not repeated here.

S322: determining the second sound source wave feature of the preprocessed training speech. The second sound source wave feature is the sound source wave feature obtained by multiplying the amplitude-frequency characteristic of the inverse filter of the training speech with the original speech in the frequency domain, thereby filtering out the vocal tract characteristics.

S323: determining a plurality of second related features of the preprocessed training speech, including the Mel cepstral coefficients, a fundamental frequency feature, a fundamental frequency perturbation feature, and an amplitude perturbation feature.

For the various related features of each frame of the training speech and how they are computed, reference may be made to step S123, which is not repeated here.

The above second vocal tract features, second sound source feature, and various related features of the training speech all characterize the sound source and vocal tract information of the training speech; these features reflect the differences between forged speech and real speech in their generation processes.

Similarly, in the embodiments of the present application, many kinds of sound-source- and vocal-tract-related features are obtained during the feature extraction process for the training speech. Too many features contain redundant information, which affects the generalization and computation speed of the model, so these features need to be screened to obtain features with higher relevance to the speech anti-spoofing task.

S33: determining the second principal component feature based on the second vocal tract features, the second sound source feature, and the various related features.

The embodiments of the present application use principal component analysis (PCA) to screen the second vocal tract features, the second sound source feature, and the various related features, including the following steps:

S331: concatenating the second vocal tract features, the second sound source feature, and the various related features to obtain a second concatenated feature; the number of features in the second concatenated feature is n.

In the embodiments of the present application, the n features of each frame of the training speech, including the amplitude-frequency characteristic of the filter, the amplitude-frequency characteristic of the inverse filter, the fundamental frequency feature, the fundamental frequency perturbation feature, the amplitude perturbation feature, the Mel cepstral coefficients, and the second sound source feature, are concatenated to obtain the second concatenated feature.

S332: de-centering each feature in the second concatenated feature.

In the embodiments of the present application, de-centering includes subtracting, from each feature in the second concatenated feature, the mean of that dimension.

S333: computing the covariance matrix based on the de-centered second concatenated feature.

S334: performing eigenvalue decomposition on the covariance matrix to obtain a plurality of second eigenvalues and corresponding second eigenvectors.

S335: selecting, in descending order of the second eigenvalues, the top k second eigenvectors to form a second transformation matrix, where k is the number of dimensions after dimensionality reduction and k < n.

S336: multiplying the second concatenated feature by the second transformation matrix to obtain the second principal component feature. The second principal component feature is the high-correlation feature obtained after dimensionality reduction.

S34: inputting the second principal component feature into the classifier for iterative training, and obtaining the trained classifier when the loss function converges.

In the embodiments of the present application, the classifier used is a hidden-layer feature extractor, namely a residual network (ResNet) with squeeze-and-excitation modules (SE-blocks).

The residual network is a variant of the convolutional neural network (CNN) that solves the degradation problem caused by deepening the network by adding residual connections between convolutional layers. Each residual module usually contains a multi-layer structure, and these layers can be the constituent layers of any neural network, which gives the residual network good extensibility. After the input x of a residual module passes through the forward computation F(x), the original input is added to the output without forward computation, forming a shortcut connection. With this computation, even if the forward computation is zero, the residual module simply acts as an identity mapping, which guarantees that network performance does not degrade. Residual modules allow convolutional neural networks to avoid the degradation problem while deepening the network, giving deep neural networks better performance.

The squeeze-and-excitation module is an extension module for convolutional neural networks that explicitly models the interdependence between feature channels and improves system performance from the channel perspective. The squeeze-and-excitation module first squeezes the input feature map: the spatial dimensions are squeezed by a global average pooling layer, turning each two-dimensional feature channel into a single real number with a global receptive field. The resulting one-dimensional channel features are then excited: weights are generated for each feature channel through learned parameters, explicitly modeling the correlation between feature channels. Finally, the computed weights are applied back to the original two-dimensional channel features, completing the recalibration of the original features in the channel dimension.
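
A minimal PyTorch sketch of such a squeeze-and-excitation block (illustrative only; the reduction ratio of 16 is an assumption, and the surrounding ResNet is omitted):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global average pooling ("squeeze"), a two-layer
    bottleneck producing per-channel weights ("excitation"), and channel-wise
    re-weighting of the input feature map."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))                    # squeeze -> (B, C) weights
        return x * w.view(b, c, 1, 1)                      # excitation / re-weighting
```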

The hidden-layer features extracted by the hidden-layer feature extractor are then passed through a linear layer for binary classification to obtain the final forged speech detection result.

In the embodiments of the present application, the classifier is trained with the Adam optimizer and the additive angular margin softmax loss (AAM-Softmax). The AAM-Softmax loss (in its standard form) is computed as:

L = -(1/N) · Σ_{i=1..N} log( e^{s·cos(θ_{yi} + m)} / ( e^{s·cos(θ_{yi} + m)} + Σ_{j≠yi} e^{s·cos θ_j} ) )

where m is the margin, with m = 0.2, and s = 40 is the scale parameter; f_i is the input embedding feature, W is the weight matrix, θ_j is the angle between the normalized embedding f_i and the j-th column of W, y_i denotes the label carried by the utterance, c is the number of classes (here 2), and N is the number of utterances in the training speech. The optimizer parameters are: β1 = 0.9, β2 = 0.999, ε = 1e-8, with a weight decay of 1e-4.
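
A PyTorch sketch of an AAM-Softmax loss with the hyper-parameters quoted above (a common implementation pattern, offered as an assumption rather than the patent's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmaxLoss(nn.Module):
    """Additive angular margin softmax with m = 0.2, s = 40, two classes."""
    def __init__(self, embed_dim: int, n_classes: int = 2,
                 margin: float = 0.2, scale: float = 40.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, embed_dim))
        self.margin, self.scale = margin, scale

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalised embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin m only to the target-class logits.
        target = F.one_hot(labels, cosine.size(1)).to(cosine.dtype)
        logits = self.scale * torch.cos(theta + self.margin * target)
        return F.cross_entropy(logits, labels)
```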

The speech detection method provided by the embodiments of the present application is applied on an electronic device on which a PyTorch running environment is deployed and configured.

The speech detection method provided by the embodiments of the present application uses inverse filtering to separate the sound source features from the vocal tract features and exploits the different generation mechanisms of forged and natural speech, so that the classifier can take in and fully analyze these distinctive features, while not introducing excessive data preprocessing steps, thereby consuming fewer resources and saving training time.

The speech detection method provided by the embodiments of the present application introduces a variety of sound-source- and vocal-tract-related features to strengthen the generalization ability of the model.

The speech detection method provided by the embodiments of the present application introduces principal component analysis during feature concatenation to achieve dimensionality reduction, selecting highly correlated features that contribute more to the task and reducing redundancy to improve efficiency.

The speech detection method provided by the embodiments of the present application has a simple structure and good portability, and the classifier used may be any other binary-classification convolutional neural network.

An embodiment of the present application provides an electronic device, including: at least one memory for storing a program; and at least one processor for executing the program stored in the memory, where, when the program stored in the memory is executed, the processor is configured to perform the speech detection method.

An embodiment of the present application provides a computer storage medium storing instructions which, when run on a computer, cause the computer to perform the speech detection method.

可以理解的是，本申请的实施例中的处理器可以是中央处理单元（central processing unit，CPU），还可以是其他通用处理器、数字信号处理器（digital signal processor，DSP）、专用集成电路（application specific integrated circuit，ASIC）、现场可编程门阵列（field programmable gate array，FPGA）或者其他可编程逻辑器件、晶体管逻辑器件，硬件部件或者其任意组合。通用处理器可以是微处理器，也可以是任何常规的处理器。It can be understood that the processor in the embodiments of the present application may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. A general-purpose processor may be a microprocessor or any conventional processor.

本申请的实施例中的方法步骤可以通过硬件的方式来实现，也可以由处理器执行软件指令的方式来实现。软件指令可以由相应的软件模块组成，软件模块可以被存放于随机存取存储器（random access memory，RAM）、闪存、只读存储器（read-only memory，ROM）、可编程只读存储器（programmable ROM，PROM）、可擦除可编程只读存储器（erasable PROM，EPROM）、电可擦除可编程只读存储器（electrically EPROM，EEPROM）、寄存器、硬盘、移动硬盘、CD-ROM或者本领域熟知的任何其它形式的存储介质中。一种示例性的存储介质耦合至处理器，从而使处理器能够从该存储介质读取信息，且可向该存储介质写入信息。当然，存储介质也可以是处理器的组成部分。处理器和存储介质可以位于ASIC中。The method steps in the embodiments of the present application may be implemented in hardware or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may reside in an ASIC.

在上述实施例中，可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时，可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时，全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中，或者通过所述计算机可读存储介质进行传输。所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线（例如同轴电缆、光纤、数字用户线（DSL））或无线（例如红外、无线、微波等）方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质（例如软盘、硬盘、磁带）、光介质（例如DVD）、或者半导体介质（例如固态硬盘（solid state disk，SSD））等。In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted via a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) means. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available media may be magnetic media (for example, floppy disks, hard disks, magnetic tapes), optical media (for example, DVDs), or semiconductor media (for example, solid state disks (SSD)).

可以理解的是,在本申请的实施例中涉及的各种数字编号仅为描述方便进行的区分,并不用来限制本申请的实施例的范围。It can be understood that the various numerical numbers involved in the embodiments of the present application are only for convenience of description and are not used to limit the scope of the embodiments of the present application.

Claims (10)

1. A method of voice detection, the method comprising:
obtaining target voice, and preprocessing the target voice, wherein the preprocessing comprises pre-emphasis, framing and windowing;
determining a first channel characteristic, a first sound source wave characteristic and a plurality of first related characteristics of the preprocessed target voice;
determining a first principal component feature based on the first channel characteristic, the first sound source wave characteristic, and the plurality of first related characteristics;
inputting the first principal component characteristics into a trained classifier, and outputting a classification result, wherein the classification result is fake voice or natural voice.
2. The method of claim 1, wherein the first channel characteristics include an amplitude-frequency characteristic of a channel filter and an amplitude-frequency characteristic of an inverse filter, and wherein determining the first channel characteristics of the preprocessed target speech comprises:
predicting the filter parameters of each frame of the target voice after pretreatment by using linear prediction coding;
and calculating and determining the amplitude-frequency characteristic of the sound channel filter and the amplitude-frequency characteristic of the inverse filter according to the filter parameters of each frame.
3. The method of claim 1, wherein the plurality of first correlation features includes a fundamental frequency feature, a fundamental frequency perturbation feature, an amplitude perturbation feature, and a mel-frequency cepstral coefficient.
4. The method of claim 1, wherein said determining the first channel characteristic, the first source wave characteristic, and the plurality of first correlation characteristics of the pre-processed target speech comprises:
acquiring short-time Fourier features of the target voice after framing;
and carrying out inverse filtering on the short-time Fourier features through an inverse filter to obtain first sound source wave features of the preprocessed target voice.
5. The method of claim 1, wherein the determining the first principal component feature based on the first channel feature, the first source wave feature, and the plurality of first correlation features comprises:
splicing the first sound channel characteristic, the first sound source wave characteristic and a plurality of first related characteristics to obtain a first splicing characteristic; the feature quantity of the first splicing features is n;
decentralizing each of the first stitching features; calculating a covariance matrix of the first decentralised stitching feature;
performing eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues and corresponding eigenvectors;
selecting the first k eigenvectors to form a conversion matrix according to the sequence of the eigenvalues from large to small, wherein k is smaller than n;
multiplying the first splicing characteristic by the conversion matrix to obtain the first principal component characteristic.
6. The method according to any one of claims 1-5, further comprising the step of training a classifier:
acquiring training voice of a training set with a label, and preprocessing the training voice, wherein the preprocessing comprises pre-emphasis, framing and windowing;
determining a second channel characteristic, a second sound source wave characteristic and a plurality of second related characteristics of the preprocessed training speech;
determining a second principal component feature based on the second channel characteristic, the second sound source wave characteristic, and the plurality of second related characteristics of the training speech;
and inputting the second principal component characteristics into a classifier, performing iterative training, and obtaining the trained classifier under the condition of convergence of a loss function.
7. The method of claim 6, wherein the second channel characteristic includes an amplitude-frequency characteristic of a channel filter and an amplitude-frequency characteristic of an inverse filter, and wherein determining the second channel characteristic of the pre-processed training speech comprises:
predicting filter parameters for each frame of the pre-processed training speech using linear predictive coding;
and calculating and determining the amplitude-frequency characteristic of the second channel filter and the amplitude-frequency characteristic of the inverse filter according to the filter parameters of each frame of the training voice.
8. The method of claim 6, wherein the plurality of second correlation features further comprises a fundamental frequency feature, a fundamental frequency perturbation feature, an amplitude perturbation feature, and a mel-frequency cepstral coefficient for each frame of the training speech.
9. The method of claim 6, wherein the determining the second channel characteristic, the second sound source wave characteristic, and the plurality of second related characteristics of the pre-processed training speech comprises:
acquiring short-time Fourier features of the training voice after framing;
and carrying out inverse filtering on the short-time Fourier features through the inverse filter to obtain second sound source wave features of the preprocessed training voice.
10. The method of claim 6, wherein the determining the second principal component feature based on the second channel characteristic, the second sound source wave characteristic, and the plurality of second related characteristics of the training speech comprises:
splicing the second channel characteristics, the second sound source wave characteristics and a plurality of second correlation characteristics of the training voice to obtain second splicing characteristics; the feature quantity of the second splicing features is n;
decentralizing each of the second stitching features; calculating a covariance matrix of the second decentralised stitching feature;
performing eigenvalue decomposition on the covariance matrix to obtain n eigenvalues and corresponding eigenvectors;
selecting the first k eigenvectors to form a conversion matrix according to the sequence of the eigenvalues from large to small, wherein k is smaller than n;
multiplying the second stitching feature by the transformation matrix to obtain the second principal component feature.
CN202310505872.2A 2023-05-06 2023-05-06 A voice detection method Pending CN116778910A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310505872.2A CN116778910A (en) 2023-05-06 2023-05-06 A voice detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310505872.2A CN116778910A (en) 2023-05-06 2023-05-06 A voice detection method

Publications (1)

Publication Number Publication Date
CN116778910A true CN116778910A (en) 2023-09-19

Family

ID=88012343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310505872.2A Pending CN116778910A (en) 2023-05-06 2023-05-06 A voice detection method

Country Status (1)

Country Link
CN (1) CN116778910A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117133295A (en) * 2023-10-24 2023-11-28 清华大学 Fake voice detection method, device and equipment based on brain-like perception and decision
CN117133295B (en) * 2023-10-24 2023-12-29 清华大学 Forgery speech detection methods, devices and equipment based on brain-like perception and decision-making
CN119724163A (en) * 2025-02-26 2025-03-28 国网安徽省电力有限公司合肥供电公司 A method for checking power grid equipment status based on finite state machine

Similar Documents

Publication Publication Date Title
CN110459241B (en) Method and system for extracting voice features
Reghunath et al. Transformer-based ensemble method for multiple predominant instruments recognition in polyphonic music
CN116778910A (en) A voice detection method
CN114302301B (en) Frequency response correction method and related product
CN115273904A (en) Angry emotion recognition method and device based on multi-feature fusion
Cui et al. Research on audio recognition based on the deep neural network in music teaching
Cheng et al. DNN-based speech enhancement with self-attention on feature dimension
Zhang et al. Chinese dialect tone’s recognition using gated spiking neural P systems
CN111666996B (en) High-precision equipment source identification method based on attention mechanism
Ahmad et al. Determining speaker attributes from stress-affected speech in emergency situations with hybrid SVM-DNN architecture
US12361946B1 (en) Speech interaction method, speech interaction system and storage medium
CN119993193A (en) Audio feature extraction method and device based on neural network
Rituerto-González et al. End-to-end recurrent denoising autoencoder embeddings for speaker identification
CN116994563A (en) Voice recognition method based on ADRMFCC fusion characteristics
CN117636839A (en) Speech synthesis method and device
CN116959425A (en) Robust tracing method and device for fake voice algorithm
CN112951270B (en) Voice fluency detection method and device and electronic equipment
Konduru et al. Multidimensional feature diversity based speech signal acquisition
Liu Study on the application of improved audio recognition technology based on deep learning in vocal music teaching
Liu et al. YAMNet-based transfer learning for compact noise classification in urban and wireless systems
Arora et al. An efficient text-independent speaker verification for short utterance data from Mobile devices
Therese et al. A linear visual assessment tendency based clustering with power normalized cepstral coefficients for audio signal recognition system
Paniagua-Peñaranda et al. Assessing the Robustness of Recurrent Neural Networks to Enhance the Spectrum of Reverberated Speech
Wei et al. Lambda-vector modeling temporal and channel interactions for text-independent speaker verification
Eshaghi et al. A voice activity detection algorithm in spectro-temporal domain using sparse representation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination