CN110600038A - Audio fingerprint dimension reduction method based on discrete Gini coefficient - Google Patents
- Publication number
- CN110600038A (application CN201910784077.5A)
- Authority
- CN
- China
- Prior art keywords
- audio
- fingerprint
- discrete
- dimension
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/08—Use of distortion metrics or a particular distance between probe pattern and reference templates
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/14—Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/54—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Business, Economics & Management (AREA)
- Game Theory and Decision Science (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to an audio fingerprint dimensionality reduction method based on discrete Gini coefficient calculation, intended to solve the problem of the high dimensionality of audio fingerprint features. The method comprises building a target sound library by category, extracting the fingerprint features of sample audio, and introducing a discrete Gini coefficient to reduce the dimensionality of the fingerprint features. A discrete Gini coefficient is introduced for each dimension of the audio fingerprint, and its magnitude reflects how well that dimension distinguishes different audio. Dimensionality reduction is achieved by retaining the dimensions with large discrete Gini coefficients and deleting those with small ones. A sample audio fingerprint library built from the dimensionality-reduced fingerprint features has a smaller data volume and a higher utilization rate.
Description
Technical field
The invention belongs to the field of intelligent, application-oriented sound-field technology, and in particular relates to an audio fingerprint dimensionality reduction method based on discrete Gini coefficient calculation.
Background
Intelligent systems have attracted wide interest and study in recent years, and intelligent audio recognition is an important foundation for the development of artificial intelligence. Audio recognition depends on audio feature extraction, and among the many available audio features the audio fingerprint has been one of the most popular in recent years. An audio fingerprint is a compact, content-based digital signature that represents the important acoustic characteristics of a piece of audio; its main purpose is to represent a large amount of audio data with a small amount of digital information. Compared with traditional audio features, it has a small data volume, high noise robustness, and a relatively simple extraction process, and it is widely used in music recognition, advertising monitoring, copyright protection, and other fields. One drawback of audio fingerprints is their high dimensionality, which slows recognition and consumes a large amount of memory. If the dimensionality of the fingerprint can be reduced, the fingerprint data volume can be reduced substantially, audio retrieval becomes faster, and recognition performance improves.
Summary of the invention
To address the high dimensionality of audio fingerprint features, the invention introduces a discrete Gini coefficient for each dimension of the audio fingerprint. The magnitude of the discrete Gini coefficient of a dimension reflects how well that dimension distinguishes different audio. The larger the discrete Gini coefficient of a fingerprint bit, the more different audio signals differ in that bit and the better the bit discriminates; conversely, a small coefficient indicates poor discrimination. Retaining the bits that discriminate well and removing those that discriminate poorly converts the high-dimensional fingerprint into a lower-dimensional one and effectively reduces the fingerprint data volume.
The technical solution addresses the excessive data volume of audio fingerprint feature libraries: a sample audio fingerprint library built from the dimensionality-reduced fingerprint features has a smaller data volume and a higher utilization rate. The method consists of the following steps:
Step 1, build the target sound library by category
The design classifies the audio and builds the library according to the type of audio or the data already available. Different types of audio have different characteristics, so classifying the audio makes it easier to find the features common to each type; without classification, the quality of the fingerprint dimensionality reduction suffers. The existing audio data are stored by category, and fingerprint features are then extracted separately for each category. The flowchart of the audio fingerprint extraction algorithm is shown in Figure 1.
Step 2, extract the fingerprint features of the sample audio
Audio data of each category are selected from the constructed audio library as the original sample audio. Fingerprint features are extracted from the original samples, and the discrete Gini coefficient is introduced to reduce their dimensionality. The procedure is as follows:
Step 2.1: preprocess the target audio
Before feature extraction, the audio is preprocessed. Preprocessing consists of band-pass filtering, pre-emphasis, and framing.
(1) An audio signal sampled at 8 kHz is band-pass filtered to extract the frequency components most important to human auditory perception; the band-pass filter has a passband of 20 Hz to 4000 Hz. The design uses a finite impulse response (FIR) filter, and the filtering process is:
where T is the number of sample points of the signal being processed, p is the time-domain index, h(l) are the FIR filter coefficients, x(p) is the input signal, and y(p) is the band-pass filtered signal.
(2) The band-pass filtered signal y(p) is pre-emphasized. The design uses a digital filter with a slope of 6 dB/octave to boost the high-frequency content of the signal, so that the spectrum becomes relatively flat and can be computed with the same signal-to-noise ratio across the whole band from low to high frequencies. Pre-emphasis is given by:
s(p)=y(p)-μ*y(p-1)
where μ is the pre-emphasis coefficient, set to 0.96, and s(p) is the pre-emphasized signal.
(3) The pre-emphasized signal is windowed and divided into frames. Frames could be taken as contiguous segments, but overlapping segments are generally used so that the transition between frames is smooth and continuity is preserved. The audio is divided into frames 0.064 seconds long with 75% overlap between consecutive frames, and each frame is weighted with a Hanning window of the same length. The window function is:
where T is the length of the Hanning window, which is also the frame length.
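The three preprocessing stages can be sketched in plain Python as follows. The equation images for the FIR filter and the window are not reproduced in this text, so the sketch assumes the standard direct-form convolution y(p) = sum over l of h(l)*x(p-l) and the symmetric Hanning window w(p) = 0.5*(1 - cos(2*pi*p/(T-1))). The windowed-sinc design, the 101-tap length, the zero initial conditions, and all function names are illustrative assumptions, not details from the patent. At an 8 kHz sampling rate the 4000 Hz upper band edge coincides with the Nyquist frequency, so the sketch realizes the 20 Hz to 4000 Hz band-pass as a high-pass filter with a 20 Hz cutoff.

```python
import math

def design_highpass_fir(cutoff_hz, fs_hz, num_taps=101):
    """Windowed-sinc high-pass FIR by spectral inversion (assumed design)."""
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd")
    mid = (num_taps - 1) // 2
    fc = cutoff_hz / fs_hz  # cutoff in cycles per sample
    lowpass = []
    for l in range(num_taps):
        k = l - mid
        ideal = 2.0 * fc if k == 0 else math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        hamming = 0.54 - 0.46 * math.cos(2.0 * math.pi * l / (num_taps - 1))
        lowpass.append(ideal * hamming)
    gain = sum(lowpass)
    lowpass = [c / gain for c in lowpass]  # unit gain at DC
    # A unit impulse minus the low-pass response leaves a high-pass response.
    return [(1.0 if l == mid else 0.0) - c for l, c in enumerate(lowpass)]

def fir_filter(h, x):
    """y(p) = sum_l h(l) * x(p - l), with x(p) taken as 0 for p < 0."""
    return [sum(h[l] * x[p - l] for l in range(min(len(h), p + 1)))
            for p in range(len(x))]

def pre_emphasis(y, mu=0.96):
    """s(p) = y(p) - mu * y(p - 1); the first sample is kept unchanged."""
    return [y[p] - mu * y[p - 1] if p > 0 else y[p] for p in range(len(y))]

def hann_window(T):
    """Symmetric Hanning window of length T (assumed variant)."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * p / (T - 1))) for p in range(T)]

def frame_signal(s, fs_hz=8000, frame_sec=0.064, overlap=0.75):
    """Split s into overlapping frames and weight each with a Hanning window."""
    T = int(round(frame_sec * fs_hz))      # 512 samples at 8 kHz
    hop = int(round(T * (1.0 - overlap)))  # 128 samples for 75% overlap
    w = hann_window(T)
    return [[s[start + p] * w[p] for p in range(T)]
            for start in range(0, len(s) - T + 1, hop)]

# Example: 0.2 s of a 440 Hz tone sampled at 8 kHz.
fs = 8000
x = [math.sin(2.0 * math.pi * 440.0 * p / fs) for p in range(1600)]
h = design_highpass_fir(cutoff_hz=20.0, fs_hz=fs)
frames = frame_signal(pre_emphasis(fir_filter(h, x)))
```

A 101-tap filter has a transition band far wider than 20 Hz, so a practical design would need many more taps or a different filter structure; the short length is used here only to keep the example fast.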
Step 2.2: extract fingerprint features from the audio data
A digital audio fingerprint can be regarded as a condensed summary of a piece of audio that captures its perceptually most important content. The dimensionality of the fingerprint is a key factor in both the fingerprint data volume and the retrieval speed, so this method uses the discrete Gini coefficient to reduce the fingerprint dimensionality. Before dimensionality reduction, the fingerprint features are extracted as follows:
(1) A discrete Fourier transform is applied to each frame of the framed audio signal. The transform is:
where X(k) is the frequency-domain signal, x(p) is the time-domain signal, k is the frequency index, and T is the length of the discrete Fourier transform.
(2) The spectrum obtained from the discrete Fourier transform is divided into subbands. 33 non-overlapping bands are selected in the 20-4000 Hz range, where human discrimination of audio is concentrated. The bands are logarithmically spaced, because the response of the human ear to frequency is not linear. The start frequency of the m-th subband, which is also the end frequency of the (m-1)-th subband, f(m), is:
where Fmin is the lower mapping limit (20 Hz here), Fmax is the upper mapping limit (4000 Hz here), and M is the number of subbands (33 here).
(3) The energy of each of the 33 non-overlapping subbands is computed for every frame. Let the m-th subband start at frequency f(m) and end at f(m+1), and let X(k) be the frequency-domain signal after the discrete Fourier transform; the energy of the m-th subband of the n-th frame is:
(4) A sub-fingerprint is generated for each frame by bit-difference discrimination on the 33 subband energies, giving a 32-bit binary code (the sub-fingerprint) per frame. Let E(n,m) be the energy of the m-th subband of the n-th frame and F(n,m) the corresponding binary bit; the decision rule for the binary fingerprint of each frame is:
As the rule shows, each frame yields one 32-dimensional binary sub-fingerprint. A single sub-fingerprint carries little information, so an audio fingerprint feature usually consists of multiple sub-fingerprints.
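The extraction steps above can be sketched as follows. The equation images are missing from this text, so each formula here is an assumption chosen to agree with the surrounding definitions: the standard DFT X(k) = sum over p of x(p)*exp(-j*2*pi*k*p/T); logarithmically spaced edges f(m) = Fmin*(Fmax/Fmin)^(m/M) for m = 0..M; subband energy as the sum of |X(k)|^2 over the bins whose frequency lies in [f(m), f(m+1)); and a bit F(n,m) = 1 when E(n,m) - E(n,m+1) > 0. The last rule compares adjacent bands within one frame. The well-known Haitsma-Kalker scheme additionally differences across consecutive frames, and the text does not say which variant the patent uses. All function names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive O(T^2) discrete Fourier transform of one frame."""
    T = len(x)
    return [sum(x[p] * cmath.exp(-2j * cmath.pi * k * p / T) for p in range(T))
            for k in range(T)]

def subband_edges(f_min=20.0, f_max=4000.0, M=33):
    """M + 1 logarithmically spaced band edges from f_min to f_max (assumed form)."""
    return [f_min * (f_max / f_min) ** (m / M) for m in range(M + 1)]

def subband_energies(X, edges, fs_hz=8000.0):
    """Sum |X(k)|^2 over the DFT bins that fall in each band [f(m), f(m+1))."""
    T = len(X)
    energies = [0.0] * (len(edges) - 1)
    for k in range(T // 2 + 1):  # non-negative frequencies only
        f = k * fs_hz / T
        for m in range(len(edges) - 1):
            if edges[m] <= f < edges[m + 1]:
                energies[m] += abs(X[k]) ** 2
                break
    return energies

def sub_fingerprint(energies):
    """One bit per adjacent band pair: 1 if E(m) > E(m+1), else 0 (assumed rule)."""
    return [1 if energies[m] - energies[m + 1] > 0 else 0
            for m in range(len(energies) - 1)]

# Example: a 512-sample frame containing a 1000 Hz tone sampled at 8 kHz.
frame = [math.cos(2.0 * math.pi * 1000.0 * p / 8000.0) for p in range(512)]
edges = subband_edges()
bits = sub_fingerprint(subband_energies(dft(frame), edges))
```

With 512-sample frames the DFT bin spacing is 15.625 Hz, which is wider than the lowest logarithmic bands, so some low bands contain no bins and have zero energy in this sketch. A practical implementation would also replace the naive DFT with an FFT.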
Step 3, reduce the dimensionality of the audio fingerprint features
After the extraction process above, each frame of audio data yields one 32-dimensional binary sub-fingerprint. The fingerprint of a piece of audio consists of many such sub-fingerprints, so its data volume is still large, and in practice it is desirable to reduce the fingerprint dimensionality further and thereby reduce the fingerprint data volume. This design proposes an audio fingerprint dimensionality reduction method based on discrete Gini coefficient calculation: a discrete Gini coefficient is computed for each dimension of the audio fingerprint, and the magnitudes of these coefficients are used to reduce the number of fingerprint dimensions.
The data of each fingerprint dimension are divided into groups of 50 frames, and the discrete Gini coefficient of each dimension is computed over these groups. The discrete Gini coefficient of a dimension reflects how dispersed, that is, how varied, the fingerprint data in that dimension are. The larger the discrete Gini coefficient of a fingerprint bit, the more different audio signals differ in that bit and the better the bit discriminates; a smaller coefficient indicates poorer discrimination. By retaining the dimensions that discriminate well and removing those that discriminate poorly, the design converts the 32-dimensional audio fingerprint into a lower-dimensional one and reduces the fingerprint data volume.
Step 3.1: compute the discrete Gini coefficient of each fingerprint dimension
The derivation and the specific steps are as follows:
(1) Obtain the discrete Lorenz curve of the audio fingerprint. The discrete Lorenz curve is the key curve for computing the discrete Gini coefficient and is formed by the elements of the cumulative fingerprint data proportion vector. Let j denote the dimension index of the audio fingerprint, j = 1, 2, ..., 32. The cumulative fingerprint data proportion vector is computed as follows:
The audio fingerprints of each category in the fingerprint library are processed frame by frame. Every 50 frames of fingerprint data form one group, giving N groups in total, and the j-th dimension cumulative fingerprint data vector is constructed:
where i is the group number, ranging over the N groups. The j-th dimension cumulative fingerprint data proportion vector is then constructed, with each element defined as:
The curve formed by the elements of the proportion vector is the discrete Lorenz curve. Specifically, the discrete Lorenz curve of the j-th fingerprint dimension is drawn as follows:
The abscissa is the cumulative proportion of fingerprint groups and the ordinate is the cumulative fingerprint data proportion of the j-th dimension. Each element of the j-th dimension cumulative fingerprint data proportion vector is plotted as a discrete point, and the polyline connecting these points is the discrete Lorenz curve of the j-th fingerprint dimension.
(2) Compute the discrete Gini coefficient of each dimension for each category of audio fingerprint. Taking the discrete Lorenz curve obtained above as the dividing line, the Gini coefficient of the j-th fingerprint dimension is:
where Sa is the closed area between the diagonal segment OA and the discrete Lorenz curve, with point O at (0,0) and point A at (1,1); Sb is the closed area enclosed by the segments OB and BA and the discrete Lorenz curve, with point B at (1,0); and Gj is the Gini coefficient of the j-th fingerprint dimension. The auxiliary diagram for computing the discrete Gini coefficient is shown in Figure 4.
It follows that Sa+Sb is the area of the triangle enclosed by the diagonal segment OA and the segments OB and BA, that is, Sa+Sb=1/2. Because the audio fingerprint is discrete, the design discretizes the formula above as:
This gives the discrete Gini coefficient of the j-th fingerprint dimension, expressed in terms of the group number i and the cumulative fingerprint data proportion of the j-th dimension at the i-th group.
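The formulas for the cumulative vectors and for the Gini coefficient are missing from this text. The sketch below assumes the classical construction: the data of group i in dimension j is the sum of that bit over the 50 frames of the group, the groups are sorted in ascending order (as a Lorenz curve requires), the cumulative proportions are plotted against i/N, Gj = Sa/(Sa+Sb) = 1 - 2*Sb because Sa+Sb = 1/2, and Sb is the trapezoidal area under the discrete Lorenz curve. Whether the patent sorts the groups or uses exactly this discretization cannot be confirmed from the text, and the function names are hypothetical.

```python
def group_sums(bits_j, group_size=50):
    """Sum one fingerprint dimension over consecutive groups of frames.

    An incomplete trailing group is dropped (an assumption).
    """
    n_groups = len(bits_j) // group_size
    return [sum(bits_j[g * group_size:(g + 1) * group_size])
            for g in range(n_groups)]

def lorenz_points(values):
    """Points (i/N, cumulative share) of the discrete Lorenz curve, from (0, 0)."""
    ordered = sorted(values)  # ascending order, assumed as in the classical curve
    total = sum(ordered)
    if total == 0:
        raise ValueError("Lorenz curve is undefined when all values are zero")
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, v in enumerate(ordered, start=1):
        running += v
        points.append((i / n, running / total))
    return points

def discrete_gini(values):
    """Gini coefficient as 1 - 2*Sb, with Sb the trapezoidal area under the curve."""
    pts = lorenz_points(values)
    s_b = sum((x1 - x0) * (y0 + y1) / 2.0
              for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 1.0 - 2.0 * s_b

# Example: identical groups give 0; all data in one of four groups gives 0.75.
equal = discrete_gini([5, 5, 5, 5])
concentrated = discrete_gini([0, 0, 0, 10])
```

Under these assumptions the coefficient for N groups ranges from 0 (every group carries the same amount of data) to 1 - 1/N (all data in a single group).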
Step 3.2: collect the discrete Gini coefficients of each dimension for different types of audio fingerprint and reduce the fingerprint dimensionality
The discrete Gini coefficients of the fingerprint dimensions are trained for the application scenario. Training sets are built from the audio data by data type, or from existing data sets; for the audio in each training set, the discrete Gini coefficient of every fingerprint dimension is computed and analyzed statistically.
The discrete Gini coefficient designed in this invention is applicable to dimensionality reduction of all types of audio fingerprint. Here the following categories of audio are selected, and the discrete Gini coefficients of their fingerprint dimensions are analyzed statistically:
Figure 5 shows the discrete Gini coefficients of each dimension of the audio fingerprints of abnormal sounds, computed over the 32 fingerprint dimensions from the audio fingerprint library. By setting a threshold on the discrete Gini coefficient, the fingerprint dimensions below the threshold can be discarded and those above it retained. For abnormal-sound fingerprints the threshold range is 0.36 to 0.38. The figure shows that the coefficients differ from dimension to dimension: some are small, such as dimensions 2, 25, and 26, and some are large, such as dimensions 11, 20, and 28. The magnitude of the coefficient indicates how much the fingerprint data in that dimension vary. If the threshold is set to 0.37, the figure shows that dimensions 2, 5, 24, 25, and 26 fall below it; the fingerprint data in these 5 dimensions vary little from frame to frame and can be discarded. Discarding them reduces the dimensionality of each frame's fingerprint and therefore the data volume of the audio fingerprint library.
Figure 6 shows the discrete Gini coefficients of each dimension of the audio fingerprints of normal speech, computed over the 32 fingerprint dimensions from the audio fingerprint library. By setting a threshold on the discrete Gini coefficient, the fingerprint dimensions below the threshold can be discarded and those above it retained. For normal-speech fingerprints the threshold range is 0.36 to 0.38. The figure shows that dimensions 2, 22, and 25 have small coefficients and that dimensions 18, 26, and 28 have large ones. If the threshold is set to 0.37, dimensions 2, 22, and 25 fall below it; the fingerprint data in these 3 dimensions vary little from frame to frame and can be discarded.
Figure 7 shows the discrete Gini coefficients of each dimension of the audio fingerprints of songs, computed over the 32 fingerprint dimensions from the audio fingerprint library. By setting a threshold on the discrete Gini coefficient, the fingerprint dimensions below the threshold can be discarded and those above it retained. For song fingerprints the threshold range is 0.36 to 0.38. The figure shows that the coefficients of this category vary considerably and are relatively scattered across dimensions: they are small in dimensions 2, 14, 15, and 25 and large in dimensions 4, 5, 11, 18, and 29. If the threshold is set to 0.37, dimensions 1, 2, 14, 15, 25, and 26 fall below it; the fingerprint data in these 6 dimensions vary little from frame to frame and can be discarded.
These figures show that although the discrete Gini coefficients differ between audio categories and follow different trends across dimensions, every category has several dimensions with small Gini coefficients. The fingerprint discriminates poorly in those dimensions, so their data can be removed, converting the 32-dimensional audio fingerprint into a lower-dimensional one and effectively reducing the fingerprint data volume.
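The threshold step can be sketched in a few lines. The coefficient values below are hypothetical and are not read from Figures 5 to 7. The function names are also hypothetical, and keeping a dimension whose coefficient equals the threshold is an assumption, since the text only says that dimensions below the threshold are discarded and those above it are retained. Indices in the code are 0-based, whereas the text numbers dimensions from 1.

```python
def select_dimensions(gini, threshold=0.37):
    """Return the 0-based indices of the dimensions to keep."""
    return [j for j, g in enumerate(gini) if g >= threshold]

def reduce_fingerprint(sub_fingerprint, kept):
    """Keep only the selected bits of one sub-fingerprint."""
    return [sub_fingerprint[j] for j in kept]

# Hypothetical per-dimension coefficients for a 5-dimensional toy fingerprint.
gini = [0.40, 0.35, 0.38, 0.36, 0.41]
kept = select_dimensions(gini)            # dimensions 0, 2 and 4 are kept
reduced = reduce_fingerprint([1, 0, 1, 1, 0], kept)
```

The list of kept indices is computed once per audio category from its training set and then applied to every sub-fingerprint of that category, so library entries and query fingerprints stay comparable bit for bit.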
The advantages of the invention are:
1. The algorithm has low complexity and greater flexibility.
2. The data volume is smaller than that of traditional audio features.
3. Using the discrete Gini coefficient as the dimensionality reduction algorithm keeps complexity low and allows large batches of data to be processed.
4. The dimensionality-reduced audio fingerprint is more robust.
Brief description of the drawings
Figure 1. Flowchart of the audio fingerprint extraction algorithm
Figure 2. 32-dimensional audio fingerprint block of a piece of audio
Figure 3. Lorenz curve of the first dimension of abnormal-sound audio fingerprints
Figure 4. Auxiliary diagram for computing the discrete Gini coefficient of an audio fingerprint
Figure 5. Discrete Gini coefficients of each dimension of abnormal-sound audio fingerprints
Figure 6. Discrete Gini coefficients of each dimension of normal-speech audio fingerprints
Figure 7. Discrete Gini coefficients of each dimension of song audio fingerprints
Detailed description
The invention provides an audio fingerprint dimensionality reduction method based on discrete Gini coefficient calculation. The method comprises building a target sound library by category, extracting the fingerprint features of sample audio, and introducing a discrete Gini coefficient to reduce the dimensionality of the fingerprint features.
The technical solution addresses the excessive data volume of audio fingerprint feature libraries: a sample audio fingerprint library built from the dimensionality-reduced fingerprint features has a smaller data volume and a higher utilization rate. The method consists of the following steps:
Step 1, build the target sound library by category
The design classifies the audio and builds the library according to the type of audio or the data already available. Different types of audio have different characteristics, so classifying the audio makes it easier to find the features common to each type; without classification, the quality of the fingerprint dimensionality reduction suffers.
The discrete Gini coefficient designed in this invention is applicable to dimensionality reduction of all types of audio fingerprint. The existing audio data are therefore stored by category, and fingerprint features are then extracted separately for each category. The flowchart of the audio fingerprint extraction algorithm is shown in Figure 1.
步骤2,提取样本音频的指纹特征Step 2, extract the fingerprint features of the sample audio
从已构建的音频库中选取各类音频数据作为原始样本音频。提取原始样本指纹特征并引入离散基尼系数对指纹特征进行降维。具体流程为:Select various audio data from the built audio library as the original sample audio. The original sample fingerprint features are extracted and the discrete Gini coefficient is introduced to reduce the dimensionality of the fingerprint features. The specific process is:
Step2.1:对目标音频预处理Step2.1: Preprocessing the target audio
在数据特征提取前,先做预处理操作。预处理包括:带通滤波、预加重、分帧。Before data feature extraction, preprocessing operation is done first. Preprocessing includes: bandpass filtering, pre-emphasis, and framing.
(1) An audio signal sampled at 8 kHz is band-pass filtered. To retain the frequency components most important to human hearing, a band-pass filter with a passband of 20 Hz to 4000 Hz is used. The design uses a finite impulse response (FIR) filter, and the filtering operation is:
$$y(p) = \sum_{l=0}^{T-1} h(l)\,x(p-l)$$
where T is the number of sampling points of the processed signal, p is the time-domain index, h(l) are the FIR filter coefficients, x(p) is the input signal, and y(p) is the band-pass filtered signal.
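The filtering step can be illustrated with a direct convolution. This is a minimal sketch, not the patent's filter design: the function name `fir_filter` and the three demo taps are hypothetical, and samples before the start of the signal are assumed to be zero.

```python
def fir_filter(x, h):
    """Direct-form FIR filtering: y(p) = sum over l of h(l) * x(p - l).

    Assumption: samples before the start of the signal are treated as zero.
    """
    y = []
    for p in range(len(x)):
        acc = 0.0
        for l, coeff in enumerate(h):
            if p - l >= 0:
                acc += coeff * x[p - l]
        y.append(acc)
    return y


# Hypothetical 3-tap coefficients, for illustration only; a real band-pass
# design for the 20-4000 Hz passband would need many more taps.
h_demo = [0.25, 0.5, 0.25]
impulse = [1.0, 0.0, 0.0, 0.0]
print(fir_filter(impulse, h_demo))  # the impulse response reproduces the taps
```

Note that at an 8 kHz sampling rate, 4000 Hz is the Nyquist frequency, so the upper passband edge coincides with the top of the representable band and the filter mainly removes content below 20 Hz.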
(2) The band-pass filtered signal y(p) is pre-emphasized with a digital filter that has a 6 dB/octave slope. This boosts the high-frequency part of the signal and flattens its spectrum, so that the spectrum can be computed with the same signal-to-noise ratio across the whole band from low to high frequencies. Pre-emphasis is given by:
$$s(p) = y(p) - \mu\,y(p-1)$$
where μ is the pre-emphasis coefficient, set to 0.96, and s(p) is the pre-emphasized signal.
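The pre-emphasis formula maps directly to a first-order difference. The sketch below assumes y(-1) = 0 for the first sample, which the text does not specify; the function name `pre_emphasis` is illustrative.

```python
def pre_emphasis(y, mu=0.96):
    """Return s(p) = y(p) - mu * y(p - 1).

    Assumption: y(-1) is taken as 0 for the first sample.
    """
    s = []
    prev = 0.0
    for sample in y:
        s.append(sample - mu * prev)
        prev = sample
    return s


# A constant (low-frequency) input is strongly attenuated after the first sample.
print(pre_emphasis([1.0, 1.0, 1.0]))
```

A constant input is reduced to about 4% of its level, while rapidly alternating samples are amplified, which is the high-frequency boost described above.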
(3) The pre-emphasized signal is windowed and split into frames. Frames could be taken as contiguous segments, but overlapping segments are generally used so that the transition between frames is smooth and continuity is preserved. The audio is split into frames of 0.064 s (512 samples at 8 kHz) with 75% overlap between adjacent frames, and each frame is weighted by a Hanning window of the same length:
$$w(p) = 0.5\left[1 - \cos\!\left(\frac{2\pi p}{T-1}\right)\right], \quad 0 \le p \le T-1$$
where T is the length of the Hanning window, which is also the frame length.
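Framing and windowing can be sketched as follows. The symmetric Hanning form with a T-1 denominator is an assumption about the exact window the patent uses, and the function names and the choice to drop a trailing partial frame are illustrative.

```python
import math


def hann_window(T):
    """Symmetric Hanning window of length T (assumed form)."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * p / (T - 1))) for p in range(T)]


def frame_signal(s, fs=8000, frame_seconds=0.064, overlap=0.75):
    """Split s into overlapping, Hanning-weighted frames.

    Assumption: a trailing partial frame is dropped.
    """
    frame_len = int(round(fs * frame_seconds))      # 512 samples at 8 kHz
    hop = int(round(frame_len * (1.0 - overlap)))   # 128 samples for 75% overlap
    window = hann_window(frame_len)
    frames = []
    for start in range(0, len(s) - frame_len + 1, hop):
        frames.append([s[start + p] * window[p] for p in range(frame_len)])
    return frames


frames = frame_signal([1.0] * 1024)
print(len(frames), len(frames[0]))
```

With these parameters a new frame starts every 128 samples, or 16 ms, so one second of audio yields roughly 62 sub-fingerprints.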
Step 2.2: Extract fingerprint features from the audio data
A digital audio fingerprint can be regarded as a condensed summary of a piece of audio: it captures the perceptually most important part of the audio data. The dimensionality of the fingerprint is a key factor in both the amount of fingerprint data and the retrieval speed, so this technique uses the discrete Gini coefficient to reduce the feature dimensionality of audio fingerprints. Fingerprint features must be extracted before the reduction, as follows:
(1) A discrete Fourier transform is applied to each frame of the framed audio signal:
$$X(k) = \sum_{p=0}^{T-1} x(p)\,e^{-\mathrm{j}\,2\pi k p / T}, \quad k = 0, 1, \dots, T-1$$
where X(k) is the frequency-domain signal, x(p) is the time-domain signal, k is the frequency index, T is the sample length of the discrete Fourier transform, and j in the exponent denotes the imaginary unit.
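For reference, the transform can be written out directly from its definition. This is a naive O(T²) sketch for illustration; a practical implementation would use an FFT routine instead. The function name `dft` is illustrative.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform: X(k) = sum_p x(p) * exp(-2j*pi*k*p/T)."""
    T = len(x)
    return [
        sum(x[p] * cmath.exp(-2j * cmath.pi * k * p / T) for p in range(T))
        for k in range(T)
    ]


# A constant frame has all of its energy in bin 0.
spectrum = dft([1.0, 1.0, 1.0, 1.0])
print([round(abs(value), 6) for value in spectrum])
```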
(2) The spectrum obtained from the discrete Fourier transform is divided into sub-bands. 33 non-overlapping bands are selected in the range 20 Hz to 4000 Hz, where the human ear does most of its discrimination of audio. The bands are equally spaced on a logarithmic scale, because the ear's response to frequency is not linear. The start frequency of the m-th sub-band, which is also the end frequency of the (m-1)-th sub-band, is:
$$f(m) = F_{\min}\left(\frac{F_{\max}}{F_{\min}}\right)^{(m-1)/M}, \quad m = 1, 2, \dots, M+1$$
where F_min is the lower mapping limit (20 Hz here), F_max is the upper mapping limit (4000 Hz here), and M is the number of sub-bands (33 here).
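The band edges follow from equal logarithmic spacing between the two limits. In this sketch the returned list is 0-based, so `edges[0]` is the lower limit and `edges[M]` the upper limit; that indexing and the function name `subband_edges` are assumptions made for illustration.

```python
def subband_edges(f_min=20.0, f_max=4000.0, M=33):
    """Return M + 1 logarithmically spaced band edges in Hz.

    Band number b (0-based) spans edges[b] to edges[b + 1].
    """
    ratio = f_max / f_min
    return [f_min * ratio ** (m / M) for m in range(M + 1)]


edges = subband_edges()
print(len(edges), round(edges[1] / edges[0], 4))
```

Every band is wider than the previous one by the same factor, (4000/20)^(1/33), which is about 1.174.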
(3) The energy of each of the 33 non-overlapping sub-bands is computed for every frame. Let the m-th sub-band start at frequency f(m) and end at f(m+1), and let X(k) be the frequency-domain signal of the frame after the discrete Fourier transform. The energy of the m-th sub-band of the n-th frame is:
$$E(n,m) = \sum_{k \,\in\, [f(m),\, f(m+1))} \left|X(k)\right|^2$$
where the sum runs over the frequency bins k of frame n that lie between f(m) and f(m+1).
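The sub-band energy can be sketched as a sum of squared magnitudes over the bins that fall inside each band. The mapping from bin index to frequency (k·fs/T), the half-open band intervals, and the function name `band_energies` are assumptions; the text does not say how bins on a band edge are assigned.

```python
def band_energies(spectrum, edges, fs=8000):
    """Sum |X(k)|^2 over the DFT bins that fall inside each band.

    spectrum: complex DFT bins X(0..T-1) of one frame.
    edges: M + 1 band edges in Hz; band m covers [edges[m], edges[m + 1]).
    Assumption: the top edge is included in the last band.
    """
    T = len(spectrum)
    M = len(edges) - 1
    energies = [0.0] * M
    for k in range(T // 2 + 1):          # non-negative frequencies only
        freq = k * fs / T
        for m in range(M):
            is_last = (m == M - 1)
            if edges[m] <= freq < edges[m + 1] or (is_last and freq == edges[m + 1]):
                energies[m] += abs(spectrum[k]) ** 2
                break
    return energies


# Toy example: 8 bins at 8 kHz give bin frequencies 0, 1000, ..., 4000 Hz.
toy_spectrum = [1 + 0j] * 8
print(band_energies(toy_spectrum, [500.0, 1500.0, 4000.0]))
```

One practical point: with a 512-sample frame at 8 kHz the bin spacing is 15.625 Hz, so the narrowest low-frequency bands of a 33-band logarithmic split starting at 20 Hz may contain no bins at all. How the source handles that case is not stated.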
(4) A sub-fingerprint is generated for each frame. Bit-difference decisions on the 33 sub-band energies of a frame produce a 32-bit binary code, which is that frame's sub-fingerprint. With E(n,m) the energy of the m-th sub-band of the n-th frame and F(n,m) the corresponding binary bit, the decision rule for each frame is:
$$F(n,m) = \begin{cases} 1, & E(n,m) - E(n,m+1) - \bigl(E(n-1,m) - E(n-1,m+1)\bigr) > 0 \\ 0, & \text{otherwise} \end{cases}$$
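A sketch of the bit decision follows. It assumes the common rule that compares the energy difference of adjacent bands in the current frame with the same difference in the previous frame; that rule, the handling of the first frame (left to the caller), and the function name `sub_fingerprint` are assumptions rather than details confirmed by the text.

```python
def sub_fingerprint(prev_energies, cur_energies):
    """Derive len(cur_energies) - 1 bits from two consecutive frames.

    Assumed rule: bit m is 1 when
    (E(n,m) - E(n,m+1)) - (E(n-1,m) - E(n-1,m+1)) > 0, else 0.
    """
    bits = []
    for m in range(len(cur_energies) - 1):
        diff = (cur_energies[m] - cur_energies[m + 1]) - (
            prev_energies[m] - prev_energies[m + 1]
        )
        bits.append(1 if diff > 0 else 0)
    return bits


# Toy example with 3 bands, which gives 2 bits; 33 bands would give 32 bits.
print(sub_fingerprint([0.0, 0.0, 0.0], [3.0, 2.0, 5.0]))
```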
As the formula shows, each frame yields a 32-dimensional binary sub-fingerprint. A single sub-fingerprint carries little information, so an audio fingerprint feature usually consists of many sub-fingerprints.
In practical applications such as audio retrieval and audio recognition, the 32-dimensional binary sub-fingerprint of a single frame carries too little information to retrieve or identify the target audio accurately. Audio fingerprint blocks are therefore used instead. A fingerprint block is made up of at least 256 sub-fingerprints; Figure 2 shows a 32-dimensional audio fingerprint block extracted from a piece of audio.
Step 3: Reduce the dimensionality of the audio fingerprint features
After the extraction process above, each frame of audio yields a 32-dimensional binary sub-fingerprint. The fingerprint of a piece of audio consists of many such sub-fingerprints, so the amount of fingerprint data is still large, and in practice it is desirable to reduce the fingerprint dimensionality further. This design proposes a dimensionality reduction method based on the discrete Gini coefficient: the coefficient is computed for every fingerprint dimension, and the dimensionality is reduced according to the size of the coefficients.
The data of each fingerprint dimension is grouped, for example 50 frames per group (the group size is not limited to 50), and the discrete Gini coefficient of each dimension is computed. The coefficient reflects how dispersed the data of a dimension is, that is, how much the data in that dimension varies. The larger the discrete Gini coefficient of a bit, the more different audio differs in that bit and the better the bit discriminates; a smaller coefficient indicates poorer discrimination. The design keeps the dimensions that discriminate well and removes those that discriminate poorly, converting the 32-dimensional fingerprint to a lower dimensionality and reducing the amount of fingerprint data.
Step 3.1: Compute the discrete Gini coefficient of each fingerprint dimension
The specific steps are as follows:
(1) Compute the discrete Lorenz curve of the audio fingerprint. The discrete Lorenz curve is the key curve for obtaining the discrete Gini coefficient. It is formed from the cumulative fingerprint-data proportion vector $W^j$, where j is the index of the fingerprint dimension, j = 1, 2, …, 32. The vector $W^j$ is computed as follows.
The audio fingerprints of each class in the fingerprint library are processed frame by frame. Every 50 frames of fingerprint data form one group, giving N groups in total, and the j-th dimension cumulative fingerprint data vector is constructed:
$$X^j = \left(x_1^j, x_2^j, \dots, x_N^j\right)$$
where i is the group number, i = 1, 2, …, N, and $x_i^j$ is the j-th dimension fingerprint data accumulated over group i.
The j-th dimension cumulative fingerprint-data proportion vector $W^j = (W_1^j, W_2^j, \dots, W_N^j)$ is then constructed, with elements defined as:
$$W_i^j = \frac{\sum_{k=1}^{i} x_k^j}{\sum_{k=1}^{N} x_k^j}, \quad i = 1, 2, \dots, N$$
The curve traced by the elements of the proportion vector is the discrete Lorenz curve. Specifically, the discrete Lorenz curve of the j-th fingerprint dimension is drawn as follows:
The cumulative proportion of groups, i/N, is the abscissa, and the cumulative proportion of j-th dimension fingerprint data, $W_i^j$, is the ordinate. Each element of the proportion vector is plotted as a discrete point, and the polyline joining the points is the discrete Lorenz curve of the j-th dimension, with coordinates $(i/N, W_i^j)$. For example, Figure 3 shows the Lorenz curve of the first fingerprint dimension for the abnormal-sound class.
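The construction of the Lorenz points for one dimension can be sketched as below. Several details are assumptions made for illustration: the per-group datum is taken to be the number of set bits in the group, the groups are sorted in ascending order before accumulation (the usual convention for Lorenz curves, not stated in the surviving text), and a trailing partial group is dropped. The function name `lorenz_points` is hypothetical.

```python
def lorenz_points(bits_j, group_size=50, sort_groups=True):
    """Return the discrete Lorenz points (i/N, cumulative share) for one dimension.

    bits_j: the value (0 or 1) of one fingerprint bit in every frame.
    Assumptions: each group contributes its count of set bits, groups are
    sorted ascending when sort_groups is True, a partial last group is dropped.
    """
    n_groups = len(bits_j) // group_size
    sums = [
        sum(bits_j[g * group_size:(g + 1) * group_size]) for g in range(n_groups)
    ]
    if sort_groups:
        sums.sort()
    total = sum(sums)
    if total == 0:
        raise ValueError("dimension has no set bits; cumulative share is undefined")
    points = []
    running = 0
    for i, value in enumerate(sums, start=1):
        running += value
        points.append((i / n_groups, running / total))
    return points


# Toy example with groups of 2 frames instead of 50.
print(lorenz_points([1, 1, 0, 1], group_size=2))
```

The last point is always (1, 1), and the origin (0, 0) is the implicit starting point of the curve.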
(2) Compute the discrete Gini coefficient of each dimension for each class of audio fingerprint. With the discrete Lorenz curve above as the dividing line, the Gini coefficient of the j-th fingerprint dimension is:
$$G_j = \frac{S_a}{S_a + S_b}$$
where S_a is the area enclosed by the diagonal segment OA and the discrete Lorenz curve, with O at (0,0) and A at (1,1); S_b is the area enclosed by the segments OB and BA and the discrete Lorenz curve, with B at (1,0); and G_j is the Gini coefficient of the j-th fingerprint dimension. Figure 4 shows the auxiliary diagram for the calculation.
S_a + S_b is the area of the triangle enclosed by the diagonal OA and the segments OB and BA, so S_a + S_b = 1/2. Because audio fingerprints are discrete, the formula is discretized as:
$$G_j = 1 - \frac{1}{N}\left(2\sum_{i=1}^{N-1} W_i^j + 1\right)$$
This gives the discrete Gini coefficient G_j of the j-th fingerprint dimension, where N is the number of groups, i is the group number, and $W_i^j$ is the cumulative proportion of j-th dimension fingerprint data up to group i.
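The discretization can be checked numerically. Approximating the area under the Lorenz curve by trapezoids of width 1/N, with the curve starting at 0 and ending at 1, gives S_b = (2·ΣW_i + 1)/(2N) with the sum over i = 1..N-1, and hence G = 1 - (2·ΣW_i + 1)/N. The sketch below computes the coefficient both ways; the closed form and the function names are this sketch's assumptions about the lost formula.

```python
def discrete_gini(cum_shares):
    """Closed form: G = 1 - (2 * sum(W_1..W_{N-1}) + 1) / N.

    cum_shares: cumulative proportions W_1..W_N, with W_N equal to 1.
    """
    n = len(cum_shares)
    return 1.0 - (2.0 * sum(cum_shares[:-1]) + 1.0) / n


def gini_by_trapezoids(cum_shares):
    """Same quantity from the areas: G = S_a / (S_a + S_b)."""
    n = len(cum_shares)
    prev = 0.0
    area_b = 0.0
    for w in cum_shares:
        area_b += (prev + w) / (2.0 * n)   # trapezoid of width 1/N
        prev = w
    area_a = 0.5 - area_b
    return area_a / (area_a + area_b)


equal = [0.25, 0.5, 0.75, 1.0]       # every group holds the same share
concentrated = [0.0, 0.0, 0.0, 1.0]  # all data sits in the last group
print(discrete_gini(equal), discrete_gini(concentrated))
```

Equal groups give a coefficient of 0, and the most concentrated case gives 1 - 1/N, so the discrete coefficient approaches 1 only as the number of groups grows.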
Step 3.2: Compile the per-dimension discrete Gini coefficients for different audio types and reduce the fingerprint dimensionality
The per-dimension discrete Gini coefficients are trained for the application scenario. Training sets are built from the audio data by type, or from existing data sets; for each training set the discrete Gini coefficient of every fingerprint dimension is computed and analyzed statistically.
The discrete Gini coefficient designed in this invention applies to dimensionality reduction of all kinds of audio fingerprints. The following audio classes are used here to analyze the per-dimension coefficients:
Figure 5 shows the discrete Gini coefficient of each fingerprint dimension for abnormal sounds, computed over the 32 dimensions from the audio fingerprint library. By setting a threshold on the coefficient, dimensions below the threshold are discarded and dimensions above it are kept. For the abnormal-sound class the threshold lies in the range 0.36 to 0.38. The figure shows that the coefficients differ across dimensions: some are small, such as dimensions 2, 25, and 26, and some are large, such as dimensions 11, 20, and 28. The size of the coefficient indicates how much the fingerprint data in that dimension varies. If the threshold is set to 0.37, dimensions 2, 5, 24, 25, and 26 fall below it, which indicates that the fingerprint data of these five dimensions varies little from frame to frame, so they can be discarded. Discarding them reduces the dimensionality of each frame's fingerprint and hence the data volume of the fingerprint library.
Figure 6 shows the discrete Gini coefficient of each fingerprint dimension for normal speech, computed over the 32 dimensions from the audio fingerprint library. As before, dimensions below a threshold are discarded and dimensions above it are kept, and the threshold for the normal-speech class lies in the range 0.36 to 0.38. The figure shows small coefficients in dimensions 2, 22, and 25 and large coefficients in dimensions such as 18, 26, and 28. If the threshold is set to 0.37, dimensions 2, 22, and 25 fall below it, which indicates that the fingerprint data of these three dimensions varies little from frame to frame, so they can be discarded.
Figure 7 shows the discrete Gini coefficient of each fingerprint dimension for songs, computed over the 32 dimensions from the audio fingerprint library. Dimensions below a threshold are discarded and dimensions above it are kept, and the threshold for the song class lies in the range 0.36 to 0.38. The figure shows that the coefficients of this class differ more between dimensions and are more widely scattered: they are small in dimensions 2, 14, 15, and 25 and large in dimensions 4, 5, 11, 18, and 29. If the threshold is set to 0.37, dimensions 1, 2, 14, 15, 25, and 26 fall below it, which indicates that the fingerprint data of these six dimensions varies little from frame to frame, so they can be discarded.
These figures show that, although the discrete Gini coefficients and their trend across dimensions differ between audio classes, every class has a few dimensions with small coefficients. The fingerprint discriminates poorly in those dimensions, so their data can be removed, converting the 32-dimensional fingerprint to a lower dimensionality and reducing the amount of fingerprint data.
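The thresholding step can be sketched as a simple filter over the per-dimension coefficients. The coefficient values in the example are hypothetical, not read from the figures, and keeping a dimension whose coefficient exactly equals the threshold is an assumption, since the text only distinguishes below and above. The function names are illustrative.

```python
def select_dimensions(gini_by_dim, threshold=0.37):
    """Return the 1-based indices of the dimensions to keep.

    Assumption: a coefficient exactly equal to the threshold is kept.
    """
    return [j for j, g in enumerate(gini_by_dim, start=1) if g >= threshold]


def reduce_sub_fingerprint(bits, kept_dims):
    """Keep only the bits of one sub-fingerprint that belong to kept_dims."""
    return [bits[j - 1] for j in kept_dims]


# Hypothetical coefficients for a 4-dimensional toy fingerprint.
toy_gini = [0.40, 0.30, 0.37, 0.38]
kept = select_dimensions(toy_gini)
print(kept, reduce_sub_fingerprint([1, 0, 1, 1], kept))
```

Applied to the examples in the text with a threshold of 0.37, the abnormal-sound class drops 5 of 32 dimensions and keeps 27 bits per frame, normal speech keeps 29, and songs keep 26. The same kept-dimension list has to be applied to both the library fingerprints and the query fingerprints of a class so that the reduced bits stay aligned.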
Claims (2)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910784077.5A CN110600038B (en) | 2019-08-23 | 2019-08-23 | A Dimensionality Reduction Method of Audio Fingerprint Based on Discrete Gini Coefficient |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN110600038A (en) | 2019-12-20 |
| CN110600038B (en) | 2022-04-05 |
Family
ID=68855402
Family Applications (1)
| Application Number | Status | Priority Date | Filing Date | Title |
|---|---|---|---|---|
| CN201910784077.5A, CN110600038B (en) | Expired - Fee Related | 2019-08-23 | 2019-08-23 | A Dimensionality Reduction Method of Audio Fingerprint Based on Discrete Gini Coefficient |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN110600038B (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6363342B2 (en) * | 1998-12-18 | 2002-03-26 | Matsushita Electric Industrial Co., Ltd. | System for developing word-pronunciation pairs |
| CN108122562A (en) * | 2018-01-16 | 2018-06-05 | 四川大学 | A kind of audio frequency classification method based on convolutional neural networks and random forest |
| CN109447180A (en) * | 2018-11-14 | 2019-03-08 | 山东省通信管理局 | A kind of fooled people's discovery method of the telecommunication fraud based on big data and machine learning |
| CN109493886A (en) * | 2018-12-13 | 2019-03-19 | 西安电子科技大学 | Speech-emotion recognition method based on feature selecting and optimization |
2019-08-23: CN application CN201910784077.5A, patent CN110600038B (en), not active, Expired - Fee Related
Non-Patent Citations (2)
| Title |
|---|
| VIPIN KUMAR: "Mood Classifiaction of Lyrics using SentiWordNet", 《2013 INTERNATIONAL CONFERENCE ON COMPUTER COMMUNICATION AND INFORMATICS》 * |
| 候智庭: "时间序列遥感数据植被分类中的特", 《中国优秀硕士学位论文全文数据库 基础科学辑》 * |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111063360A (en) * | 2020-01-21 | 2020-04-24 | 北京爱数智慧科技有限公司 | Voiceprint library generation method and device |
| CN111063360B (en) * | 2020-01-21 | 2022-08-19 | 北京爱数智慧科技有限公司 | Voiceprint library generation method and device |
| CN111612038A (en) * | 2020-04-24 | 2020-09-01 | 平安直通咨询有限公司上海分公司 | Abnormal user detection method and device, storage medium and electronic equipment |
| CN111612038B (en) * | 2020-04-24 | 2024-04-26 | 平安直通咨询有限公司上海分公司 | Abnormal user detection method and device, storage medium and electronic equipment |
| CN113421585A (en) * | 2021-05-10 | 2021-09-21 | 云境商务智能研究院南京有限公司 | Audio fingerprint database generation method and device |
| CN115277322A (en) * | 2022-07-13 | 2022-11-01 | 金陵科技学院 | CR Signal Modulation Identification Method and System Based on Graph and Persistent Entropy Features |
| CN115277322B (en) * | 2022-07-13 | 2023-07-28 | 金陵科技学院 | CR Signal Modulation Recognition Method and System Based on Graph and Persistence Entropy Features |
Also Published As
| Publication number | Publication date |
|---|---|
| CN110600038B (en) | 2022-04-05 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |
| | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20220405 |