
WO2008148289A1 - An intelligent audio identifying system and method - Google Patents

An intelligent audio identifying system and method

Info

Publication number
WO2008148289A1
WO2008148289A1 PCT/CN2008/000765 CN2008000765W WO2008148289A1 WO 2008148289 A1 WO2008148289 A1 WO 2008148289A1 CN 2008000765 W CN2008000765 W CN 2008000765W WO 2008148289 A1 WO2008148289 A1 WO 2008148289A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
feature vector
data
audio
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2008/000765
Other languages
French (fr)
Chinese (zh)
Inventor
Yangsheng Xu
Jianzhao Qin
Jun Cheng
Xinyu Wu
Chong Guo Li
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Publication of WO2008148289A1 publication Critical patent/WO2008148289A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use

Definitions

  • The present invention relates to a system and method for automatically recognizing audio data.
  • Hearing is one of the important sources through which humans obtain information about the outside world, and an important channel for judging what is happening around them: on hearing a dog bark, one can judge that a dog may be nearby; on hearing a scream, one can judge that someone nearby may be hurt. Analysis of audio can therefore provide much important information. At present, most audio-based analysis systems mainly pre-process the collected raw audio, for example by removing noise or by extracting or enhancing audio with specified characteristics, but the final recognition of the audio still requires human participation. Many applications, however, call for different sounds to be recognized automatically.
  • The technical problem addressed by the present invention is to provide an intelligent audio recognition system and method for automatically identifying audio data.
  • An intelligent audio recognition method includes the following steps:
  • A. Collect various kinds of sample audio data and label the collected samples;
  • B. Extract, one by one, feature vectors that reflect the essential characteristics of the sample audio data;
  • C. Divide the feature space into category regions according to the feature vectors, so that each region contains as many feature vectors of its own class as possible, and build a classifier mapping feature vectors to categories;
  • D. Process the audio data to be identified and extract its feature vector;
  • E. Input the feature vector of the audio data to be identified into the classifier; the classifier makes a decision based on the feature vector and obtains the identification result for that audio data.
  • Step B includes: B1. pre-process the sample audio data to obtain training data; B2. extract feature components reflecting the essential characteristics of the training data; B3. combine the feature components to obtain the feature vector.
  • Step D includes: D1. pre-process the audio data to be identified to obtain identification data; D2. extract feature components reflecting its essential characteristics; D3. combine the feature components to obtain the feature vector.
  • The feature components in steps B2 and D2 include the center frequency of the audio, the energy of the audio in certain frequency bands, or the distribution of the audio energy over several time periods.
  • The feature vector in steps B3 and D3 is the vector combining the center frequency of the audio with the sum of the audio energy spectrum over certain frequency bands.
  • The category regions in step C are divided according to the values of the feature vectors and are bounded by curves or surfaces.
  • Step E includes the following processing: E1. the classifier returns, for the input feature vector, the category assigned to the audio data together with a rejection index, a parameter that measures the credibility of the classification result; E2. the credibility of the classification result is judged from the rejection index: when the rejection index is above a preset threshold, the result is considered credible and the classifier gives the category of the audio data to be identified; when it is below the threshold, the classifier still gives the category but indicates that the result is not trustworthy.
  • Step A includes listening to the collected sample audio data and determining and labeling what sound each sample is.
  • An intelligent audio recognition system includes an audio data set for collecting and storing various types of sample audio data, a training unit, and an identification unit. The training unit extracts the feature vectors of the sample audio data and finds and establishes the mapping between sample feature vectors and their categories; the identification unit stores the data describing the established mapping between feature vectors and categories, extracts the feature vector of the audio data to be identified, and gives the identification result according to that feature vector.
  • The training unit includes a first pre-processing module, a first feature extraction module, and a training module.
  • The first pre-processing module denoises the sample audio data to obtain training data.
  • The first feature extraction module extracts the feature vectors of the sample audio data from the training data, and the training module finds and establishes the mapping from sample feature vectors to their categories.
  • The identification unit comprises a second pre-processing module, a second feature extraction module, and a classifier; the second pre-processing module denoises the audio data to be identified to obtain identification data.
  • The second feature extraction module extracts the feature vector of the audio data to be identified from the identification data; the classifier stores the data describing the mapping between feature vectors and categories output by the training module and outputs the identification result for the input feature vector of the audio data to be identified. A structural sketch of this arrangement is given below.
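Below is a minimal structural sketch of the arrangement just described, written in Python and assuming scikit-learn is available; a linear SVM stands in for the classifier, the class and attribute names mirror the description but are otherwise hypothetical, and the pre-processing and feature-extraction steps are passed in as callables.

```python
import numpy as np
from sklearn.svm import SVC

class TrainingUnit:
    """Audio data set -> pre-process, extract features, train the classifier."""
    def __init__(self, preprocess, extract):
        self.preprocess = preprocess        # first pre-processing module
        self.extract = extract              # first feature extraction module
        # a one-against-one linear SVM plays the role of the trained mapping
        self.classifier = SVC(kernel="linear", decision_function_shape="ovo")

    def train(self, samples, labels):
        X = np.vstack([self.extract(self.preprocess(s)) for s in samples])
        self.classifier.fit(X, labels)      # learn the feature-vector -> category mapping
        return self.classifier

class IdentificationUnit:
    """Stores the trained mapping and identifies new audio."""
    def __init__(self, preprocess, extract, classifier):
        self.preprocess = preprocess        # second pre-processing module
        self.extract = extract              # second feature extraction module
        self.classifier = classifier        # classifier holding the trained mapping

    def identify(self, audio):
        f = self.extract(self.preprocess(audio)).reshape(1, -1)
        return self.classifier.predict(f)[0]  # identification result (category label)
```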
  • The invention has the beneficial effect that the intelligent audio recognition system and method can automatically identify audio data, and the system has good real-time performance and extensibility.
  • FIG. 1 is a block diagram of the system of the present invention
  • Figure 2 is a block diagram of a training unit of the present invention
  • FIG. 3 is a block diagram of an identification unit of the present invention.
  • FIG. 4 is a schematic diagram of establishing a mapping relationship between feature vectors and categories when the sample audio data is of four types
  • FIG. 5 is a schematic diagram of establishing a mapping relationship between feature vectors and categories when the sample audio data is of two types.
  • As shown in FIG. 1, the intelligent audio recognition system includes at least an audio data set 1 for collecting and storing various types of sample audio data, a training unit 2, and an identification unit 3.
  • The training unit 2 extracts the feature vectors of the sample audio data and finds and establishes the mapping from sample feature vectors to their categories.
  • The identification unit 3 stores the data describing the established mapping between feature vectors and categories, extracts the feature vector of the audio data to be identified, and gives the identification result according to that feature vector.
  • The training unit includes a first pre-processing module 21, a first feature extraction module 22, and a training module 23.
  • The identification unit includes a second pre-processing module 31, a second feature extraction module 32, and a classifier 33.
  • The audio data set 1 is built to provide the learning samples needed by the training unit 2.
  • The user collects audio data according to the categories of audio that need to be recognized.
  • The data set can be built by making recordings, collecting audio material from the Internet, purchasing audio CDs, and other means of gathering learning samples.
  • Multiple samples are needed for each type of audio, and during collection each sample must be manually labeled: a person listens to the collected sample and determines what sound it is. To ensure good recognition performance, as many samples as possible should be collected.
  • In the training unit 2, the collected sample audio data is first pre-processed: the pre-processing module 21 removes noise from the sample audio data from the audio data set 1 and separates the target audio from its complex background, yielding the processed training data.
  • The feature extraction module 22 then extracts from the training data the components that reflect the essential characteristics of the sample audio, such as the center frequency of the audio, the energy of the audio in certain frequency bands (obtainable, for example, by Fourier transforming the audio signal), or the distribution of the audio energy over several time periods, and combines these features into a feature vector.
  • For example, if the center frequency of a sample is 33 and the sum of the energy spectrum over a given frequency band is 1000, the resulting feature vector is (33, 1000); a minimal sketch of such an extraction follows.
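A minimal sketch of this feature extraction, assuming a mono signal `x` sampled at `sr` Hz; the band edges and the choice of exactly two components (an energy-weighted center frequency plus one band's energy sum) are illustrative assumptions, not the patent's fixed parameters.

```python
import numpy as np

def extract_feature_vector(x, sr, band=(300.0, 3000.0)):
    spectrum = np.abs(np.fft.rfft(x)) ** 2                      # energy spectrum via FFT
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    center_freq = np.sum(freqs * spectrum) / np.sum(spectrum)   # energy-weighted center frequency
    in_band = (freqs >= band[0]) & (freqs < band[1])
    band_energy = np.sum(spectrum[in_band])                     # energy-spectrum sum in the chosen band
    return np.array([center_freq, band_energy])                 # e.g. (33, 1000) in the text's example
```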
  • The training module 23 then uses the extracted feature vectors to train the classifier 33 used for audio recognition. Training the classifier means that the training module 23 finds, from the feature vectors of the N classes of sample audio data, classification curves or surfaces that separate the feature space into N classification regions, so that the feature vectors of each class of samples fall in their own region; the regions are divided according to the values of the feature vectors, i.e. a mapping from the feature-vector space to the categories is established.
  • For example, when the samples fall into four classes and the feature vectors are two-dimensional, the training module 23 is equivalent to finding two straight lines such that the feature vectors of the four classes of samples are distributed in the four regions separated by the two lines.
  • In FIG. 4, the triangles are the class-1 feature vectors obtained during training, the circles the class-2 feature vectors, the five-pointed stars the class-3 feature vectors, and the pentagons the class-4 feature vectors; line 1 and line 2 are the classification lines obtained from these four classes of feature vectors, and they divide the feature space into four subspaces.
  • The training module stores the trained data, that is, the established mapping between audio feature vectors and categories, in the classifier 33 (a toy sketch of such a stored mapping is given below).
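A toy illustration of the stored mapping for the FIG. 4 case, assuming two straight classification lines in a two-dimensional feature space; the line parameters and the assignment of sign patterns to categories are made up purely for illustration.

```python
import numpy as np

# each classification line is stored as (w, b); a point's category region is
# determined by the signs of w.x - b for the two lines, i.e. the stored
# feature-vector -> category mapping held by the classifier
lines = [(np.array([1.0, -0.5]), 0.0),   # classification line 1 (hypothetical)
         (np.array([0.2, 1.0]), 3.0)]    # classification line 2 (hypothetical)

def region(feature_vec):
    signs = tuple(int(np.dot(w, feature_vec) - b >= 0.0) for w, b in lines)
    # the four sign patterns correspond to the four category regions of FIG. 4
    return {(1, 0): 1, (1, 1): 2, (0, 1): 3, (0, 0): 4}[signs]

print(region(np.array([33.0, 1000.0])))  # category region of the example feature vector
```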
  • The principle for dividing the category regions in the method of the invention is that, after the feature-vector space is partitioned, each region should contain only the feature vectors of one class of samples, or as many of that class's feature vectors as possible and as few feature vectors of other classes as possible.
  • The identification unit 3 uses the classifier 33 trained by the training module 23 to obtain the identification result for the audio data to be identified.
  • The second pre-processing module 31 and the second feature extraction module 32 in the identification unit work in the same way as the first pre-processing module 21 and the first feature extraction module 22 in the training unit.
  • After the audio sample to be identified is obtained, it is first pre-processed by the pre-processing module 31 to obtain the processed identification data; the same feature extraction method as in the feature extraction module 22 is then applied to obtain the feature vector of the audio data to be identified. The extracted feature vector is fed to the classifier 33 (obtained from the training module 23), which outputs the recognition result according to the input feature vector. For example, when the feature vector to be classified lies in the region bounded by the upper half of line 1 and the lower half of line 2 (the hexagon in FIG. 4), it is assigned to class 1.
  • If the feature vector to be classified lies in the region bounded by the upper halves of line 1 and line 2 (the heptagon in FIG. 4), it is assigned to class 2, the class of the circular feature vectors; similarly, the octagon in the figure is assigned to class 3 and the six-pointed star to class 4.
  • Thus the classifier outputs, from the input feature vector, the identification result for the audio to be identified; the more sample audio is collected in the audio data set, the more classification regions can be divided, the finer the classification of the audio to be identified, and the closer the classification result is to the true sound category.
  • Commonly used classifiers include neural networks, support vector machines, AdaBoost, and so on.
  • The process of obtaining a linear classification surface with a classifier based on a linear support vector machine is described below.
  • Given feature vectors of two classes and their labels $(x_i, y_i)$, $i = 1, \ldots, l$, $y_i \in \{\pm 1\}$, the separating surface $w$ of the linear support vector machine can be obtained by solving the following optimization problem:

    $$\min_{w,\,b,\,\eta}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{l}\eta_{i} \qquad \text{s.t.}\quad y_{i}\,(w \cdot x_{i} - b) \ge 1 - \eta_{i},\quad \eta_{i} \ge 0,\quad i = 1, \ldots, l,$$

    where $C > 0$ is a fixed penalty parameter. For a new feature vector $x$, if $w \cdot x - b \ge 0$ the vector is assigned to class 1, and if $w \cdot x - b < 0$ to class $-1$; $|w \cdot x - b|$ can be used as the rejection index, and the classification is considered reliable when $|w \cdot x - b|$ is greater than a certain threshold.
  • The one-against-one method extends this to a $k$-class problem by constructing $k(k-1)/2$ classification surfaces, one for each pair of classes taken from the $k$ classes, using the two-class construction above.
  • Voting determines the category of a feature vector $x$: for the surface $w_{ij}, b_{ij}$ separating classes $i$ and $j$, a vote is cast for class $i$ if $w_{ij} \cdot x - b_{ij} \ge 0$ and for class $j$ otherwise; after all $k(k-1)/2$ surfaces have voted, the class with the most votes is the final result, and the values $|w_{ij} \cdot x - b_{ij}|$ accumulated by that class serve as its rejection index, the classification being considered reliable when this sum exceeds a threshold (a runnable sketch of this voting follows).
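A sketch of this one-against-one voting and rejection index, assuming the $k(k-1)/2$ pairwise surfaces are already available as `(i, j, w, b)` tuples from the pairwise training step; the threshold value is an assumption to be set experimentally as described below.

```python
import numpy as np
from collections import defaultdict

def ovo_classify(x, surfaces, threshold):
    votes = defaultdict(int)
    margins = defaultdict(float)
    for i, j, w, b in surfaces:             # one surface per pair of classes
        score = np.dot(w, x) - b
        winner = i if score >= 0.0 else j   # vote for class i if w.x - b >= 0, else class j
        votes[winner] += 1
        margins[winner] += abs(score)       # accumulate |w.x - b| as that class's rejection index
    best = max(votes, key=votes.get)        # class with the most votes wins
    reliable = margins[best] > threshold    # classification considered reliable above the threshold
    return best, margins[best], reliable
```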
  • The identification result output by the classifier includes the classification result (the category to which the audio to be identified belongs) and the rejection index. Taking the hexagonal feature vector in FIG. 4 as an example, since the hexagon lies in the class-1 region between the upper half of line 1 and the lower half of line 2, its category is class 1.
  • The rejection index is a parameter used to measure the credibility of the classification result.
  • For a classifier based on probability criteria, the output classification result is the probability of belonging to each class, and this probability can be used as the rejection index: if the probabilities of all classes in the output are below a certain value, classification of the sample is refused. For a classifier based on classification surfaces, the distance between the sample's feature vector and the nearest classification surface can be used as the rejection index: if this distance is below a certain value, classification of the sample is refused.
  • The rejection index is thus used to judge the credibility of the classification result.
  • The threshold can be set experimentally: for example, a small test set can be built and a threshold chosen that rejects most of the untrustworthy samples in that set; that value is then used as the threshold, as sketched below.
  • A threshold is preset; when the rejection index is greater than the preset threshold, the classification result given by the classifier is considered credible; when the rejection index is less than the preset threshold, the classification result is less reliable, and the classifier reports the category of the audio data to be identified while indicating that the result is not trustworthy.
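A minimal sketch of choosing and applying the rejection threshold in the way just described, assuming a small labeled test set for which the classifier's rejection indices (`margins`) and correctness flags (`correct`) have already been computed; the percentile used is an illustrative assumption.

```python
import numpy as np

def pick_threshold(margins, correct, percentile=90):
    wrong_margins = np.asarray(margins)[~np.asarray(correct)]   # rejection indices of misclassified test samples
    # choose a threshold that would flag most of these untrustworthy results
    return float(np.percentile(wrong_margins, percentile))

def report(category, margin, threshold):
    if margin > threshold:
        return category, "credible"
    return category, "result flagged as not trustworthy"        # category still reported, but flagged
```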
  • For example, as shown in FIG. 5, for a two-class problem with two-dimensional feature vectors, training a linear classifier is equivalent to finding a straight line such that the feature vectors of one class lie on one side of the line and those of the other class on the other side; a feature vector to the left of the line (the square in FIG. 5) is assigned to the class of the triangular feature vectors, and one to the right (the five-pointed star) to the class of the circular feature vectors.
  • The sign (positive or negative) of the dot product of the linear classification surface parameters (obtained from the normal vector of the classification surface) with the feature vector to be classified is used to distinguish its class.
  • The absolute value of the dot product is the rejection index, used to measure the credibility of the classification: the larger the rejection index (the absolute value of the dot product), the more credible the classification, and when the absolute value of the dot product is greater than the preset threshold, the classification is considered reliable.
  • In practice, the method and system can be used to identify the various different sounds occurring in nature, or the system can first be trained to identify a few specific sounds and then build further functionality on the identification results; a fast, extensible classifier is trained so that the system has good real-time performance and scalability.
  • The intelligent audio recognition system of the invention can be used for intelligent monitoring in many settings. For example, installed in an elevator, the system automatically recognizes abnormal sounds such as screams, fighting, and impacts and sends an alarm signal to the monitoring staff, improving the response time for handling abnormal situations in the elevator while reducing the workload of the elevator monitoring personnel.
  • The system can also be used for home monitoring. Installed indoors, it can recognize abnormal sounds that may occur in a room, such as breaking glass, a door being forced, explosions, or gunshots, and raise an alarm immediately when such sounds are recognized, effectively helping to prevent crimes such as break-ins through doors and windows.
  • The system can also be installed outdoors to automatically recognize weather-related sounds such as thunder, wind, and rain and to monitor weather conditions in real time.
  • The system can help wildlife researchers working in the field: wildlife biologists often need to spend weeks or even months tracking rare wild animals, and by scattering wireless sensors equipped with the system over a designated area, the call of a particular animal can be recognized and a signal sent when it is detected, helping the researchers track it.
  • The system can also be used for diagnosing mechanical faults. When a machine fails it emits sounds different from those of normal operation, and different faults produce different sounds; the system can learn from audio of several fault types and then, installed near the machine, recognize the machine's working sounds in real time.
  • The system can also be applied to Internet-based audio retrieval and audio-based scene analysis.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An intelligent audio identifying system and method are provided. The system includes an audio data set (1) for collecting and storing all kinds of audio data samples, a training unit (2), and an identifying unit (3). The training unit (2) extracts feature vectors of the audio data samples and finds and establishes the mapping between the sample feature vectors and their corresponding classes. The identifying unit (3) stores the data describing the established mapping between feature vectors and classes, extracts the feature vector of the audio data to be identified, and obtains the identification result according to that feature vector.

Description

An intelligent audio recognition system and recognition method

TECHNICAL FIELD

The present invention relates to a system and method for automatically recognizing audio data.

BACKGROUND

Hearing is one of the important sources through which humans obtain information about the outside world, and an important channel for judging what is happening around them. For example, on hearing a dog bark, one can judge that a dog may be nearby; on hearing a scream, one can judge that someone nearby may be hurt. Analysis of audio can therefore provide a great deal of important information. At present, most audio-based analysis systems mainly pre-process the collected raw audio, for example by removing noise or by extracting or enhancing audio with specified characteristics, but the final recognition of the audio still requires human participation. Yet many applications call for different sounds to be recognized automatically. For example, wildlife researchers working in the field must spend a great deal of time tracking rare wild animals; an automatic audio recognition system that recognizes the call of a particular animal and signals when it is detected would help them with this tracking. Likewise, if elevators or homes were equipped with automatic audio recognition, abnormal sounds such as screams, fighting, impacts, breaking glass, explosions, or gunshots could be recognized automatically and an alarm sent to monitoring staff, improving their response time to abnormal situations. Automatic audio recognition therefore has important and wide-ranging application value.

SUMMARY OF THE INVENTION

The technical problem to be solved by the present invention is to provide an intelligent audio recognition system and an automatic recognition method that automatically identify audio data.

The technical solution adopted by the present invention to solve the above technical problem is as follows:

An intelligent audio recognition method, comprising the following steps:

A. collecting various kinds of sample audio data and labeling the collected sample audio data;

B. extracting, one by one, feature vectors reflecting the essential characteristics of the sample audio data;

C. dividing the feature space into category regions according to the feature vectors, so that each of the divided regions contains as many feature vectors of its own class as possible, and building a classifier that maps feature vectors to their categories;

D. processing the audio data to be identified and extracting its feature vector;

E. inputting the feature vector of the audio data to be identified into the classifier, the classifier making a decision based on the feature vector and obtaining the identification result for the audio data to be identified.

In the method described, step B comprises the following steps:

B1. pre-processing the sample audio data to obtain training data;

B2. extracting from the training data feature components reflecting the essential characteristics of the training data;

B3. combining the feature components to obtain the feature vector.

In the method described, step D comprises the following steps:

D1. pre-processing the audio data to be identified to obtain identification data;

D2. extracting from the identification data feature components reflecting the essential characteristics of the identification data;

D3. combining the feature components to obtain the feature vector.

In the method described, the feature components in step B2 or D2 include the center frequency of the audio, the energy of the audio in certain frequency bands, or the distribution of the audio energy over several time periods.

In the method described, the feature vector in step B3 or D3 is the vector combining the center frequency of the audio with the sum of the audio energy spectrum over certain frequency bands.

In the method described, the category regions in step C are divided according to the values of the feature vectors and are bounded by curves or surfaces.

In the method described, step E comprises the following processing:

E1. inputting the feature vector of the audio data to be identified into the classifier, the classifier making a decision based on the feature vector and obtaining the classification result assigning the audio data to a category together with a rejection index, the rejection index being a parameter used to measure the credibility of the classification result;

E2. judging the credibility of the classification result from the rejection index: when the rejection index is above a preset threshold, the classification result is judged credible and the classifier gives the category of the audio data to be identified; when the rejection index is below the preset threshold, the classifier gives the category of the audio data to be identified while indicating that the classification result is not trustworthy.

In the method described, step A includes listening to the collected sample audio data and determining and labeling what sound each sample is.

An intelligent audio recognition system, comprising an audio data set for collecting and storing various types of sample audio data, a training unit, and an identification unit; the training unit is used to extract the feature vectors of the sample audio data and to find and establish the mapping from sample feature vectors to their categories; the identification unit is used to store the data describing the established mapping between audio feature vectors and categories, to extract the feature vector of the audio data to be identified, and to give the identification result according to that feature vector.

In the system described, the training unit includes a first pre-processing module, a first feature extraction module, and a training module; the pre-processing module denoises the sample audio data to obtain training data; the feature extraction module extracts the feature vectors of the sample audio data from the training data; and the training module finds and establishes the mapping from sample feature vectors to their categories.

In the system described, the identification unit includes a second pre-processing module, a second feature extraction module, and a classifier; the second pre-processing module denoises the audio data to be identified to obtain identification data; the second feature extraction module extracts the feature vector of the audio data to be identified from the identification data; and the classifier stores the data describing the mapping between feature vectors and categories output by the training module and outputs the identification result according to the input feature vector of the audio data to be identified.

The invention has the beneficial effect that, with the intelligent audio recognition system and method of the invention, audio data can be identified automatically, and the system has good real-time performance and extensibility.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of the system of the present invention;

FIG. 2 is a block diagram of the training unit of the present invention;

FIG. 3 is a block diagram of the identification unit of the present invention;

FIG. 4 is a schematic diagram of the mapping established between feature vectors and categories when the sample audio data falls into four classes; FIG. 5 is a schematic diagram of the mapping established between feature vectors and categories when the sample audio data falls into two classes.

DETAILED DESCRIPTION

The present invention is described in further detail below with reference to the accompanying drawings and embodiments.

As shown in FIG. 1, an intelligent audio recognition system includes at least an audio data set 1 for collecting and storing various types of sample audio data, a training unit 2, and an identification unit 3. The training unit 2 extracts the feature vectors of the sample audio data and finds and establishes the mapping from sample feature vectors to their categories; the identification unit 3 stores the data describing the established mapping between feature vectors and categories, extracts the feature vector of the audio data to be identified, and gives the identification result according to that feature vector. As shown in FIG. 2, the training unit includes a first pre-processing module 21, a first feature extraction module 22, and a training module 23. As shown in FIG. 3, the identification unit includes a second pre-processing module 31, a second feature extraction module 32, and a classifier 33.

The audio data set 1 is built to provide the learning samples needed by the subsequent training unit 2. The user collects audio data according to the categories of audio that need to be recognized. The data set can be built by making recordings, collecting audio material from the Internet, purchasing audio CDs, and other means of gathering learning samples. In general, multiple samples must be collected for each type of audio, and during collection each sample must be manually labeled: a person listens to the collected sample and determines what sound it is. To ensure good recognition performance, as many samples as possible should be collected.

In the training unit 2, the collected sample audio data is first pre-processed: the pre-processing module 21 removes noise from the sample audio data from the audio data set 1 and separates the target audio from its complex background, yielding the processed training data. Next, the feature extraction module 22 extracts from the training data the components that reflect the essential characteristics of the sample audio, such as the center frequency of the audio, the energy of the audio in certain frequency bands (obtainable by Fourier transforming the audio signal), or the distribution of the audio energy over several time periods, and combines these features into the corresponding feature vector. For example, if the center frequency of the sample audio is 33 and the sum of the energy spectrum over a given frequency band is 1000, the resulting feature vector is (33, 1000). Then, the training module 23 uses the extracted feature components to train the classifier 33 used for audio recognition. Training the classifier means that the training module 23 finds, from the feature vectors of the N classes of sample audio data, classification curves or surfaces separating N classification regions, so that the feature vectors of each class of samples are distributed in their own region; the regions are divided according to the values of the feature vectors, that is, a mapping from the feature-vector space to the categories is established. For example, when the sample audio data contains only four classes and the feature vectors are two-dimensional, the training module 23 is equivalent to finding two straight lines such that the feature vectors of the four classes of samples are distributed in the four regions separated by the two lines. As shown in FIG. 4, the triangles are the class-1 feature vectors obtained during training, the circles are the class-2 feature vectors, the five-pointed stars are the class-3 feature vectors, and the pentagons are the class-4 feature vectors; line 1 and line 2 are the classification lines obtained from these four classes of feature vectors, and they divide the feature space into four subspaces. The training module stores the trained data, that is, the data of the established mapping between audio feature vectors and categories, in the classifier 33.

The principle for dividing the category regions in the method of the invention is that, after the feature-vector space is partitioned, each of the resulting category regions should contain only feature vectors of the same class of samples, or as many of that class's feature vectors as possible and as few feature vectors of other classes as possible.

The function of the identification unit 3 is to use the classifier 33 trained by the training module 23 to obtain the identification result for the audio data to be identified. The second pre-processing module 31 and the second feature extraction module 32 in the identification unit work in the same way as the first pre-processing module 21 and the first feature extraction module 22 in the training unit.

After the audio sample to be identified is obtained, it is first pre-processed by the pre-processing module 31 to obtain the processed identification data. Then, the same feature extraction method as in the feature extraction module 22 is applied to the audio data to be identified to obtain its feature vector. The extracted feature vector is then used as the input of the classifier 33 (obtained from the training module 23), and the classifier outputs the recognition result according to the input feature vector. For example, when the feature vector to be classified lies in the region bounded by the upper half of line 1 and the lower half of line 2 (the hexagon in FIG. 4), the invention assigns it to class 1. If the feature vector to be classified lies in the region bounded by the upper halves of line 1 and line 2 (the heptagon in FIG. 4), the invention assigns it to class 2, the class of the circular feature vectors; by analogy, the octagon in the figure is assigned to class 3 and the six-pointed star to class 4.

It can thus be seen that the classifier gives the identification result for the audio to be identified according to the input feature vector; the more sample audio is collected in the audio data set, the more classification regions are divided, the finer the classification of the audio data to be identified, and the closer the classification result is to the true sound category.

In pattern classification systems, commonly used classifiers include neural networks, support vector machines, AdaBoost, and so on. The process by which a classifier based on a linear support vector machine obtains a linear classification surface is described below.

First, take the problem of distinguishing two classes as an example:

Given feature vectors of two classes and their labels $(x_i, y_i)$, $i = 1, \ldots, l$, $y_i \in \{\pm 1\}$, the separating surface $w$ of the linear support vector machine can be obtained by solving the following optimization problem:

$$\min_{w,\,b,\,\eta}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{l}\eta_{i} \qquad \text{s.t.}\quad y_{i}\,(w \cdot x_{i} - b) \ge 1 - \eta_{i},\quad \eta_{i} \ge 0,\quad i = 1, \ldots, l,$$

where $C > 0$ is a fixed penalty parameter. When a new feature vector $x$ is obtained, it is assigned to class 1 if $w \cdot x - b \ge 0$ and to class $-1$ if $w \cdot x - b < 0$. The absolute value $|w \cdot x - b|$ can be used as the rejection index; the classification is considered reliable when $|w \cdot x - b|$ is greater than a certain threshold.

The one-against-one method is then used to extend this to multi-class problems. For a $k$-class classification problem, the one-against-one method constructs $k(k-1)/2$ classification surfaces: every pair of classes is taken from the $k$ classes and the two-class construction above is applied, giving $k(k-1)/2$ surfaces. Voting is used to determine the category of a feature vector $x$: for the surface $w_{ij}, b_{ij}$ separating class $i$ and class $j$, a vote is cast for class $i$ if $w_{ij} \cdot x - b_{ij} \ge 0$, and otherwise for class $j$. After voting over all $k(k-1)/2$ surfaces, the class with the most votes is taken as the final classification result. At the same time, each class accumulates the values $|w_{ij} \cdot x - b_{ij}|$ of the surfaces that voted for it, and this sum is used as the rejection index; when it is greater than a certain threshold, the classification is considered reliable. The identification result output by the classifier thus includes the classification result (the category of the audio to be identified) and the rejection index. Taking the hexagonal feature vector in FIG. 4 as an example: since the hexagon lies in the class-1 region between the upper half of line 1 and the lower half of line 2, its category is class 1; the more central its position within the class-1 region, the more similar it is to the class-1 feature vectors and the more credible the classification result, while the closer it lies to a classification line, the less credible the result.

The rejection index is a parameter used to measure the credibility of the classification result. For a classifier based on probability criteria, the output classification result is the probability of belonging to each class, and this probability can be used as the rejection index: if the probabilities of all classes in the output are below a certain value, classification of the sample is refused. For a classifier based on classification surfaces, the distance between the sample's feature vector and the nearest classification surface can be used as the rejection index: if this distance is below a certain value, classification of the sample is refused.

The rejection index is used to judge the credibility of the classification result. In practice the threshold can be set experimentally (for example, a small test set can be built and a threshold found that rejects most of the untrustworthy samples in that set; that value is then taken as the threshold). A threshold is preset: when the rejection index is greater than the preset threshold, the classification result given by the classifier is credible; when the rejection index is less than the preset threshold, the classification result is less reliable, and the classifier indicates that the result is untrustworthy while still giving the category of the audio data to be identified. For example, as shown in FIG. 5, for a two-class problem with two-dimensional feature vectors, training a linear classifier is equivalent to finding a straight line such that the feature vectors of one class lie on one side of the line and those of the other class on the other side. The triangles in FIG. 5 are one class of feature vectors obtained during training, the circles are the other class, and the line is the classification line obtained from these two classes. When the feature vector of the audio data to be identified lies to the left of the line (the square in FIG. 5), it is assigned to the class of the triangular feature vectors; if it lies to the right of the line (the five-pointed star in FIG. 5), it is assigned to the class of the circular feature vectors. The sign (positive or negative) of the dot product of the linear classification surface parameters (obtained from the normal vector of the classification surface) with the feature vector to be classified is used to distinguish its class, while the absolute value of the dot product is the rejection index measuring the credibility of the classification: the larger the rejection index (the absolute value of the dot product), the more credible the classification, and when the absolute value of the dot product is greater than the preset threshold, the classification is considered reliable.

In practical applications, the method and system can be used to identify the various different sounds occurring in nature, or the system can first be trained to identify a few specific sounds and build further functionality on the identification results; a fast, extensible classifier is trained, ensuring that the system has good real-time performance and scalability.

The intelligent audio recognition system of the invention can be used for intelligent monitoring in many settings. For example, the system can be installed in an elevator to automatically recognize abnormal sounds such as screams, fighting, and impacts and send an alarm signal to the monitoring staff, improving the response time for handling abnormal situations in the elevator while reducing the workload of the elevator monitoring personnel. The system can also be used for home monitoring: installed indoors, it can recognize abnormal sounds that may occur in a room, such as breaking glass, a door being forced, explosions, or gunshots, and raise an alarm as soon as such sounds are recognized, effectively helping to prevent crimes such as break-ins through doors and windows. The system can also be installed outdoors to automatically recognize weather-related sounds such as thunder, wind, and rain and monitor weather conditions in real time. In addition, the system can help wildlife researchers working in the field: wildlife biologists often need to spend weeks or even months tracking rare wild animals, and by scattering wireless sensors equipped with the system over a designated area, the call of a particular animal can be recognized and a signal sent when it is detected, helping the researchers track it. The system can also be used for diagnosing mechanical faults: when a machine fails it emits sounds different from those of normal operation, and different faults produce different sounds, so the system can learn from audio of several fault types and then, installed near the machine, recognize the machine's working sounds in real time, raising an alarm and reporting the likely fault type when a fault sound is recognized; this helps people discover failures promptly and provides a basis for fault diagnosis. The system can also be applied to Internet-based audio retrieval and audio-based scene analysis.

It should be understood that those skilled in the art may make improvements or modifications in light of the above description, and all such improvements and modifications fall within the scope of protection of the appended claims of the present invention.

Claims

Claims (11)

1. An intelligent audio recognition method, comprising the following steps:
A. collecting various sample audio data and annotating the collected sample audio data;
B. extracting, one by one, from the sample audio data feature vectors that reflect their essential characteristics;
C. dividing category regions according to the feature vectors so that each divided category region contains as many feature vectors of that class of samples as possible, and establishing a classifier that maps feature vectors to their categories;
D. processing the audio data to be identified and extracting its feature vector;
E. inputting the feature vector of the audio data to be identified into the classifier, the classifier performing discrimination according to the feature vector to obtain an identification result for the audio data to be identified.

2. The method according to claim 1, wherein step B comprises the following steps:
B1. preprocessing the sample audio data to obtain training data;
B2. extracting, from the training data, feature components that reflect the essential characteristics of the training data;
B3. combining the feature components to obtain the feature vector.

3. The method according to claim 1, wherein step D comprises the following steps:
D1. preprocessing the audio data to be identified to obtain identification data;
D2. extracting, from the identification data, feature components that reflect the essential characteristics of the identification data;
D3. combining the feature components to obtain the feature vector.

4. The method according to claim 2 or 3, wherein the feature components in step B2 or D2 comprise: the center frequency of the audio, the energy characteristics of the audio within certain specific frequency bands, or the energy distribution characteristics of the audio over multiple time periods.

5. The method according to claim 4, wherein the feature vector in step B3 or D3 is the vector sum of the center frequency of the audio and the sums of the audio energy spectrum within certain specific frequency bands.

6. The method according to claim 5, wherein the category regions in step C are divided according to the values of the feature vectors and are delimited by curves or surfaces.

7. The method according to claim 6, wherein step E comprises the following processing:
E1. inputting the feature vector of the audio data to be identified into the classifier, the classifier performing discrimination according to the feature vector to obtain a classification result assigning the audio data to be identified to a category, together with a rejection index, the rejection index being a parameter that measures the credibility of the classification result;
E2. judging the credibility of the classification result according to the rejection index: when the rejection index is higher than a preset threshold, the classification result is judged credible and the classifier gives the category to which the audio data to be identified belongs; when the rejection index is lower than the preset threshold, the classifier gives the category of the audio data to be identified while indicating that the classification result is not credible.

8. The method according to claim 7, wherein step A comprises identifying the collected sample audio data and determining and annotating what sound the sample audio data is.

9. An intelligent audio recognition system, comprising: an audio data set for collecting and storing various types of sample audio data, a training unit, and an identification unit; the training unit is configured to extract the feature vectors of the sample audio data and to find and establish the mapping relationship from the sample audio data feature vectors to their categories; the identification unit is configured to store the data of the established mapping relationship between audio data feature vectors and categories, to extract the feature vector of the audio data to be identified, and to give an identification result according to the feature vector of the audio data to be identified.

10. The system according to claim 9, wherein the training unit comprises a first preprocessing module, a first feature extraction module and a training module; the first preprocessing module is configured to denoise the sample audio data to obtain training data; the first feature extraction module is configured to extract the feature vectors of the sample audio data from the training data; and the training module is configured to find and establish the mapping relationship from the sample audio data feature vectors to their categories.

11. The system according to claim 9 or 10, wherein the identification unit comprises a second preprocessing module, a second feature extraction module and a classifier; the second preprocessing module is configured to denoise the audio data to be identified to obtain identification data; the second feature extraction module is configured to extract the feature vector of the audio data to be identified from the identification data; and the classifier is configured to store the data of the mapping relationship between audio data feature vectors and categories output by the training module and to output an identification result according to the input feature vector of the audio data to be identified.
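Note: the claims above describe feature extraction (center frequency and band energies, claims 4 and 5) and threshold-based rejection of low-confidence classifications (claim 7) without fixing a concrete implementation. The following is a minimal illustrative sketch in Python of how such a pipeline could look. The band edges, the nearest-centroid classifier, the margin-based rejection index and all names (extract_features, NearestCentroidClassifier) are assumptions made for illustration only and are not taken from the patent.

import numpy as np

def extract_features(signal, sample_rate, bands=((0, 500), (500, 2000), (2000, 8000))):
    # Power spectrum of the (already denoised) audio segment.
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    # "Center frequency" taken here as the spectral centroid of the segment.
    centroid = float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12))
    # Energy within a few illustrative frequency bands (band edges are assumptions).
    band_energies = [float(spectrum[(freqs >= lo) & (freqs < hi)].sum()) for lo, hi in bands]
    return np.array([centroid] + band_energies)

class NearestCentroidClassifier:
    # Toy stand-in for the trained classifier: one mean feature vector per category,
    # plus a margin-based index used to flag unreliable results.
    def __init__(self, reject_threshold=0.2):
        self.reject_threshold = reject_threshold
        self.centroids = {}

    def fit(self, features, labels):
        labels = np.asarray(labels)
        for label in np.unique(labels):
            self.centroids[label] = features[labels == label].mean(axis=0)

    def predict(self, feature):
        # Distance to every category centroid; the closest category wins.
        distances = {lbl: float(np.linalg.norm(feature - c)) for lbl, c in self.centroids.items()}
        label = min(distances, key=distances.get)
        ordered = sorted(distances.values())
        # Rejection index: how much closer the winner is than the runner-up, scaled to [0, 1].
        index = 1.0 if len(ordered) < 2 else 1.0 - ordered[0] / (ordered[1] + 1e-12)
        return label, index, index >= self.reject_threshold

# Example with synthetic signals; the labels and sample rate are illustrative only.
rng = np.random.default_rng(0)
signals = np.vstack([rng.normal(size=(10, 16000)), 3.0 * rng.normal(size=(10, 16000))])
labels = ["bark"] * 10 + ["scream"] * 10
X = np.vstack([extract_features(s, 16000) for s in signals])
clf = NearestCentroidClassifier(reject_threshold=0.2)
clf.fit(X, labels)
print(clf.predict(extract_features(rng.normal(size=16000), 16000)))

A margin between the nearest and second-nearest category centroids is only one possible choice of rejection index; the patent leaves the concrete measure of credibility open.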
PCT/CN2008/000765 2007-06-07 2008-04-15 An intelligent audio identifying system and method Ceased WO2008148289A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200710075008.4 2007-06-07
CN 200710075008 CN101067930B (en) 2007-06-07 2007-06-07 An intelligent audio identification system and identification method

Publications (1)

Publication Number Publication Date
WO2008148289A1 true WO2008148289A1 (en) 2008-12-11

Family

ID=38880462

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2008/000765 Ceased WO2008148289A1 (en) 2007-06-07 2008-04-15 An intelligent audio identifying system and method

Country Status (2)

Country Link
CN (1) CN101067930B (en)
WO (1) WO2008148289A1 (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101067930B (en) * 2007-06-07 2011-06-29 深圳先进技术研究院 An intelligent audio identification system and identification method
CN101587710B (en) * 2009-07-02 2011-12-14 北京理工大学 Multiple-codebook coding parameter quantification method based on audio emergent event
CN102623007B (en) * 2011-01-30 2014-01-01 清华大学 Classification method of audio features based on variable duration
CN102664004B (en) * 2012-03-22 2013-10-23 重庆英卡电子有限公司 Forest theft behavior identification method
CN103198838A (en) * 2013-03-29 2013-07-10 苏州皓泰视频技术有限公司 Abnormal sound monitoring method and abnormal sound monitoring device used for embedded system
CN103743477B (en) * 2013-12-27 2016-01-13 柳州职业技术学院 Method and device for detecting and diagnosing mechanical faults
CN104464733B (en) * 2014-10-28 2019-09-20 百度在线网络技术(北京)有限公司 A kind of more scene management method and devices of voice dialogue
CN104700833A (en) * 2014-12-29 2015-06-10 芜湖乐锐思信息咨询有限公司 Big data speech classification method
CN106531191A (en) * 2015-09-10 2017-03-22 百度在线网络技术(北京)有限公司 Method and device for providing danger report information
CN105138696B (en) * 2015-09-24 2019-11-19 深圳市冠旭电子股份有限公司 A music push method and device
CN105679313A (en) * 2016-04-15 2016-06-15 福建新恒通智能科技有限公司 Audio recognition alarm system and method
CN107801090A (en) * 2017-11-03 2018-03-13 北京奇虎科技有限公司 Utilize the method, apparatus and computing device of audio-frequency information detection anomalous video file
CN108764304B (en) * 2018-05-11 2020-03-06 Oppo广东移动通信有限公司 Scene recognition method, device, storage medium and electronic device
CN108764114B (en) * 2018-05-23 2022-09-13 腾讯音乐娱乐科技(深圳)有限公司 Signal identification method and device, storage medium and terminal thereof
CN108764341B (en) * 2018-05-29 2019-07-19 中国矿业大学 A fault diagnosis method for rolling bearings under variable working conditions
CN110658006B (en) * 2018-06-29 2021-03-23 杭州萤石软件有限公司 Sweeping robot fault diagnosis method and sweeping robot

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001242880A (en) * 2000-03-01 2001-09-07 Nippon Telegr & Teleph Corp <Ntt> Signal detection method, signal search method and recognition method, and recording medium
CN1316726A (en) * 2000-02-02 2001-10-10 摩托罗拉公司 Method and device for speech recognition
CN1614685A (en) * 2004-09-29 2005-05-11 上海交通大学 Quick refusing method for non-command in inserted speech command identifying system
CN101067930A (en) * 2007-06-07 2007-11-07 深圳先进技术研究院 An intelligent audio identification system and identification method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1316726A (en) * 2000-02-02 2001-10-10 摩托罗拉公司 Method and device for speech recognition
JP2001242880A (en) * 2000-03-01 2001-09-07 Nippon Telegr & Teleph Corp <Ntt> Signal detection method, signal search method and recognition method, and recording medium
CN1614685A (en) * 2004-09-29 2005-05-11 上海交通大学 Quick refusing method for non-command in inserted speech command identifying system
CN101067930A (en) * 2007-06-07 2007-11-07 深圳先进技术研究院 An intelligent audio identification system and identification method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HUANG H. ET AL.: "A NEW EFFECTIVE METHOD ON AUDIO INFORMATION RETRIEVAL", COMPUTER APPLICATION INVESTIGATION, no. 3, 2004, pages 85 - 87 *
YANG X. ET AL.: "Multi-class signal classification algorithm based on wavelet subspace, SVM and fuzzy integral", INFORMATION AND CONTROL, vol. 36, no. 2, April 2007 (2007-04-01), pages 211 - 217 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184732A (en) * 2011-04-28 2011-09-14 重庆邮电大学 Fractal-feature-based intelligent wheelchair voice identification control method and system
CN111370025A (en) * 2020-02-25 2020-07-03 广州酷狗计算机科技有限公司 Audio recognition method and device and computer storage medium

Also Published As

Publication number Publication date
CN101067930A (en) 2007-11-07
CN101067930B (en) 2011-06-29

Similar Documents

Publication Publication Date Title
WO2008148289A1 (en) An intelligent audio identifying system and method
CN109309630B (en) Network traffic classification method and system and electronic equipment
CN102163427B (en) Method for detecting audio exceptional event based on environmental model
Carletti et al. Audio surveillance using a bag of aural words classifier
Conte et al. An ensemble of rejecting classifiers for anomaly detection of audio events
CN105424395A (en) Method and device for determining equipment fault
CN114137410A (en) Hydraulic mechanism circuit breaker fault identification method based on voiceprint detection technology
CN111460441A (en) A network intrusion detection method based on batch normalized convolutional neural network
CN118918926B (en) Baling event detection method and system based on acoustic event recognition and emotion recognition
Sharma et al. Two-stage supervised learning-based method to detect screams and cries in urban environments
CN110065867B (en) Method and system for evaluating elevator comfort level based on audio and video
Whitehill et al. Whosecough: In-the-wild cougher verification using multitask learning
CN119106360A (en) A machine equipment fault diagnosis system based on vibration signals
CN116416665B (en) Face recognition method and device based on security system and storage medium
Dong et al. At the speed of sound: Efficient audio scene classification
CN106251861A (en) A kind of abnormal sound in public places detection method based on scene modeling
CN115062725B (en) Hotel income anomaly analysis method and system
CN119646690A (en) A train plug door fault diagnosis method and system based on acoustic signals
CN119879334A (en) Multi-mode data fusion type museum air conditioner anomaly identification method
CN116488843B (en) A user behavior anomaly detection system and method based on cluster analysis
Zhao et al. Event classification for living environment surveillance using audio sensor networks
CN117219088A (en) Continuous cough voice recognition method for live pigs in complex environment
CN117692588A (en) An intelligent visual noise monitoring and traceability device
CN117782198A (en) A method and system for operating monitoring of highway electromechanical equipment based on cloud-edge architecture
Zhang et al. A study of sound recognition algorithm for power plant equipment fusing mfcc and imfcc features

Legal Events

Date Code Title Description
DPE2 Request for preliminary examination filed before expiration of 19th month from priority date (pct application filed from 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08733963

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08733963

Country of ref document: EP

Kind code of ref document: A1