
CN100508587C - A News Video Retrieval Method Based on Speech Classification Recognition - Google Patents


Info

Publication number
CN100508587C
Authority
CN
China
Prior art keywords
speech
frames
classification
standard
news video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2006100079659A
Other languages
Chinese (zh)
Other versions
CN1825936A (en)
Inventor
彭宇新
房翠华
陈晓鸥
吴於茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Peking University
Peking University Founder Research and Development Center
Original Assignee
BEIDA FANGZHENG TECHN INST Co Ltd BEIJING
Peking University
Peking University Founder Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIDA FANGZHENG TECHN INST Co Ltd BEIJING, Peking University, Peking University Founder Group Co Ltd filed Critical BEIDA FANGZHENG TECHN INST Co Ltd BEIJING
Priority to CNB2006100079659A priority Critical patent/CN100508587C/en
Publication of CN1825936A publication Critical patent/CN1825936A/en
Application granted granted Critical
Publication of CN100508587C publication Critical patent/CN100508587C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention relates to a news video retrieval method based on speech classification and recognition. The method first automatically segments out all speech segments of standard speech in a news video and then recognizes that standard speech with a speech recognition system. Because the standard speech conveys the main content of the video, the recognized text makes it straightforward to retrieve news video from text.

Description

A News Video Retrieval Method Based on Speech Classification Recognition

Technical field

The invention belongs to the technical field of computer speech recognition and video retrieval, and in particular relates to a news video retrieval method based on speech classification and recognition.

Background art

Speech recognition technology is now widely used, not only for audio but also for video, because video also carries audio information. If the speech in a video can be recognized, it provides strong support for video retrieval and makes retrieval from spoken text to video content possible. Existing video retrieval techniques generally extract low-level features such as color and texture from the video and retrieve on those features. This approach has two problems. (1) People search for video by high-level semantic concepts such as a football match, the Iraq War, or bird flu, which do not correspond to the low-level features, such as color and texture, that the computer uses to describe the video; the two cannot be reconciled. (2) Existing methods do not support text-to-video retrieval well, and their query mode does not match what users are accustomed to, which makes them inconvenient to use. In existing methods the user typically submits a query shot or query clip and the system returns results similar to that example, which raises the question of how the user obtains the query example in the first place. Most users are instead accustomed to typing query text and having the system return related video material. For example, a user who enters "Iraq War" expects the system to return videos related to the Iraq War, much as search engines such as Google and Baidu do, except that the input is text while the retrieved results are video.

To retrieve video content from spoken text in this way, the text in the video must be obtained, and one feasible way to obtain it is to use speech recognition to transcribe the speech in the video. To recognize the speech of different people, however, existing speech recognition systems often require each speaker to train the system first. This is hard to apply to audio containing many speakers, because it is difficult to get every speaker to train the system; even for audio with only a few speakers, the speakers often cannot be found. For news video it is impossible to have every speaker train the system. Moreover, even after training, non-standard speech remains hard to recognize and the recognition rate is very low. If a speech recognition system is applied directly to news video without speaker training, the results are worse still, because a news program usually contains the following kinds of sound: (1) news program previews with background music; (2) advertisements; (3) weather forecasts; (4) non-standard speech, such as the dialect of an interviewee; (5) standard speech. The recognition rate for non-standard speech is very low, and the rates for (1)-(3) are lower still, to the point that they are essentially unrecognizable. If a speech recognition system is applied indiscriminately to the whole news video, it attempts to recognize every kind of sound, so its output mixes correct results (mainly from the standard speech in (5)) with incorrect results (mainly from the other sounds in (1)-(4)), and the computer cannot tell which is which. Video retrieval based on that output, for example a query for videos matching the text "Iraq War", then returns many wrong results.

Summary of the invention

In view of the shortcomings of the prior art, the object of the invention is to provide a news video retrieval method based on speech classification and recognition that automatically recognizes standard speech, such as standard Mandarin, in news video without speaker training, thereby enabling text-to-video news retrieval.

To achieve this object, the invention adopts the following technical solution: a news video retrieval method based on speech classification and recognition, comprising the following steps:

(1) using a sound classifier to segment out the speech segments of standard speech in the news video, where standard speech means speech with standard pronunciation;

(2) using a speech recognition system to recognize the standard-speech segments of the news video and convert them into text;

(3) retrieving the corresponding video material according to the text obtained in step (2), thereby achieving retrieval from spoken text to news video.

Further, the standard speech of the invention is preferably Mandarin with standard pronunciation.

Further, in step (1), audio classification uses a classification model based on a support vector machine and consists of two parts: classifier model training and classification prediction. The audio feature extracted for classification is a 13-dimensional feature vector composed of log energy and Mel-scale Frequency Cepstral Coefficients (MFCC).

Still further, for better effect, the classifier model in step (1) is trained as follows: training samples are selected; the audio features composed of log energy and MFCC are extracted from each sample; all these features are written to a feature file; and a support vector machine is then used to generate the classifier model. The training samples cover five classes: 1) standard speech; 2) music; 3) background noise; 4) silence; 5) non-standard speech. Classification is performed frame by frame, with each audio frame assigned a class; the training samples are likewise labeled per frame, and the labeled classes are used to train the model.

Still further, for better effect, classification prediction in step (1) proceeds as follows: for the news video to be classified, the audio features composed of log energy and MFCC are extracted from the news audio, and the classifier model trained with the support vector machine is then used to label the audio automatically.

Still further, for better effect, the standard-speech segments initially segmented from the news video in step (1) are corrected: if one, or up to M, isolated frames of a different class suddenly appear within a run of consecutive frames of the same class, where M is a positive integer, those frames are judged to be misclassified and are relabeled with the class of the surrounding run. The reason is that a run of consecutive same-class frames is not expected to contain a few sporadic frames of another class.

Still further, for better effect, when the standard Mandarin segments initially segmented from the news video are corrected, M is chosen in practice to be at most 10: if 10 or fewer frames of a different class appear in the middle of a continuous stretch of same-class audio, those frames are judged to be misclassified.

The effect of the invention is as follows: compared with existing methods, the invention automatically recognizes the standard speech in news video without speaker training, obtains the text that reflects the main content of the news video, and uses that text to achieve text-to-video news retrieval, so that audio analysis and retrieval technology can play its full role in information retrieval.

The invention achieves this effect for the following reason:

The standard pronunciation of the announcer in a news video reflects the main content of that video, and existing speech recognition systems can recognize standard speech reasonably well without speaker training. The invention therefore first automatically segments out all standard-speech segments of the news video and then applies a speech recognition system to those segments, obtaining the text that reflects the main content of the news video and thereby achieving text-to-video news retrieval.

Brief description of the drawings

Fig. 1 is a flow chart of the invention.

Detailed description

The invention is described in further detail below with reference to the drawing and a specific embodiment.

As shown in Fig. 1, a news video retrieval method based on speech classification and recognition comprises the following steps:

(1) Use a sound classifier to segment out the speech segments of standard speech in the news video. In this embodiment, standard Mandarin is used as the example of standard speech.

Audio classification uses a classification model based on a support vector machine and consists of two parts: classifier model training and classification prediction. The audio feature is a 13-dimensional feature vector composed of log energy and Mel-frequency cepstral coefficients (MFCC).
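As an illustration of the framing behind these features, the sketch below splits audio into 512-sample frames and computes the log energy of each frame. It is a minimal sketch under stated assumptions, not the patent's implementation: the frames are taken as non-overlapping, the natural logarithm is used, and a small floor guards against silent frames, none of which the source specifies. The MFCC part of the vector is omitted here; presumably it supplies the remaining 12 of the 13 dimensions. The names `split_frames`, `log_energy` and `frame_duration_ms` are hypothetical.

```python
import math

FRAME_SIZE = 512      # samples per frame, as stated in the embodiment
SAMPLE_RATE = 22050   # sampling rate in Hz, as stated in the embodiment


def split_frames(samples, frame_size=FRAME_SIZE):
    """Split samples into non-overlapping frames (assumption: hop == frame size).

    A trailing partial frame is dropped.
    """
    n_frames = len(samples) // frame_size
    return [samples[i * frame_size:(i + 1) * frame_size] for i in range(n_frames)]


def log_energy(frame, floor=1e-10):
    """Natural log of the sum of squared samples; the floor avoids log(0)."""
    energy = sum(s * s for s in frame)
    return math.log(max(energy, floor))


def frame_duration_ms(frame_size=FRAME_SIZE, sample_rate=SAMPLE_RATE):
    """Duration of one frame in milliseconds."""
    return 1000.0 * frame_size / sample_rate
```

With the values given in the embodiment, `frame_duration_ms()` comes to roughly 23.2 ms, which matches the 23 ms frame length quoted in the text.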

In this embodiment, the classifier model is trained as follows: training samples are selected; the audio features composed of log energy and MFCC are extracted from each sample; all these features are written to a feature file; and a support vector machine is then used to generate the classifier model. The training samples cover five classes: 1) standard Mandarin; 2) music; 3) background noise; 4) silence; 5) non-standard Mandarin. Classification is performed frame by frame, with each audio frame assigned a class, and the training samples are likewise labeled per frame. However, because each audio frame lasts only 23 milliseconds (sampling rate 22050 Hz, 512 samples per frame), accurate manual labeling at that granularity is impossible. The invention therefore selects a stretch of audio and judges it by ear: if its content belongs to one class and no other class is clearly audible, every frame in that stretch is assigned that class, and the labeled frames are used to train the model.
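The clip-level labeling described above can be sketched as follows: every frame of a hand-picked clip inherits the single class a listener assigned to the clip, and the labeled frames are written out as lines of a feature file. This is only a sketch under assumptions. The source names the five classes but not their numeric IDs or the feature file layout, so the IDs in `CLASSES` and the sparse `label index:value` text layout (the style commonly used by SVM tools) are choices made for illustration, and the function names are hypothetical.

```python
# Class names follow the embodiment; the numeric IDs are an assumption.
CLASSES = {
    "standard_mandarin": 1,
    "music": 2,
    "background_noise": 3,
    "silence": 4,
    "non_standard_mandarin": 5,
}


def label_clip_frames(clip_features, clip_label):
    """Give every frame of a clip the one label a human listener gave the clip."""
    class_id = CLASSES[clip_label]
    return [(class_id, features) for features in clip_features]


def to_feature_lines(labelled_frames):
    """Format each frame as '<label> 1:<v1> 2:<v2> ...' (assumed file layout)."""
    lines = []
    for class_id, features in labelled_frames:
        cols = " ".join(f"{i}:{v:.6g}" for i, v in enumerate(features, start=1))
        lines.append(f"{class_id} {cols}")
    return lines
```

In a real run each feature list would hold the 13 values per frame; the resulting lines could then be handed to whichever SVM trainer is in use.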

In this embodiment, classification prediction proceeds as follows: for the news video to be classified, the audio features composed of log energy and MFCC are extracted from the news audio, and the classifier model trained with the support vector machine is then used to label the audio automatically.

In this embodiment, the classification results are then corrected. Because classification is labeled per frame and each frame is 23 milliseconds long, the standard Mandarin segments initially segmented from the news video can contain one, or up to M (M a positive integer), isolated frames of a different class inside a run of consecutive frames of the same class. Since a run of same-class frames is not expected to contain a few sporadic frames of another class, this embodiment treats those frames as misclassified and relabels them with the class of the surrounding run. In practice M can be chosen to be at most 10: if 10 or fewer frames of a different class appear in the middle of a continuous stretch of same-class audio, those frames can be judged to be misclassified.
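The correction step can be sketched as run-length smoothing over the per-frame labels. The version below is one possible reading, not the patent's stated procedure: a run of at most `max_isolated` frames is relabeled only when the runs on both sides carry the same class, and merged runs are re-checked so that nested short runs are also absorbed. The function names and the re-check order are assumptions.

```python
def to_runs(labels):
    """Run-length encode a label sequence into [(label, length), ...]."""
    runs = []
    for label in labels:
        if runs and runs[-1][0] == label:
            runs[-1] = (label, runs[-1][1] + 1)
        else:
            runs.append((label, 1))
    return runs


def smooth_labels(labels, max_isolated=10):
    """Relabel short runs that sit between two runs of the same class.

    max_isolated plays the role of M in the text (at most 10 in practice).
    """
    runs = to_runs(labels)
    i = 1
    while i < len(runs) - 1:
        _, length = runs[i]
        left_label, left_len = runs[i - 1]
        right_label, right_len = runs[i + 1]
        if length <= max_isolated and left_label == right_label:
            # Absorb the isolated run into its neighbours and merge all three.
            runs[i - 1:i + 2] = [(left_label, left_len + length + right_len)]
            i = max(i - 1, 1)  # the merged run may itself now be isolated
        else:
            i += 1
    return [label for label, length in runs for _ in range(length)]
```

For example, 3 frames labeled music between two 20-frame runs of standard Mandarin are relabeled as standard Mandarin, while an 11-frame music run is left untouched when `max_isolated` is 10. Short runs at the very start or end of the sequence have only one neighbor and are left as they are.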

(2) Use a speech recognition system to recognize the standard Mandarin segments of the news video and convert them into text.

Existing speech recognition software is used for this step: the standard Mandarin speech segmented out by the invention is fed in, and the software performs the recognition.
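Before the segmented speech can be passed to a recognizer, the per-frame labels have to be turned into time intervals. The sketch below shows one way to do that, assuming non-overlapping 512-sample frames at 22050 Hz so that frame i starts at i * 512 / 22050 seconds. The function name and the label string `standard_mandarin` are hypothetical.

```python
def standard_speech_segments(labels, target="standard_mandarin",
                             frame_size=512, sample_rate=22050):
    """Return (start_sec, end_sec) for each maximal run of target-class frames.

    Assumes non-overlapping frames, so frame i covers
    [i * frame_size / sample_rate, (i + 1) * frame_size / sample_rate).
    """
    frame_sec = frame_size / sample_rate
    segments = []
    start = None
    for i, label in enumerate(labels):
        if label == target and start is None:
            start = i                      # a target run begins
        elif label != target and start is not None:
            segments.append((start * frame_sec, i * frame_sec))
            start = None                   # the run ended at frame i
    if start is not None:                  # run extends to the last frame
        segments.append((start * frame_sec, len(labels) * frame_sec))
    return segments
```

Each interval can then be cut from the audio track and submitted to the speech recognition software, keeping the start time so that recognized text can later be mapped back to a position in the video.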

(3) Retrieve the corresponding video material according to the text obtained in step (2), thereby achieving retrieval from spoken text to news video.
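The source does not say how the text is matched against a query, so the following is only a minimal sketch of step (3) under a simple assumption: each recognized segment keeps its video ID and time span, and a query matches a segment when the query string occurs in the segment's text. The `TranscribedSegment` type, the `search` function and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TranscribedSegment:
    video_id: str     # which news video the segment came from
    start_sec: float  # segment start within the video
    end_sec: float    # segment end within the video
    text: str         # recognizer output for the segment


def search(segments, query):
    """Return the segments whose recognized text contains the query string."""
    query = query.strip()
    if not query:
        return []
    return [seg for seg in segments if query in seg.text]


# Illustrative data only; not taken from the patent's experiment.
archive = [
    TranscribedSegment("news_001", 12.0, 45.5, "伊拉克战争的最新进展"),
    TranscribedSegment("news_001", 60.0, 92.0, "禽流感防控工作"),
    TranscribedSegment("news_002", 5.0, 30.0, "足球比赛结果"),
]
hits = search(archive, "伊拉克战争")
```

Substring matching is used here because it needs no word segmentation for Chinese text; a production system would more likely build an index over the transcripts.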

The experimental results below show that, compared with existing methods, the invention not only recognizes the speech in news video without speaker training but also obtains the text that reflects the main content of the news video, which provides strong support for text-to-video news retrieval and shows that the invention performs well in news video retrieval based on speech classification and recognition.

This embodiment uses a news program lasting 1 hour, 4 minutes and 6 seconds, which mainly contains the following sounds: 1) news program previews with background music; 2) advertisements; 3) weather forecasts; 4) non-standard Mandarin, such as the dialect of an interviewee; 5) standard Mandarin. The recognition rate for 1)-4) is very low. For this news program, existing speech recognition software recognizes only 6.2% of 4), the non-standard Mandarin, correctly, and the rates for 1)-3) are lower still, to the point that they are essentially unrecognizable.

The goal of the invention is to segment the standard Mandarin segments out of the news video and then have speech recognition software recognize the standard Mandarin part, obtaining the main text content of the news video.

1. Segmentation results for standard Mandarin

This embodiment uses the following two measures to evaluate how well the invention segments standard Mandarin out of the news video:

Precision = standard Mandarin correctly segmented by the invention / all audio segmented by the invention as standard Mandarin

Recall = standard Mandarin correctly segmented by the invention / all standard Mandarin contained in the news video

Here, all the standard Mandarin contained in the news video was determined by a person listening to the video, whereas the segmentation by the invention is performed automatically by computer. On this news video the invention segments standard Mandarin with a precision of 99.05% and a recall of 96.69%, which is a very good result.
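The two measures can be computed as in the sketch below. It counts frames, which is an assumption: the source does not say whether the ratios were taken over frames, segments or duration. `predicted` holds the automatic labels, `reference` holds the labels from manual listening, and both names are hypothetical.

```python
def precision_recall(predicted, reference, target="standard_mandarin"):
    """Frame-level precision and recall for one target class.

    precision = correct target frames / frames predicted as target
    recall    = correct target frames / frames that truly are target
    """
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same length")
    correct = sum(1 for p, r in zip(predicted, reference)
                  if p == target and r == target)
    n_predicted = sum(1 for p in predicted if p == target)
    n_reference = sum(1 for r in reference if r == target)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_reference if n_reference else 0.0
    return precision, recall
```

Returning 0.0 when a denominator is zero is a convention chosen here to keep the function total; other conventions are equally defensible.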

2. Recognition results for standard Mandarin

When existing speech recognition software is applied to the standard Mandarin segmented out by the invention, the correct recognition rate reaches 61.42%. Without the invention, recognizing the whole news program gives a correct recognition rate of only 47.6%. The consequence is that, because sounds other than standard Mandarin are recognized very poorly, leaving them in means the final output mixes correct and incorrect results, and the computer cannot tell which is which. Video retrieval on that output, for example a query for "Iraq War" to display the corresponding video, then returns many wrong results. With the invention, the standard Mandarin is essentially segmented out. Standard Mandarin includes the parts read by the announcer, which already reflect the main content of the whole news video, and the invention recognizes only the standard Mandarin. The invention thus obtains the text that reflects the main content of the news video, and retrieval results improve considerably.

The method of the invention is not limited to the embodiment described in the detailed description. For example, standard speech is not limited to China's standard Mandarin: for any country or region whose language has a standard pronunciation, text-to-video news retrieval can be achieved on the same principle and by the same method. Other embodiments derived by those skilled in the art from the technical solution of the invention likewise fall within the scope of the technical innovation of the invention.

Claims (8)

1. A news video retrieval method based on speech classification and recognition, comprising the following steps: (1) using a sound classifier to segment out the speech segments of standard speech in the news video, where standard speech means speech with standard pronunciation; (2) using a speech recognition system to recognize the standard-speech segments of the news video and convert them into text; (3) retrieving the corresponding video material according to the text obtained in step (2), thereby achieving retrieval from spoken text to news video.

2. The news video retrieval method based on speech classification and recognition according to claim 1, characterized in that the standard speech is Mandarin with standard pronunciation.

3. The news video retrieval method based on speech classification and recognition according to claim 1, characterized in that, when the sound classifier is used in step (1) to segment out the standard-speech segments of the news video, audio classification uses a classification model based on a support vector machine and consists of two parts, classifier model training and classification prediction, and the audio feature extracted for classification is a 13-dimensional feature vector composed of log energy and Mel-frequency cepstral coefficients.

4. The news video retrieval method based on speech classification and recognition according to claim 3, characterized in that, in step (1), the classifier model is trained as follows: training samples are selected; the audio features composed of log energy and Mel-frequency cepstral coefficients are extracted from each sample; all these features are written to a feature file; and a support vector machine is then used to generate the classifier model; the training samples cover five classes: 1) standard speech; 2) music; 3) background noise; 4) silence; 5) non-standard speech; classification is performed frame by frame, with each audio frame assigned a class; the training samples are likewise labeled per frame, and the labeled classes are used to train the model.

5. The news video retrieval method based on speech classification and recognition according to claim 3 or 4, characterized in that, in step (1), during classification prediction, the audio features composed of log energy and Mel-frequency cepstral coefficients are extracted from the audio of the news video to be classified, and the classifier model trained with the support vector machine is then used to label the audio automatically.

6. The news video retrieval method based on speech classification and recognition according to claim 1, 3 or 4, characterized in that, in step (1), the standard-speech segments initially segmented from the news video are corrected: if one, or up to M, isolated frames of a different class suddenly appear within a run of consecutive frames of the same class, where M is a positive integer, those frames are judged to be misclassified and are relabeled with the class of the surrounding run.

7. The news video retrieval method based on speech classification and recognition according to claim 5, characterized in that, in step (1), the standard-speech segments initially segmented from the news video are corrected: if one, or up to M, isolated frames of a different class suddenly appear within a run of consecutive frames of the same class, where M is a positive integer, those frames are judged to be misclassified and are relabeled with the class of the surrounding run; and the standard speech is Mandarin with standard pronunciation.

8. The news video retrieval method based on speech classification and recognition according to claim 7, characterized in that, when the standard-speech segments initially segmented from the news video are corrected, M is less than or equal to 10: if 10 or fewer frames of a different class appear in the middle of a continuous stretch of same-class audio, those frames are judged to be misclassified.
CNB2006100079659A 2006-02-24 2006-02-24 A News Video Retrieval Method Based on Speech Classification Recognition Expired - Fee Related CN100508587C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006100079659A CN100508587C (en) 2006-02-24 2006-02-24 A News Video Retrieval Method Based on Speech Classification Recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006100079659A CN100508587C (en) 2006-02-24 2006-02-24 A News Video Retrieval Method Based on Speech Classification Recognition

Publications (2)

Publication Number Publication Date
CN1825936A CN1825936A (en) 2006-08-30
CN100508587C true CN100508587C (en) 2009-07-01

Family

ID=36936335

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006100079659A Expired - Fee Related CN100508587C (en) 2006-02-24 2006-02-24 A News Video Retrieval Method Based on Speech Classification Recognition

Country Status (1)

Country Link
CN (1) CN100508587C (en)


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
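Research and Implementation of iView, an Open-Architecture Digital Video Management System. Wang Wei, Lü Rongcong, Wu Defeng, Zhang Jun, Tang Daquan, Li Zhiqiang. Journal of National University of Defense Technology, Vol. 25, No. 5. 2003 *
Research on Audio Classification and Segmentation Techniques. Bai Liang. 2005 *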

Also Published As

Publication number Publication date
CN1825936A (en) 2006-08-30

Similar Documents

Publication Publication Date Title
CN100508587C (en) A News Video Retrieval Method Based on Speech Classification Recognition
US6434520B1 (en) System and method for indexing and querying audio archives
US10977299B2 (en) Systems and methods for consolidating recorded content
CN103700370B (en) A radio and television speech recognition method and system
CN102779508B Sound bank generation apparatus and method therefor, speech synthesis system and method thereof
CN103956169B (en) A pronunciation input method, device and system
US7487094B1 (en) System and method of call classification with context modeling based on composite words
Zhou et al. Efficient audio stream segmentation via the combined T² statistic and Bayesian information criterion
WO2020211354A1 (en) Speaker identity recognition method and device based on speech content, and storage medium
US20190318737A1 (en) Dynamic gazetteers for personalized entity recognition
JP4220449B2 (en) Indexing device, indexing method, and indexing program
CN112133277B (en) Sample generation method and device
CN111128223A (en) Text information-based auxiliary speaker separation method and related device
CN1662956A (en) Mega speaker identification (ID) system and corresponding methods therefor
CN1293428A (en) Information check method based on speed recognition
CN102122506A (en) Method for recognizing voice
CN1870728A (en) Method and system for automatic subtilting
CN107480152A (en) A kind of audio analysis and search method and system
CN112599114B (en) Voice recognition method and device
CN112614514B (en) Effective voice fragment detection method, related equipment and readable storage medium
CN117153175A (en) Audio processing method, device, equipment, medium and product
CN119673173A (en) A streaming speaker log method and system
JP2011053569A (en) Audio processing device and program
Hafen et al. Speech information retrieval: a review
JP4132590B2 (en) Method and apparatus for simultaneous speech recognition, speaker segmentation and speaker classification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220916

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: Peking University

Patentee after: PEKING University FOUNDER R & D CENTER

Address before: 100871, fangzheng building, 298 Fu Cheng Road, Beijing, Haidian District

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: Peking University

Patentee before: PEKING University FOUNDER R & D CENTER

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090701
