CN1924996B

CN1924996B - System and method for selecting audio content by using speech recognition

Info

Publication number: CN1924996B
Application number: CN2005100991147A
Authority: CN
Inventors: 沈家麟; 洪健洲
Original assignee: Delta Electronics Inc
Current assignee: Delta Electronics Inc
Priority date: 2005-08-31
Filing date: 2005-08-31
Publication date: 2011-06-29
Anticipated expiration: 2025-08-31
Also published as: CN1924996A

Abstract

A system and method for selecting audio content by using speech recognition, used to obtain an audio sentence from sequentially played audio content and pass it to a processing system for processing. The system includes: a playback module that plays the audio content; a receiving module that receives, in real time, a voice input sentence uttered by a user; a buffer module that temporarily stores the audio content played by the playback module within a specified interval, together with the user's voice input sentence; a recognition module that retrieves the buffered audio content and voice input sentence and performs speech recognition on them; and a conversion module that, based on the audio sentence the recognition module finds to best match the voice input sentence, produces the corresponding text sentence and transmits it to the processing system for processing.

Description

System and method for selecting audio content by using speech recognition

Technical field

The present invention relates to a system and method for selecting audio content, and in particular to a system and method that use speech recognition to select a specific audio segment from audio content so that the segment can undergo further processing.

Background art

Information today is mostly expressed as written text content, and such content often contains important or key phrases. A system can mark these key phrases so that users can recognize them easily, for example with highlighting, underlining, quotation marks, a different color, or a different font; alternatively, the user can select them with a keyboard, mouse, stylus, or other selection tool. The selected key phrases can then be used for purposes such as advanced search or keyword indexing. For example, a website can add hyperlinks to key phrases in its web pages so that clicking one leads to another page, and a user reading an article on a computer screen can select a key phrase with the mouse and paste it into an Internet search engine to find related articles.

Most information content today is presented as text, and content expressed purely as sound is still in the minority. However, as mobile devices become more widespread, screen-size limits make it more convenient to "listen to" some information than to "look at" it. With Bluetooth headsets and unlimited Internet access also becoming common, more and more information is delivered as audio content to be heard, and how to select key phrases from such audio content has become a problem that needs to be solved.

Moreover, written text content that is "looked at" conveys its information in parallel, whereas audio content that is "listened to" conveys its information sequentially. Existing selection tools for written text, such as hyperlinks or selecting a key phrase with the mouse, therefore cannot be used to select audio content, and there is a growing need for ways in which users can interact effectively with audio content.

In summary, because current techniques for selecting key phrases from audio content remain inadequate, the inventors, in view of these shortcomings of the prior art, devised the present invention, a "system and method for selecting audio content by using speech recognition".

Summary of the invention

The main object of the present invention is to provide a system and method for selecting audio content by using speech recognition, which apply existing speech recognition methods in a suitable combination to enable effective interaction between audio content and the user.

Another object of the present invention is to provide a system and method for selecting audio content by using speech recognition in which, after a segment of audio content has been played, speech recognition is performed on the voice input sentence uttered by the user and on the audio content played within a specified interval before that utterance, so that a specific audio sentence is selected from the audio content for subsequent processing.

A further object of the present invention is to provide a system for selecting audio content, used to obtain an audio sentence from sequentially played audio content for processing in a processing system. The system comprises: a playback module for playing the audio content; a receiving module for receiving, in real time, a voice input sentence uttered by a user; a buffer module for temporarily storing the audio content played by the playback module within a specified interval together with the voice input sentence uttered by the user, the specified interval being the audio content that the playback module played within a last specified time, counted back from the moment the receiving module receives the voice input sentence; a recognition module for retrieving from the buffer module the audio content of the specified interval and the voice input sentence, performing speech recognition, and identifying by comparison the audio sentence in that audio content that best matches the voice input sentence; and a conversion module, connected to the recognition module, for converting the audio sentence identified by the recognition module as the best match into a corresponding text sentence, which is then provided to the processing system for processing.

According to the above concept, the system further comprises a source database that may contain a plurality of text contents. The conversion module can then also be connected to the source database and the playback module, so as to retrieve a text content from the source database, convert it into the audio content, and play it through the playback module.

According to the above concept, the source database may instead contain a plurality of text contents together with voice information, in which case the playback module retrieves the voice data from the source database to play the audio content.

According to the above concept, the last specified time is 20 seconds.

According to the above concept, the processing system is a voice dialogue system, an index classification system, a control system, or an advanced search system. If the processing system is the advanced search system, a retrieval module can be used to retrieve text or voice information related to the text sentence for the user.

A further object of the present invention is to provide a system for selecting audio content, used to obtain an audio sentence from sequentially played audio content, wherein the audio content additionally carries a plurality of audio markers that mark a plurality of key phrases in the audio content. The system comprises: a playback module for playing the audio content carrying the audio markers; a receiving module for receiving, in real time, a voice input sentence uttered by a user; a recognition module that performs speech recognition on the key phrases of the audio content and on the voice input sentence, and identifies by comparison the audio sentence among the key phrases that best matches the voice input sentence; a buffer module for temporarily storing the audio content played by the playback module within a specified interval together with the voice input sentence, the recognition module retrieving both from the buffer module for recognition; and a conversion module for converting the audio sentence identified as the best match into a corresponding text sentence.

According to the above concept, the recognition module performs speech recognition either by directly comparing the two sound waveforms and finding the closest match, or by a method selected from a hidden Markov model (HMM), a neural network, dynamic time warping (DTW), and speech template matching.

According to the above concept, an audio marker denotes a key phrase by a different speed, a different tone, or a different volume, or by adding a prompt sound before and after the key phrase.

According to the above concept, the text sentence produced by the conversion module is then provided to a processing system for subsequent processing.

A further object of the present invention is to provide a method for selecting audio content, used to obtain an audio sentence from sequentially played audio content for a subsequent processing procedure. The method comprises the following steps: (a) playing the audio content; (b) receiving a voice input sentence uttered by a user; (c) performing speech recognition on the voice input sentence and on the audio content played within a specified interval; and (d) identifying by comparison, from the audio content within the specified interval, the audio content that best matches the voice input sentence, and then performing the subsequent processing procedure.

According to the above concept, the audio content further carries a plurality of audio markers that mark a plurality of key phrases in the audio content.

According to the above concept, step (c) further includes performing speech recognition on the voice input sentence and on those portions of the audio content within the specified interval that carry one of the key phrases.

According to the above concept, step (d) further includes identifying by comparison, from among the key phrases, the audio sentence that best matches the voice input sentence uttered by the user.

According to the above concept, step (c) performs speech recognition either by directly comparing the two sound waveforms and finding the closest match, or by a method selected from a hidden Markov model, a neural network, dynamic time warping, and speech template matching.

According to the above concept, step (d) further includes a step (d1) of converting the audio content into a text sentence.

According to the above concept, the subsequent processing procedure is an advanced search step, a keyword indexing step, a voice dialogue system, or a control program.

The effects and objects of the present invention can be understood more fully from the following description of the embodiments.

Description of the drawings

FIG. 1(A) is a simplified schematic diagram of the configuration of a system for selecting audio content by using speech recognition according to a first preferred embodiment of the present invention.

FIG. 1(B) is a simplified schematic diagram of the configuration of a system for selecting audio content by using speech recognition according to a second preferred embodiment of the present invention.

FIG. 2 is a flowchart of a method for selecting audio content by using speech recognition according to a preferred embodiment of the present invention.

Detailed description of the embodiments

Those of ordinary skill in the art should understand that the following description of the present invention is illustrative only and is not intended to limit the invention.

The following describes the system and method for selecting audio content by using speech recognition according to preferred embodiments of the present invention. The actual architecture and method adopted need not conform exactly to those described, and those of ordinary skill in the art can make various changes and modifications without departing from the true spirit and scope of the invention.

Please refer to FIG. 1(A) and FIG. 1(B), which are simplified system architecture diagrams of the system and method for selecting audio content by using speech recognition disclosed in the present invention. The selection system 10 comprises a playback module 11, a receiving module 12, a buffer module 13, a recognition module 14, a conversion module 15, and a source database 16. It selects an audio sentence from the audio content played by the playback module 11 and can provide it to a processing system 17 for subsequent processing.

The playback module 11 plays the audio content so that a user hears it in chronological order, and the receiving module 12 receives, in real time, a voice input sentence uttered by the user. The buffer module 13 temporarily stores the audio content played by the playback module 11 within a specified interval, together with the voice input sentence received by the receiving module 12. The recognition module 14 retrieves the audio content of the specified interval and the voice input sentence from the buffer module 13, performs speech recognition, and identifies by comparison the audio sentence in that audio content that best matches the voice input sentence. The conversion module 15 converts the audio sentence identified by the recognition module 14 into a corresponding text sentence, and the source database 16 supplies the source of the audio content played by the playback module 11.

The architecture of the selection system 10 differs slightly depending on the kind of information stored in the source database 16.

Referring to FIG. 1(A), which shows the selection system 10 of the first embodiment, the source database 16 contains a plurality of text contents. The conversion module 15 is therefore also connected to the source database 16 and the playback module 11: it retrieves one of the text contents from the source database 16, converts it into the audio content, and plays it through the playback module 11. The audio content to be played can also be stored in the buffer module 13 via the conversion module 15.

If instead the source database 16 contains a plurality of text contents together with voice information, as shown in FIG. 1(B), the source database 16 need not be connected to the conversion module 15. The playback module 11 retrieves the voice data directly from the source database 16 and plays it as the audio content, and the source database 16 can also store the audio content to be played in the buffer module 13.

Because the user hears the audio content in chronological order, the voice input sentence uttered by the user usually refers to audio content that was just heard. The present invention therefore defines the specified interval as the audio content that the playback module 11 played within a last specified time, counted back from the moment the receiving module 12 receives the voice input sentence, and stores the audio content of that interval temporarily in the buffer module 13. The last specified time can be set to 20 seconds or to any other duration. When the receiving module 12 receives the voice input sentence, that sentence is also stored in the buffer module 13. The recognition module 14 then only needs to retrieve the audio content and the voice input sentence stored in the buffer module 13 and use speech recognition to identify by comparison the audio sentence in the specified interval that best matches the voice input sentence. The conversion module 15 can also convert the selected audio sentence into a text sentence, which is then provided to the processing system 17 for processing.
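The rolling interval held by the buffer module can be sketched as a small buffer that discards anything played longer ago than the last specified time. This is a minimal illustration rather than the patented implementation: the class name `PlaybackBuffer`, its methods, and the representation of audio as timestamped chunks are assumptions made for the example; only the 20-second default comes from the description.

```python
from collections import deque


class PlaybackBuffer:
    """Hypothetical buffer keeping only chunks played in the last `window_s` seconds."""

    def __init__(self, window_s=20.0):
        self.window_s = window_s
        self._chunks = deque()  # entries are (start_s, end_s, payload)

    def push(self, start_s, end_s, payload):
        """Record a chunk that has just finished playing at time `end_s`."""
        self._chunks.append((start_s, end_s, payload))
        self._evict(end_s)

    def _evict(self, now_s):
        # Drop chunks that ended before the window opened.
        while self._chunks and self._chunks[0][1] <= now_s - self.window_s:
            self._chunks.popleft()

    def snapshot(self, now_s):
        """Return the payloads still inside the window at time `now_s`."""
        self._evict(now_s)
        return [payload for _, _, payload in self._chunks]
```

Calling `snapshot` with the time at which the voice input sentence arrives yields the audio content that a recognizer would compare against.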

The processing system 17 can be a voice dialogue system, an index classification system, a control system, an advanced search system, or the like, and can carry out different subsequent processing procedures according to different needs. For example, the voice dialogue system can conduct a voice dialogue based on the meaning of the text sentence; the index classification system can apply a keyword indexing procedure to the audio content; the control system can operate other programs by interpreting the meaning of the text sentence; and the advanced search system can pass the text sentence to a retrieval module (not shown) to retrieve related text or voice information for the user.

Because the processing system 17 performs different subsequent processing procedures for different needs, what the selection system 10 must supply also differs. For example, if the processing system 17 is the index classification system, it may only need the selection system 10 to provide the audio content for index classification, whereas if the processing system 17 is the voice dialogue system, the control system, or the advanced search system, it may need the selection system 10 to provide the text sentence for further analysis. The selection system 10 can therefore send either the audio sentence or the text sentence to the processing system 17 according to its type. In terms of the actual information flow, if the selection system 10 is to send the audio sentence, the recognition module 14 can send it to the processing system 17; if the selection system 10 is to send the text sentence, the conversion module 15 can send the converted text sentence to the processing system 17.
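The choice between sending the audio sentence and sending the text sentence can be pictured as a simple dispatch on the type of processing system. The sketch below is only an illustration: the `Selection` type, the string labels for the four processing systems, and the function `payload_for` are hypothetical names introduced here, and the mapping follows the examples given in the description rather than any fixed rule.

```python
from dataclasses import dataclass


@dataclass
class Selection:
    audio: bytes  # matched audio sentence (would come from the recognition module)
    text: str     # corresponding text sentence (would come from the conversion module)


# Hypothetical labels for the processing systems named in the description.
NEEDS_AUDIO = {"index_classification"}
NEEDS_TEXT = {"voice_dialogue", "control", "advanced_search"}


def payload_for(system_kind, selection):
    """Return ('audio', bytes) or ('text', str) depending on the processing system."""
    if system_kind in NEEDS_AUDIO:
        return ("audio", selection.audio)
    if system_kind in NEEDS_TEXT:
        return ("text", selection.text)
    raise ValueError(f"unknown processing system: {system_kind}")
```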

The recognition module 14 performs speech recognition by either a direct waveform comparison or an acoustic model comparison. The direct waveform comparison compares the two sound waveforms directly and finds the closest match. The acoustic model comparison performs speech recognition with any of various acoustic models, such as a hidden Markov model (HMM), a neural network, dynamic time warping (DTW), or speech template matching.
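To illustrate one of the listed techniques, the following sketch computes a dynamic time warping distance between two one-dimensional sequences and picks the closest candidate. It assumes plain lists of numbers standing in for real acoustic features and an absolute-difference local cost; the function names `dtw_distance` and `best_match` are hypothetical, and the patent does not prescribe this particular formulation.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def best_match(query, candidates):
    """Return the key of the candidate sequence closest to `query` under DTW."""
    return min(candidates, key=lambda name: dtw_distance(query, candidates[name]))
```

Because DTW allows one sequence to stretch against the other, a phrase repeated by the user more slowly or more quickly than the played audio can still align with low cost.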

Please refer to FIG. 2, a flowchart of the method for selecting audio content by using speech recognition according to the present invention. In the method, the system first plays a segment of audio content (21) and then receives a voice input sentence uttered by the user (22). Speech recognition is performed on the voice input sentence and on the audio content within a specified interval of the played audio content (23), and the audio content that best matches the voice input sentence is identified by comparison from within the specified interval (24). A subsequent processing procedure is then performed (25), which can be an advanced search step, a keyword indexing step, a voice dialogue system, or a control program. As described above, when the subsequent processing procedure requires text information, the method can also convert the audio content into a text sentence for that procedure to process.
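The flow of FIG. 2 can be approximated in a few lines if text stands in for audio. The sketch below is a simplification under stated assumptions: the played sentences and the user's utterance are already available as text, and string similarity from Python's `difflib` replaces the acoustic comparison described in the patent. The function name `select_sentence` and its parameters are hypothetical.

```python
from difflib import SequenceMatcher


def select_sentence(played, spoken, now_s, window_s=20.0):
    """Pick the recently played sentence most similar to the user's utterance.

    played: list of (end_time_s, sentence_text) in playback order.
    spoken: the user's utterance, assumed here to be available as text.
    """
    # Restrict to sentences played inside the specified interval (step 23).
    recent = [text for end_s, text in played if now_s - window_s <= end_s <= now_s]
    if not recent:
        return None

    # Keep the sentence that best matches the utterance (step 24).
    def similarity(text):
        return SequenceMatcher(None, text.lower(), spoken.lower()).ratio()

    return max(recent, key=similarity)
```

The returned sentence would then be handed to whatever subsequent processing procedure applies (step 25).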

In addition, to make speech recognition more efficient, the present invention can actively add markers to the audio content, so that the audio content carries a plurality of audio markers marking a plurality of key phrases. The user thus knows, while listening, that a given phrase is a key phrase. An audio marker denotes a key phrase by a different speed, tone, or volume, or by adding a prompt sound before and after the key phrase.

The audio markers can be stored in the source database 16 shown in FIG. 1(A) and FIG. 1(B). Whether the source database 16 stores text content only or both text content and voice information, a simple system setting suffices to play audio content carrying audio markers. For example, voice information can directly store spoken key phrases that carry a particular audio marker, while text content can annotate specific text segments with the form of audio marker to be applied, so that the marker is rendered later during text-to-speech conversion.
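One way to picture annotated text content is as a list of segments, each optionally tagged with the marker to apply when it is later converted to speech. The sketch below turns such a list into a playback plan. The marker labels (`"beep"`, `"louder"`, and so on), the tuple layout of the plan, and the function `plan_playback` are assumptions for illustration; no real text-to-speech API is called.

```python
def plan_playback(segments):
    """Turn annotated text segments into a list of playback instructions.

    segments: list of (text, mark), where mark is None for ordinary text,
    'beep' for a prompt sound before and after the phrase, or a style label
    such as 'louder', 'slower' or 'higher_pitch' (all hypothetical labels).
    """
    plan = []
    for text, mark in segments:
        if mark is None:
            plan.append(("speak", text, {}))
        elif mark == "beep":
            # Prompt sound before and after the key phrase.
            plan.extend([("tone", "beep"), ("speak", text, {}), ("tone", "beep")])
        else:
            # Speed, tone or volume change applied to the key phrase itself.
            plan.append(("speak", text, {"style": mark}))
    return plan
```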

Speech recognition can then be restricted to the marked audio content within the specified interval, which both saves recognition time and raises the recognition rate. From a purely technical standpoint, however, the selection system of the present invention need not specify an interval of the audio content at all: it can compare the entire audio content directly with the voice input sentence, or compare only the marked key phrases in the entire audio content with the voice input sentence.
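The four search scopes mentioned here (interval or whole content, marked phrases only or everything) amount to two independent filters on the candidate list. The following sketch shows that narrowing step only; the dictionary keys, the function name `candidate_phrases`, and the convention that `now_s=None` means "no interval" are assumptions made for the example.

```python
def candidate_phrases(played, now_s=None, window_s=20.0, marked_only=True):
    """Select which played items are compared against the voice input.

    played: list of dicts with keys 'end_s', 'text' and 'marked'.
    now_s=None applies no interval, so the whole content is searched.
    """
    result = []
    for item in played:
        in_window = now_s is None or (now_s - window_s <= item["end_s"] <= now_s)
        if in_window and (item["marked"] or not marked_only):
            result.append(item["text"])
    return result
```

Fewer candidates means fewer comparisons for the recognizer, which is the source of the time saving described above.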

The audio content selection technique of the present invention thus selects a suitable audio sentence in real time and provides a convenient mechanism for users to interact effectively with sequentially presented audio content. It improves substantially on the previous situation, in which users could only listen passively to audio content to obtain information, and in which audio content, unlike written text content presented in parallel, lacked tools to help people interact with it.

In practice, the present invention is applicable to all kinds of interactive devices that convey information through audio content, such as mobile devices, Bluetooth devices, or Internet access devices. With the audio content selection mechanism provided by the present invention, a user can easily select a desired audio sentence from the audio content for use in subsequent processing or services, without special training and without memorizing special operating commands.

In summary, the present invention provides a system and method for selecting audio content by using speech recognition that overcome the inability of conventionally played audio content to interact with the user. They combine existing speech recognition techniques with suitable information access techniques and a special audio marking scheme, so that speech recognition is performed on the user's voice input sentence and on the played audio content, a specific audio sentence is selected from that content, and various subsequent processing procedures can follow. The technique requires little additional software or hardware, and its implementation cost is very low. The audio content selection system and method of the present invention are thus technically simple yet highly convenient, require no special training or learning by the user, and can be applied in any field where information is expressed through sound. Because the invention is simple, widely applicable, and of real industrial value, an application for an invention patent is filed in accordance with the law.

The foregoing describes the present invention in detail by way of preferred embodiments and does not limit its scope. Those of ordinary skill in the art will appreciate that minor changes and adjustments may be made without losing the essence of the invention or departing from its spirit and scope, and all such variations should be regarded as further embodiments of the present invention.

The scope claimed for the present invention is defined by the claims.

Claims

1. A system for selecting sound content, for obtaining a sound sentence from a sound content played in sequence and then processing it in a processing system, the system comprising:

a playback module for playing the sound content;

a receiving module for receiving, in real time, a voice input sentence uttered by a user;

a buffer module for temporarily storing the sound content within a specified interval played by the playback module and the voice input sentence uttered by the user, wherein the specified interval is the already-heard sound content that the playback module played during a preceding specified time when the receiving module receives the voice input sentence;

a recognition module for retrieving, from the buffer module, the already-heard sound content within the specified interval and the voice input sentence, performing speech recognition on them, and thereby comparing and identifying the sound sentence in the sound content within the specified interval, wherein the sound sentence is the one that best matches the voice input sentence uttered by the user; and

a conversion module, connected to the recognition module, for converting the sound sentence identified by the recognition module into a corresponding text sentence, which is then provided to the processing system for processing.

2. The system as claimed in claim 1, characterized by further comprising: a source database having a plurality of text contents, wherein the conversion module is also connected to the source database and to the playback module, for retrieving a text content from the source database, converting it into the sound content, and playing it through the playback module; and/or a source database having a plurality of text contents and voice information, wherein the playback module retrieves the voice data of the source database to play the sound content.

3. The system as claimed in claim 1, characterized in that the preceding specified time is 20 seconds; the processing system is an advanced search system, wherein the processing system retrieves, through a retrieval module, related text or voice information corresponding to the text sentence for use by the user; and/or the processing system is one selected from a speech dialogue system, an index classification system, and a control system.
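
As an illustration only, the buffer behaviour recited in claims 1 and 3 (keeping the sound content heard during the preceding specified time, for example 20 seconds) might be sketched as a rolling buffer. The class name `RollingAudioBuffer`, the chunk representation, and the timestamp-based eviction are assumptions made for this sketch and are not taken from the patent.

```python
from collections import deque


class RollingAudioBuffer:
    """Keeps only the chunks played during the last `window_s` seconds (hypothetical helper)."""

    def __init__(self, window_s=20.0):
        self.window_s = window_s
        self._chunks = deque()  # (timestamp_seconds, chunk) pairs, oldest first

    def push(self, timestamp, chunk):
        """Store a chunk played at `timestamp`, then evict stale chunks."""
        self._chunks.append((timestamp, chunk))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop chunks played more than `window_s` seconds before `now`.
        while self._chunks and now - self._chunks[0][0] > self.window_s:
            self._chunks.popleft()

    def snapshot(self, now):
        """Return the chunks heard within the window that ends at `now`."""
        self._evict(now)
        return [chunk for _, chunk in self._chunks]


# Simulate 30 seconds of playback, one chunk per second, then a voice input at t = 30 s.
buf = RollingAudioBuffer(window_s=20.0)
for second in range(30):
    buf.push(float(second), f"chunk-{second}")
heard = buf.snapshot(now=30.0)
```

In this sketch, `snapshot` is what a recognition module would call at the moment the voice input sentence arrives, so that only recently heard content is compared.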

4. A system for selecting sound content, for obtaining a sound sentence from a sound content played in sequence, wherein the sound content further has a plurality of voice marks for marking a plurality of key terms in the sound content, the system comprising:

a playback module for playing the sound content having the voice marks;

a recognition module for performing speech recognition on the plurality of key terms of the sound content and on the voice input sentence, and thereby comparing and identifying the sound sentence among the key terms, wherein the sound sentence is the one that best matches the voice input sentence uttered by the user;

a buffer module for temporarily storing the sound content within a specified interval played by the playback module and the voice input sentence uttered by the user, wherein the recognition module retrieves, from the buffer module, the sound content within the specified interval and the voice input sentence uttered by the user and performs recognition on them; and

a conversion module for converting the sound sentence identified by the recognition module into a corresponding text sentence.

5. The system as claimed in claim 4, characterized in that the recognition module performs speech recognition either by directly comparing the sound waveforms of the two inputs to find the closest sound waveform, or by a method selected from a hidden Markov model method, a neural network method, a dynamic time warping method, and a voice template matching method; the voice mark representing a key term is expressed as a different speaking rate, a different pitch, or a different volume, or as a prompt tone added before and after the key term; and the text sentence is provided to a processing system for subsequent processing.
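
Dynamic time warping, one of the recognition methods named in claim 5, can be illustrated with a minimal sketch. The one-dimensional feature sequences, the absolute-difference local cost, and the helper names `dtw_distance` and `closest_key_term` are assumptions for illustration; the patent does not specify features or a cost function.

```python
def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def closest_key_term(spoken, key_terms):
    """Return the name of the key term whose sequence is nearest to `spoken`."""
    return min(key_terms, key=lambda name: dtw_distance(spoken, key_terms[name]))


# Made-up feature sequences for two marked key terms (illustrative data only).
key_terms = {
    "weather": [1, 3, 4, 3, 1],
    "traffic": [5, 5, 1, 1, 5],
}
# The user repeats "weather" more slowly, so some frames are duplicated.
spoken = [1, 1, 3, 4, 4, 3, 1]
match = closest_key_term(spoken, key_terms)
```

Because the warping path may repeat frames, the slower utterance aligns with the "weather" template at zero cost, which is why a time-warping comparison tolerates differences in speaking rate between the played content and the user's voice.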

6. A method for selecting sound content, for obtaining a sound sentence from a sound content played in sequence and then performing a subsequent processing step, the method comprising the following steps:

(a) playing the sound content;

(b) receiving a voice input sentence uttered by a user;

(c) performing speech recognition on the voice input sentence and on the sound content played within a specified interval; and

(d) comparing to find, in the sound content within the specified interval, the sound content that best matches the voice input sentence uttered by the user, and then performing the subsequent processing step; converting the sound content into a text sentence; and/or the subsequent processing step being an advanced search step, a keyword index step, a speech dialogue system, or an operation step.
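
The comparison in step (d) can be sketched under a strong simplifying assumption: that speech recognition in step (c) has already produced a transcript of the buffered sound content and a transcript of the voice input sentence. The functions `candidate_phrases` and `best_matching_phrase`, the word-window candidates, and the use of `difflib.SequenceMatcher` as the similarity measure are all stand-ins chosen for this sketch, not the patent's method.

```python
from difflib import SequenceMatcher


def candidate_phrases(words, max_len):
    """Yield every contiguous run of 1..max_len words as a candidate phrase."""
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            if start + length <= len(words):
                yield " ".join(words[start:start + length])


def best_matching_phrase(buffered_transcript, spoken_transcript, max_len=4):
    """Pick the phrase in the buffered transcript most similar to the spoken one."""
    words = buffered_transcript.lower().split()
    target = spoken_transcript.lower()
    return max(
        candidate_phrases(words, max_len),
        key=lambda phrase: SequenceMatcher(None, phrase, target).ratio(),
    )


# Hypothetical transcripts: what was played recently, and what the user said.
buffered = "the central bank raised interest rates again today"
selected = best_matching_phrase(buffered, "interest rates")
```

The selected phrase plays the role of the text sentence that would be handed to the subsequent processing step, such as an advanced search.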

7. The method as claimed in claim 6, characterized in that the sound content further has a plurality of voice marks for marking a plurality of key terms in the sound content; the sound content of step (c) comprises the plurality of key terms; step (d) further comprises comparing to find the sound sentence among the plurality of key terms, wherein the sound sentence is the one that best matches the voice input sentence uttered by the user; and/or the voice mark representing a key term is expressed as a different speaking rate, a different pitch, or a different volume, or as a prompt tone added before and after the key term.

8. The method as claimed in claim 6, characterized in that step (c) performs speech recognition either by directly comparing the sound waveforms of the two inputs to find the closest sound waveform, or by a method selected from a hidden Markov model method, a neural network method, a dynamic time warping method, and a voice template matching method.
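
The direct waveform comparison mentioned in claim 8 is not detailed in the claims. One possible reading, offered only as a sketch, slides the spoken waveform over the buffered waveform and scores each window with a normalized correlation (cosine similarity). The function names and the toy sample values below are assumptions for illustration.

```python
import math


def normalized_correlation(x, y):
    """Cosine similarity between two equal-length sample sequences."""
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0


def best_offset(buffered, spoken):
    """Slide `spoken` over `buffered`; return (offset, score) of the closest window."""
    best = (0, -1.0)
    for offset in range(len(buffered) - len(spoken) + 1):
        window = buffered[offset:offset + len(spoken)]
        score = normalized_correlation(window, spoken)
        if score > best[1]:
            best = (offset, score)
    return best


# Toy samples: the spoken input is a louder copy of the buffered segment at offset 2.
buffered = [0, 0, 1, -1, 2, -2, 0, 0, 1, 1]
spoken = [2, -2, 4, -4]
offset, score = best_offset(buffered, spoken)
```

Normalizing by the energy of each window makes the score insensitive to overall loudness, which matters when the user's voice and the played content differ in volume. Real speech from different speakers would rarely match this closely at the sample level, which is presumably why the claim also lists model-based alternatives.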