CN101169780A - A Semantic Ontology-Based Retrieval System and Method

Info

Publication number: CN101169780A
Application number: CNA2006101498039A
Authority: CN
Inventors: 王伟; 舒琦; 方琦; 钟杰萍
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2006-10-25
Filing date: 2006-10-25
Publication date: 2008-04-30

Abstract

An embodiment of the invention discloses a semantic ontology-based retrieval system that includes a semantic ontology index database and a semantic ontology search processing unit. The semantic ontology search processing unit obtains a text hit file list and matches it against the semantic ontology index in the semantic ontology index database to obtain a document semantic classification table. This enables the retrieval system to identify the semantic information of the retrieved files and to present the search results with a semantic classification. An embodiment of the invention also discloses a semantic ontology-based retrieval method. The method first establishes a semantic ontology index for each file that already has a text index. When a user performs a search, it applies semantic ontology index matching to the text matching result, so that the final output presents a semantic classification on top of the traditional text matching results, which makes querying more convenient for users.

Description

A Semantic Ontology-Based Retrieval System and Method

Technical Field

The invention relates to information retrieval technology, and in particular to a semantic ontology-based retrieval system and method.

Background Art

With the rapid development of retrieval technology, text-based information retrieval has gradually matured into a complete set of ideas and well-developed algorithms, and it is widely used in search engines such as Google, AltaVista, Lycos, and Yahoo.

Fig. 1 is a structural block diagram of an existing text search engine. As shown in Fig. 1, the existing text search engine comprises: a spider control module 101, a uniform resource locator (URL) database 102, a web spider 103, a URL extraction module 104, a web page database 105, a link information extraction module 106, a text index module 107, a link database 108, an index database 109, a web page rating module 110, and a query server 111.

The web spider 103 crawls web pages from the Internet and stores them in the web page database 105. The URL extraction module 104 extracts URLs from the pages crawled by the web spider 103 and stores them in the URL database 102. The spider control module 101 obtains page URLs from the URL database 102 and directs the web spider 103 to crawl further pages. These steps repeat until all web pages have been crawled.

The system obtains text information from the web page database 105 and passes it to the text index module 107, which builds an index and stores it in the index database 109. At the same time, the link information extraction module 106 obtains link information from the web page database 105 and stores it in the link database 108. The link information in the link database 108 gives the web page rating module 110 a basis for rating web pages.

When a user submits a query through the query server 111, the query server 111 searches the index database 109 for web pages related to the query. The web page rating module 110 combines the query with the link information in the link database 108 to evaluate the relevance of the search results. The query server 111 then sorts the results by relevance and assembles the final page that is returned to the user.

Although existing text retrieval technology can find files that contain the user's text query, it cannot identify the content and meaning of the files it finds. This is because it is based on text string matching. When different words express the same meaning, or one word has different meanings in different contexts, string matching limits both the precision and the recall of retrieval, so the results fall far short of the user's needs. For example, when the search keyword is "paradise" (天堂), the engine cannot tell whether a matching file is about a "paradise" game or "paradise" music. The Semantic Web offers an opportunity to solve these problems.

The Semantic Web is a network of web pages whose content can be automatically processed and recognized by computers. Building on the existing Internet, it extends web pages with machine-recognizable data and adds documents intended for computer use. That is, web pages are annotated in an ontology language to make their semantics explicit, so that web page information can be understood by people and also automatically processed and recognized by computers. Semantically annotated web pages generally mark up data with Extensible Markup Language (XML) or Hypertext Markup Language (HTML), use the Resource Description Framework (RDF) as the data description model, and rely on a semantic ontology to give the annotated data explicit semantics. Ontology is a concept that originated in philosophy, where it refers to the study of existence and its nature and laws. It was later adopted by the field of artificial intelligence, where it denotes an explicit specification of a conceptualization. An ontology expresses the concepts of a domain and their interrelationships explicitly and formally, and thereby makes the semantics of terms explicit, so it plays an important role in semantic querying. The semantic ontology referred to here defines the basic terms that make up the concepts of a subject domain and the relations among them, and specifies the rules for combining these terms and relations to extend the vocabulary.

The purpose of semantic retrieval is to enhance and improve traditional search results using data obtained from the Semantic Web. Fig. 2 is a structural block diagram of an existing semantic search system. As shown in Fig. 2, the existing semantic search system comprises: a query interface 201, a query preprocessing module 202, a semantic ontology reasoning engine 203, an annotation ontology library 204, a traditional search module 205, and a result return interface 206.

The query interface 201 obtains the user's query information and sends it to the query preprocessing module 202.

The query preprocessing module 202 analyzes the user's query information, splits it into query keywords using word segmentation, and sends the keywords to the semantic ontology reasoning engine 203.

Using the ontology concept vocabulary and the concept-to-concept relations defined in the annotation ontology library 204, the semantic ontology reasoning engine 203 infers the ontology concept vocabulary that corresponds to the query keywords and returns it to the query preprocessing module 202.

The query preprocessing module 202 sends the ontology concept vocabulary returned by the semantic ontology reasoning engine 203 to the traditional search module 205 and specifies that the search is to be semantic. Here, a semantic search means that, for web pages that have already been semantically annotated, string matching is performed against the semantic concepts annotated on the page rather than directly against the content of the page itself.

The traditional search module 205 performs the semantic search and sends the results to the result return interface 206, which returns them to the user.

As can be seen, the semantic search system described above matches the user's query keywords against the semantic concept vocabulary with which web pages are annotated.

In summary, existing text retrieval technology can find files that contain the query keywords but cannot identify the semantic information of those files. Existing semantic retrieval technology, in turn, no longer performs keyword retrieval, so the files it finds include too many results that do not match the user's query, and matching user query keywords against semantic concept vocabulary is not efficient enough. As a result, the search accuracy of existing retrieval technology is not high.

Summary of the Invention

In view of this, the main purpose of the embodiments of the invention is to provide a semantic ontology-based retrieval system that improves search accuracy.

Another purpose of the embodiments of the invention is to provide a semantic ontology-based retrieval method that improves search accuracy.

To achieve these purposes, the technical solution of the invention is implemented as follows:

An embodiment of the invention discloses a semantic ontology-based retrieval system, which includes:

a semantic ontology index database, used to store the semantic ontology index;

a semantic ontology search processing unit, used to obtain a text hit file list and to match the text hit file list against the semantic ontology index in the semantic ontology index database to obtain a document semantic classification table.

An embodiment of the invention also discloses a semantic ontology-based retrieval method, which includes the following steps:

A. Obtain files for which a text index has been established, and establish a semantic ontology index for the obtained files;

B. Obtain a text hit file list, and perform semantic ontology index matching on the text hit file list to obtain a document semantic classification table.

The semantic ontology-based retrieval system and method provided by the embodiments of the invention therefore have the following advantage. A semantic ontology index is first established for files that already have a text index. When a user searches, text index matching is performed on the text query information entered by the user to obtain a text hit file list, and semantic ontology index matching is then performed on that list to obtain a document semantic classification table. The text retrieval results thus carry semantic classification information, which improves search accuracy.

Brief Description of the Drawings

Fig. 1 is a structural block diagram of an existing text search engine;

Fig. 2 is a structural block diagram of an existing semantic search system;

Fig. 3 is a structural block diagram of a semantic ontology-based retrieval system according to an embodiment of the invention;

Fig. 4 is a flowchart of the semantic ontology index processing unit establishing a semantic ontology index in an embodiment of the invention;

Fig. 5 is a flowchart of the retrieval system of Fig. 3 performing a search for a user;

Fig. 6 is a schematic diagram of two resource descriptions defined in an embodiment of the invention;

Fig. 7 is a schematic diagram of the result inferred from Fig. 6;

Fig. 8 is a diagram of the relations that the annotation ontology library establishes for the semantic ontology vocabulary in an embodiment of the invention;

Fig. 9 is a diagram of the relations among the semantic ontology vocabulary of Fig. 8 after reasoning.

Detailed Description

To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Fig. 3 is a structural block diagram of a semantic ontology-based retrieval system according to an embodiment of the invention. As shown in Fig. 3, the system comprises: a search interface module 301, a document semantic classification rule engine 302, a search processing module 303, a semantic ontology reasoning engine 304, an annotation ontology library 305, an index database 306, an index processing module 307, a file database 308, and a network file crawling module 309. The search processing module 303 includes a text search processing unit 310, a semantic ontology search processing unit 311, and a sorting processing unit 312. The index database 306 includes a text index 315 and a semantic ontology index 316. The index processing module 307 includes a text index processing unit 313 and a semantic ontology index processing unit 314.

The network file crawling module 309 is mainly responsible for crawling web pages from the Internet and saving them to the file database 308. It generally uses a crawling program, such as a "web robot" or "web spider", to traverse the web: it scans websites within a given range of Internet Protocol (IP) addresses and follows links from one web page to another and from one website to another, collecting network files as it goes.

The file database 308 stores the files available for user retrieval, including audio files, video files, and text files. These files may be network files or non-network files. Each file in the file database 308 has a unique file identifier (DocID).

The index processing module 307 is mainly responsible for analyzing the files saved in the file database 308, for example extracting keywords from file content and eliminating duplicate files, and for building different types of index information for those files. The index processing module 307 includes a text index processing unit 313 and a semantic ontology index processing unit 314.

The text index processing unit 313 is a conventional unit for building a text index. It analyzes file content, extracts keywords and the file's identification information, and builds the text index. Because the conventional process of building a text index is mature prior art, it is not repeated here.

The semantic ontology index processing unit 314 is responsible for building a semantic ontology index for files that already have a text index. It first analyzes each such file to determine whether the file contains semantic annotation information. If it does, the unit extracts the relevant semantic annotation information and the file identification information and builds the semantic ontology index for that file.

The index database 306 stores the index information built by the index processing module 307, namely the text index 315 built by the text index processing unit 313 and the semantic ontology index 316 built by the semantic ontology index processing unit 314.

The search processing module 303 is responsible for handling the user's query request. By matching the user's text query information against the index information of the files, it returns the files that satisfy the query to the user in a defined order. The search processing module 303 includes a text search processing unit 310, a semantic ontology search processing unit 311, and a sorting processing unit 312.

The text search processing unit 310 matches the text query information entered by the user against the text index 315 and finds the identifiers of the text hit files, that is, the files that satisfy the user's query.

The semantic ontology search processing unit 311 matches the text hit file identifiers produced by the text search processing unit 310 against the semantic ontology index 316, classifies those identifiers semantically, and obtains the document semantic classification table.

The annotation ontology library 305 and the semantic ontology reasoning engine 304 perform semantic reasoning on the ontology concept vocabulary in the document semantic classification table produced by the semantic ontology search processing unit 311, yielding an expanded semantic ontology vocabulary. The annotation ontology library 305 stores the defined semantic ontology concept vocabulary and the relations among the concepts. The semantic ontology reasoning engine 304 defines the reasoning rules and carries out the reasoning.
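This passage does not spell out the reasoning rules, so the following is only a minimal sketch of one plausible kind of expansion: following subclass links upward so that each concept also brings in its ancestors. The `subclass_of` relations and the function name `expand_concepts` are hypothetical illustrations and are not taken from the annotation ontology library 305.

```python
# Hypothetical subclass relations (child concept -> parent concept).
# These stand in for whatever the annotation ontology library defines.
subclass_of = {
    "pop music": "music",
    "classical music": "music",
    "computer game": "game",
}


def expand_concepts(concepts, parents):
    """Return the concepts plus every ancestor reachable via parent links."""
    expanded = set(concepts)
    for concept in list(concepts):
        current = concept
        while current in parents:
            current = parents[current]
            if current in expanded:
                break  # ancestors above this point were already added
            expanded.add(current)
    return expanded


expanded_vocabulary = expand_concepts({"pop music", "computer game"}, subclass_of)
print(sorted(expanded_vocabulary))
# ['computer game', 'game', 'music', 'pop music']
```

A single-parent hierarchy keeps the sketch short. A real ontology may allow several parents and other relation types, which this sketch does not cover.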

Based on what the semantic ontology reasoning engine 304 infers, the document semantic classification rule engine 302 triggers the semantic classification rules that it defines and uses them to extend and consolidate the document semantic classification table.

The sorting processing unit 312 is responsible for ordering the final results. For the document semantic classification table obtained after text index matching, semantic ontology index matching, and semantic reasoning expansion, it computes the relevance and importance of each file and returns the files in ranked order to the search interface module 301.

The search interface module 301 handles the interaction between the system and the user. It forwards the text query information entered by the user to the search processing module 303 and returns the ranked results from the sorting processing unit 312 to the user.

The text index 315 stored in the index database 306 includes a text forward index and a text inverted index. Table 1 is the text forward index table and Table 2 is the text inverted index table:

Table 1

File ID (DocID) | Keywords
1 | paradise, music, ...
2 | application, software, ...
3 | application, ...
4 | paradise, game, ...
... | ...

Table 2

Keyword | File ID sequence (DocID)
paradise | 1, 4, ...
application | 2, 3, ...
... | ...

As the two tables show, the text forward index uses the file identifier as the key and maps each file identifier to its keywords, whereas the text inverted index uses the keyword as the key and maps each keyword to the file identifiers that contain it.
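The relationship between the two tables can be sketched in a few lines of Python. The sketch uses only the rows visible in Table 1 and leaves out the elided entries. The helper name `invert` is an assumption for illustration, not part of the described system.

```python
from collections import defaultdict

# Text forward index: DocID -> keywords (only the rows shown in Table 1).
text_forward_index = {
    1: ["paradise", "music"],
    2: ["application", "software"],
    3: ["application"],
    4: ["paradise", "game"],
}


def invert(forward_index):
    """Turn a DocID -> keys mapping into a key -> ascending DocID list mapping."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        for key in forward_index[doc_id]:
            inverted[key].append(doc_id)
    return dict(inverted)


text_inverted_index = invert(text_forward_index)
print(text_inverted_index["paradise"])     # [1, 4]
print(text_inverted_index["application"])  # [2, 3]
```

The two printed rows agree with the visible entries of Table 2.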

Similarly, the semantic ontology index 316 stored in the index database 306 includes a semantic ontology forward index and a semantic ontology inverted index. Table 3 is the semantic ontology forward index table and Table 4 is the semantic ontology inverted index table:

Table 3

File ID (DocID) | Semantic ID
1 | pop music
2 | classical music
3 | novel
4 | computer game
5 | pop music
... | ...

Table 4

Semantic ID | File ID sequence (DocID)
pop music | 1, 5, ...
classical music | 2, ...
novel | 3, ...
computer game | 4, ...
... | ...

The semantic ontology forward index uses the file identifier as the key and maps each file identifier to its semantic ID, whereas the semantic ontology inverted index uses the semantic ID as the key and maps each semantic ID to the corresponding file identifiers.
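To illustrate how the text index and the semantic ontology index work together, the sketch below takes the hit list for the keyword "paradise" from Table 2 and groups it by the semantic IDs of Table 3, giving a small document semantic classification table. The function name `classify_hits` and the choice to group files without a semantic ID under "unclassified" are assumptions made for this example.

```python
from collections import defaultdict

# Semantic ontology forward index: DocID -> semantic ID (rows shown in Table 3).
semantic_forward_index = {
    1: "pop music",
    2: "classical music",
    3: "novel",
    4: "computer game",
    5: "pop music",
}


def classify_hits(hit_doc_ids, forward_index, fallback="unclassified"):
    """Group text-hit DocIDs by semantic ID; unknown DocIDs go under `fallback`."""
    table = defaultdict(list)
    for doc_id in hit_doc_ids:
        table[forward_index.get(doc_id, fallback)].append(doc_id)
    return dict(table)


# Text hits for the keyword "paradise", taken from Table 2.
paradise_hits = [1, 4]
classification_table = classify_hits(paradise_hits, semantic_forward_index)
print(classification_table)  # {'pop music': [1], 'computer game': [4]}
```

With this grouping, the two hits for the same keyword are separated into a music result and a game result, which plain string matching cannot do.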

Fig. 4 is a flowchart of the semantic ontology index processing unit 314 establishing the semantic ontology index 316 in an embodiment of the invention. The semantic ontology index is built on top of the text index established by the text index processing unit, and the process is triggered once the text index processing unit 313 has established a text index for a file. Referring to Fig. 4, the process of establishing the semantic ontology index includes the following steps:

Step 401: the semantic ontology index processing unit 314 reads a file that has been processed by the text index processing unit 313 and therefore has a text index.

Step 402: the semantic ontology index processing unit 314 determines whether the file it has read carries semantic annotation. If it does, step 403 is executed. Otherwise, the process of establishing a semantic ontology index for this file ends.

A semantically annotated file differs from a file without semantic annotation in that it carries ontology concept mapping information. For example, suppose the file with DocID 9 is the web page at http://grids.ucs.indiana.edu/ptliupages/publications/index.html and its content mainly describes matters to note when doing research. The page can then be annotated with the concept "Research". Existing semantic annotation information is embedded in web pages in some cases as comments and in other cases as XML packages. This example gives semantic annotation information in comment form, produced with OntoMat, the text annotation tool of Stanford University:

<!--<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:daml="http://www.daml.org/2001/03/daml+oil#"
xmlns="http://annotation.semanticweb.org/iswc/iswc.daml#">
<Research rdf:about="http://grids.ucs.indiana.edu/ptliupages/publications/index.html"/>
</rdf:RDF>
-->
<title>Community Grids Publications</title>

This example indicates that the content of the web page http://grids.ucs.indiana.edu/ptliupages/publications/index.html is mainly about "Research". For web pages annotated with the OntoMat tool, the semantic annotation information is placed in a comment in the HTML head, beginning with <rdf:RDF and ending with </rdf:RDF>. Therefore, when the semantic ontology index processing unit 314 detects annotation information that begins with <rdf:RDF and ends with </rdf:RDF>, it determines that the web page file has been semantically annotated.
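The check in step 402 can be sketched as a search for an HTML comment that wraps an `<rdf:RDF ...>` block. This is a minimal sketch using a regular expression from Python's standard library, not the detection logic of the actual unit 314. The function name `find_semantic_annotation` and the well-formed sample page are assumptions made for illustration.

```python
import re

# Matches an HTML comment that wraps an <rdf:RDF ...> ... </rdf:RDF> block.
RDF_COMMENT = re.compile(r"<!--\s*(<rdf:RDF\b.*?</rdf:RDF>)\s*-->", re.DOTALL)


def find_semantic_annotation(html):
    """Return the embedded RDF annotation text, or None if the page has none."""
    match = RDF_COMMENT.search(html)
    return match.group(1) if match else None


# Sample page modeled on the example annotation (assumed to be well-formed).
annotated_page = """<html><head>
<!--<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns="http://annotation.semanticweb.org/iswc/iswc.daml#">
<Research rdf:about="http://grids.ucs.indiana.edu/ptliupages/publications/index.html"/>
</rdf:RDF>
-->
<title>Community Grids Publications</title></head></html>"""

plain_page = "<html><head><title>No annotation</title></head></html>"

print(find_semantic_annotation(annotated_page) is not None)  # True
print(find_semantic_annotation(plain_page) is not None)      # False
```

A regular expression is enough here because the test only asks whether the marker pair is present. Extracting concepts from the annotation is a separate step.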

Step 403: the semantic ontology index processing unit 314 reads the semantic annotation information of the file.

In this embodiment, the semantic ontology index processing unit 314 reads the semantic annotation information of the web page whose DocID is 9, that is, it reads the comment in the HTML head. Table 5 shows the format of the extracted semantic annotation information:

Table 5

File ID (DocID) | Semantic annotation information
9 | <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:daml="http://www.daml.org/2001/03/daml+oil#" xmlns="http://annotation.semanticweb.org/iswc/iswc.daml#"><Research rdf:about="http://grids.ucs.indiana.edu/ptliupages/publications/index.html"/></rdf:RDF>
... | ...

Step 404: the semantic ontology index processing unit 314 extracts the semantic ontology concept vocabulary from the semantic annotation information it has read and builds the semantic ontology index.

In this embodiment, the semantic ontology index processing unit 314 calls the relevant RDF document processing application programming interface (API), extracts the semantic ontology concept "Research" from the semantic annotation information, builds the semantic ontology forward index for web page 9, and at the same time converts it into a semantic ontology inverted index. Table 6 is the semantic ontology forward index of web page 9 and Table 7 is its semantic ontology inverted index:

Table 6

File ID (DocID) | Semantic ID
9 | Research

Table 7

Semantic ID | File ID (DocID)
Research | 9

Step 405: the semantic ontology index processing unit 314 saves the semantic ontology forward index and the semantic ontology inverted index to the index database 306, where they form the content of the semantic ontology index 316.
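Steps 403 to 405 can be sketched as follows. The text only says that a "relevant RDF document processing API" is called, so this sketch substitutes Python's standard `xml.etree.ElementTree` parser and assumes the annotation is well-formed XML. The function names `extract_concepts` and `index_document` and the in-memory dictionaries that stand in for the index database 306 are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Annotation for DocID 9, written as well-formed XML (an assumption).
annotation = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:daml="http://www.daml.org/2001/03/daml+oil#"
 xmlns="http://annotation.semanticweb.org/iswc/iswc.daml#">
<Research rdf:about="http://grids.ucs.indiana.edu/ptliupages/publications/index.html"/>
</rdf:RDF>"""


def extract_concepts(rdf_text):
    """Return the local names of child elements that carry an rdf:about attribute."""
    root = ET.fromstring(rdf_text)
    concepts = []
    for child in root:
        if f"{{{RDF_NS}}}about" in child.attrib:
            # The tag looks like "{namespace}Research"; keep only the local name.
            concepts.append(child.tag.rsplit("}", 1)[-1])
    return concepts


def index_document(doc_id, rdf_text, forward, inverted):
    """Add one document's concepts to both the forward and the inverted index."""
    for concept in extract_concepts(rdf_text):
        forward.setdefault(doc_id, []).append(concept)
        inverted.setdefault(concept, []).append(doc_id)


forward_index, inverted_index = {}, {}
index_document(9, annotation, forward_index, inverted_index)
print(forward_index)   # {9: ['Research']}
print(inverted_index)  # {'Research': [9]}
```

The two printed dictionaries correspond to Table 6 and Table 7. Updating both indexes in the same call mirrors the statement that the forward index is converted into the inverted index at the same time.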

Files pass through the text index processing unit 313 before a semantic ontology index is built because, at search time, the system first finds the files that match the text query information entered by the user and only then performs semantic ontology index matching on those files. Processing files through the text index processing unit 313 first guarantees that every file that has a text index and carries semantic information also has corresponding entries in the semantic ontology index 316. It also avoids the situation that would arise if files were read directly from the file database 308 for semantic ontology indexing, namely files that have a semantic ontology index but no text index.
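The ordering constraint described above amounts to a simple invariant: the set of DocIDs in the semantic ontology index is a subset of the DocIDs in the text index. The following minimal sketch makes that explicit. The function name `build_semantic_index` and the sample data, including DocID 7, are hypothetical.

```python
def build_semantic_index(text_indexed_doc_ids, annotations):
    """Index only annotated files that already have a text index."""
    return {
        doc_id: annotations[doc_id]
        for doc_id in text_indexed_doc_ids
        if doc_id in annotations
    }


text_indexed = [1, 2, 3, 4]
# DocID 7 is a hypothetical annotated file that has no text index yet.
annotations = {1: "pop music", 4: "computer game", 7: "novel"}

semantic_index = build_semantic_index(text_indexed, annotations)
print(semantic_index)  # {1: 'pop music', 4: 'computer game'}

# Every semantically indexed file also has a text index.
assert set(semantic_index) <= set(text_indexed)
```

Because the loop runs over the text-indexed files rather than over the annotations, DocID 7 cannot enter the semantic index.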

图5是图3所示的本发明实施例检索系统为用户执行搜索过程的流程图，如图5所示，包括以下步骤：Fig. 5 is a flow chart of the retrieval system of the embodiment of the present invention shown in Fig. 3 performing a search process for a user, as shown in Fig. 5, comprising the following steps:

步骤501，搜索接口模块301获取用户输入的文本查询信息，并将其发送给搜索处理模块303。本实施例中假设用户输入的查询信息为“天堂”。Step 501 , the search interface module 301 obtains the text query information input by the user, and sends it to the search processing module 303 . In this embodiment, it is assumed that the query information input by the user is "paradise".

Step 502: The search processing module 303 receives the text query information sent by the search interface module 301, performs word segmentation preprocessing on it, and then sends the resulting query keywords to the text search processing unit 310.

The segmentation process is described in the existing literature on search engines and is not repeated here. In this embodiment, segmenting the text query information "paradise" yields the single keyword "paradise".

Step 503: The text search processing unit 310 matches the segmented query keywords against the text inverted index and sends the resulting text hit file list to the semantic ontology search processing unit 311.

After receiving the query keywords, the text search processing unit 310 sends a request to read the text inverted index to the index database 306, and the index database 306 returns the text inverted index from the text index 315. The text search processing unit 310 matches the user's query keyword "paradise" against the text inverted index to obtain the identifiers of the web page files that contain the keyword, that is, the text hit file list, and sends this list to the semantic ontology search processing unit 311 for processing.

For simplicity, this embodiment assumes that only 20 files have been indexed. Table 8 shows the text inverted index returned by the index database 306 to the text search processing unit 310:

Table 8

Keyword        File ID sequence
...            ...
application    01011100100011011010
paradise       11011011111110001011
...            ...

In Table 8, each row holds a keyword and the file ID sequence of the files in which that keyword appears. The sequence has 20 binary digits, equal to the total number of indexed files. Each bit represents one file, and the position of the bit equals the file's identification number: the first bit represents file 1, the second bit represents file 2, and so on. A bit of 0 means that the keyword does not appear in the corresponding file; a bit of 1 means that it does.

The text search processing unit 310 matches the user's query keyword "paradise" to the "paradise" entry in Table 8, takes the file ID sequence that follows it, 11011011111110001011, as the text hit file list, and sends it to the semantic ontology search processing unit 311. The files whose bits are 1 in the text hit file list are the hit files.
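The bit-position convention described above can be illustrated with a short sketch. It assumes the file ID sequence is held as a string of "0" and "1" characters (the description does not specify a storage format), and `hit_file_ids` is a hypothetical helper name.

```python
def hit_file_ids(bitmap):
    """Return the 1-based file IDs whose bit is set in a file ID sequence.

    The string representation of the sequence is an assumption; the first
    character corresponds to file 1, the second to file 2, and so on.
    """
    return [position for position, bit in enumerate(bitmap, start=1) if bit == "1"]


PARADISE = "11011011111110001011"  # "paradise" row of Table 8
print(hit_file_ids(PARADISE))
# [1, 2, 4, 5, 7, 8, 9, 10, 11, 12, 13, 17, 19, 20]
```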

Similarly, if the user's text query information is "paradise application", segmentation yields the keywords "paradise" and "application". Each keyword is matched against the text inverted index, and a bitwise AND of the two file ID sequences gives 01011000100010001010, in which a bit of 1 means that both "paradise" and "application" appear in the corresponding file.
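The multi-keyword case reduces to a bitwise AND of the two rows of Table 8. The following sketch assumes string-encoded sequences and uses a hypothetical helper `bitmap_and`; it is one possible way to express the operation, not the patent's code.

```python
def bitmap_and(first, second):
    """Bitwise AND of two equal-length file ID sequences given as bit strings."""
    if len(first) != len(second):
        raise ValueError("file ID sequences must cover the same number of files")
    return "".join("1" if a == "1" and b == "1" else "0" for a, b in zip(first, second))


APPLICATION = "01011100100011011010"  # "application" row of Table 8
PARADISE = "11011011111110001011"     # "paradise" row of Table 8

print(bitmap_and(PARADISE, APPLICATION))  # 01011000100010001010
```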

Step 504: After obtaining the text hit file list, the semantic ontology search processing unit 311 first decides whether to perform semantic ontology inverted index matching.

The decision is based on the number of text hit files. If the number of hit files is greater than a threshold, semantic ontology inverted index matching is performed and step 505 is executed; otherwise, semantic ontology forward index matching is performed and step 506 is executed. The threshold may be stored in the semantic ontology search processing unit 311 as a predefined value, or it may be a value that the retrieval system adjusts dynamically according to statistics or other conditions.

After receiving the text hit file list 11011011111110001011, the semantic ontology search processing unit 311 counts the 1 bits in the sequence and obtains 14, that is, 14 text hit files. If the threshold is 10, then 14 is greater than 10 and semantic ontology inverted index matching is performed. If the threshold is 15, then 14 is less than 15 and semantic ontology forward index matching is performed.
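The decision in step 504 can be sketched as a count of set bits compared against the threshold. The function name `choose_matching_strategy` and the string return values are illustrative assumptions; the thresholds 10 and 15 are the ones used in the example above.

```python
def choose_matching_strategy(hit_bitmap, threshold):
    """Pick the matching path from the number of text hit files.

    Returns "inverted" (step 505) when the hit count is greater than the
    threshold, otherwise "forward" (step 506).
    """
    hit_count = hit_bitmap.count("1")
    return "inverted" if hit_count > threshold else "forward"


HITS = "11011011111110001011"
print(HITS.count("1"))                     # 14
print(choose_matching_strategy(HITS, 10))  # inverted
print(choose_matching_strategy(HITS, 15))  # forward
```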

Step 505: The semantic ontology search processing unit 311 performs semantic ontology inverted index matching on the files in the text hit file list to obtain a document semantic classification table.

First, the semantic ontology search processing unit 311 sends a request to read the semantic ontology inverted index to the index database 306, and the index database 306 returns the semantic ontology inverted index. The semantic ontology search processing unit 311 reads each record of the semantic ontology inverted index in turn and intersects the file ID sequence in the record with the text hit file list, that is, it performs a bitwise AND on the two binary sequences and overwrites the file ID sequence in the semantic ontology inverted index table with the result. Finally, it filters out the records whose intersection is empty, so that the original semantic ontology inverted index table becomes the document semantic classification table. Step 507 is then executed.

Table 9 shows the semantic ontology inverted index returned by the index database 306 to the semantic ontology search processing unit 311 in this embodiment:

Table 9

Semantic ID        File ID sequence
Pop music          01011010110001100000
Computer games     10100101000100001011
Classical music    00010000001011000001
Novel              10000000000000000110
Sports star        00000000000000010000

Table 9 assumes that the 20 indexed files involve only five semantic ontology concepts, that is, five semantic IDs occur across all files. The file ID sequence after each semantic ID indicates in which of the 20 files that ontology concept occurs. The representation is the same as for the file ID sequences in the text inverted index: each bit represents one file, and the position of the bit equals the file's identification number. A bit of 0 means that the corresponding file is not annotated with the ontology concept; a bit of 1 means that it is. For example, the file ID sequence of pop music is 01011010110001100000, which means that files 2, 4, 5, 7, 9, 10, 14, and 15 are annotated with the pop music concept, reflecting that their content relates to pop music.

The semantic ontology search processing unit 311 reads each file ID sequence in the semantic ontology inverted index shown in Table 9, performs a bitwise AND with the text hit file list 11011011111110001011, and stores the result in the corresponding position of Table 9, overwriting the original file ID sequence. Finally, it filters out the semantic ID entries whose intersection is empty, that is, whose AND result is all zeros, which produces the document semantic classification table shown in Table 10:

Table 10

Semantic ID        File ID sequence
Pop music          01011010110000000000
Computer games     10000001000100001011
Classical music    00010000001010000001
Novel              10000000000000000010

In this way, the text hit file list 11011011111110001011 is classified semantically.
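The inverted index matching of step 505 can be reproduced with a few lines of code. This is a sketch under the assumption that the sequences are stored as bit strings; the name `classify_by_inverted_index` is hypothetical, and the data are the rows of Table 9 and the hit list from step 503.

```python
SEMANTIC_INVERTED_INDEX = {  # rows of Table 9
    "Pop music": "01011010110001100000",
    "Computer games": "10100101000100001011",
    "Classical music": "00010000001011000001",
    "Novel": "10000000000000000110",
    "Sports star": "00000000000000010000",
}
HITS = "11011011111110001011"  # text hit file list


def classify_by_inverted_index(inverted_index, hits):
    """AND every record with the hit list and drop records that become empty."""
    width = len(hits)
    hit_mask = int(hits, 2)
    table = {}
    for semantic_id, bitmap in inverted_index.items():
        masked = format(int(bitmap, 2) & hit_mask, "0{}b".format(width))
        if "1" in masked:  # keep only non-empty intersections
            table[semantic_id] = masked
    return table


for semantic_id, bitmap in classify_by_inverted_index(SEMANTIC_INVERTED_INDEX, HITS).items():
    print(semantic_id, bitmap)
# Pop music 01011010110000000000
# Computer games 10000001000100001011
# Classical music 00010000001010000001
# Novel 10000000000000000010
```

Every record of the inverted index is visited once regardless of how many files were hit, which is the full-scan cost the description attributes to this path.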

Step 506: The semantic ontology search processing unit 311 performs semantic ontology forward index matching on the files in the text hit file list to obtain a document semantic classification table.

First, the semantic ontology search processing unit 311 sends a request to read the semantic ontology forward index to the index database 306. Table 11 shows the semantic ontology forward index that the index database 306 returns in response:

Table 11

File ID    Semantic ID
1          Computer games, Novel
2          Pop music
3          Computer games
4          Pop music, Classical music
5          Pop music
6          Computer games
7          Pop music
8          Computer games
9          Pop music
10         Pop music
11         Classical music
12         Computer games
13         Classical music
14         Pop music, Classical music
15         Pop music
16         Sports star
17         Computer games
18         Novel
19         Computer games, Novel
20         Computer games, Classical music

The semantic ontology search processing unit 311 converts the text hit file list 11011011111110001011 into the individual file IDs 1, 2, 4, 5, 7, 8, 9, 10, 11, 12, 13, 17, 19, and 20, and uses each file ID as a query condition to look up the corresponding record in the semantic ontology forward index. The result is a semantic ontology forward index that contains only these file IDs, as shown in Table 12:

Table 12

File ID    Semantic ID
1          Computer games, Novel
2          Pop music
4          Pop music, Classical music
5          Pop music
7          Pop music
8          Computer games
9          Pop music
10         Pop music
11         Classical music
12         Computer games
13         Classical music
17         Computer games
19         Computer games, Novel
20         Computer games, Classical music

Finally, each semantic ontology concept that appears in Table 12 is taken as a key, and the file IDs in which that key occurs are collected. This converts the forward index into an inverted index and produces the document semantic classification table shown in Table 13:

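Table 13

Semantic ID        File ID sequence
Pop music          01011010110000000000
Computer games     10000001000100001011
Classical music    00010000001010000001
Novel              10000000000000000010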

Step 507 is then executed.

Matching is split into semantic ontology inverted index matching and semantic ontology forward index matching for reasons of efficiency. Inverted index matching must compare the text hit file list with every record in the semantic ontology inverted index and compute an intersection for each one. This full scan of the semantic ontology inverted index is computationally expensive. When the number of text hit files is small, forward index matching therefore reduces the amount of computation. Whichever method is used, the resulting document semantic classification table is the same, that is, Table 13 is identical to Table 10.
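The forward index path of step 506 can be sketched in the same style to check that it yields the classification given in Table 10. The function name `classify_by_forward_index` and the dictionary layout are assumptions for illustration; the data are the rows of Table 11 and the hit list from step 503.

```python
SEMANTIC_FORWARD_INDEX = {  # rows of Table 11
    1: ["Computer games", "Novel"], 2: ["Pop music"], 3: ["Computer games"],
    4: ["Pop music", "Classical music"], 5: ["Pop music"], 6: ["Computer games"],
    7: ["Pop music"], 8: ["Computer games"], 9: ["Pop music"], 10: ["Pop music"],
    11: ["Classical music"], 12: ["Computer games"], 13: ["Classical music"],
    14: ["Pop music", "Classical music"], 15: ["Pop music"], 16: ["Sports star"],
    17: ["Computer games"], 18: ["Novel"], 19: ["Computer games", "Novel"],
    20: ["Computer games", "Classical music"],
}
HITS = "11011011111110001011"  # text hit file list


def classify_by_forward_index(forward_index, hits):
    """Look up only the hit files, then regroup the result by semantic ID."""
    width = len(hits)
    members = {}
    for file_id, bit in enumerate(hits, start=1):
        if bit != "1":
            continue  # files that were not hit are never looked up
        for semantic_id in forward_index.get(file_id, []):
            members.setdefault(semantic_id, set()).add(file_id)
    # Re-encode each group of file IDs as a file ID sequence.
    return {
        semantic_id: "".join("1" if i in ids else "0" for i in range(1, width + 1))
        for semantic_id, ids in members.items()
    }


table = classify_by_forward_index(SEMANTIC_FORWARD_INDEX, HITS)
print(table["Pop music"])  # 01011010110000000000
print(table["Novel"])      # 10000000000000000010
```

The work here grows with the number of hit files rather than with the number of semantic IDs, which matches the efficiency argument given for choosing this path when few files are hit.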

Step 507: The semantic ontology search processing unit 311 uses the semantic ontology reasoning engine 304, the annotation ontology library 305, and the document semantic classification rule engine 302 to reason over the semantic vocabulary in the document semantic classification table, extends the table according to the reasoning results, and sends the extended document semantic classification table to the sorting processing unit 312.

After completing the semantic ontology index matching, the semantic ontology search processing unit 311 first sends the semantic ontology concept vocabulary in the document semantic classification table to the semantic ontology reasoning engine 304 for semantic reasoning. Based on the semantic ontology concepts and relationships defined in the annotation ontology library 305 and on its own inference rules, the semantic ontology reasoning engine 304 generates an RDF document that represents the relationships between the semantic ontology vocabulary and returns it to the semantic ontology search processing unit 311. The semantic ontology search processing unit 311 then matches this RDF document against the trigger conditions of the semantic classification rules defined in the document semantic classification rule engine 302, determines which rules need to be triggered, and triggers them to produce the document semantic classification table extended by reasoning. Finally, it sends the extended document semantic classification table to the sorting processing unit 312.

In this embodiment, the semantic ontology search processing unit 311 sends the four semantic ontology concept words in Table 10 or Table 13 (pop music, computer games, classical music, and novel) to the semantic ontology reasoning engine 304 for reasoning. The semantic ontology reasoning engine 304 reasons over the RDF triple representation of resources according to defined inference rules. An RDF triple has the form (subject, predicate, object). For example, two resource descriptions are defined as shown in Fig. 6: Shenzhen 601 belongs to Guangdong 602, and Guangdong 602 belongs to China 603. An inference rule is also defined: (?a, belongs to, ?b), (?b, belongs to, ?c) → (?a, belongs to, ?c). The rule means that if a belongs to b and b belongs to c, then a belongs to c. From the relationships shown in Fig. 6, the result shown in Fig. 7 can therefore be inferred: Shenzhen 601 belongs to China 603.
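The transitive rule in this example can be sketched as a small fixed-point computation over triples held as Python tuples. This is only an illustration of the rule itself: the function name `infer_transitive` is hypothetical, and a real reasoning engine would work on RDF data rather than plain tuples.

```python
def infer_transitive(triples, predicate):
    """Apply (?a, p, ?b), (?b, p, ?c) -> (?a, p, ?c) until nothing new is derived."""
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        for a, p1, b in list(closure):
            for b2, p2, c in list(closure):
                derived = (a, predicate, c)
                if p1 == predicate and p2 == predicate and b == b2 and derived not in closure:
                    closure.add(derived)
                    changed = True
    return closure


facts = {
    ("Shenzhen", "belongs to", "Guangdong"),
    ("Guangdong", "belongs to", "China"),
}
inferred = infer_transitive(facts, "belongs to") - facts
print(inferred)  # {('Shenzhen', 'belongs to', 'China')}
```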

Assume that the annotation ontology library 305 defines the relationships shown in Fig. 8 for the four ontology concepts of this embodiment: the parent class of pop music 801 is popular music 802; the parent class of both popular music 802 and classical music 803 is music 804; the parent class of novel 805 is literature 806; and the parent class of computer games 807 is games 808. After the inference rules are applied, the RDF relationships of the four ontology concepts are as shown in Fig. 9: the parent class of both pop music 801 and classical music 803 is music 804; the parent class of novel 805 is literature 806; and the parent class of computer games 807 is games 808. The RDF triples are output as follows:

(pop music, parent class, music)

(classical music, parent class, music)

(novel, parent class, literature)

(computer games, parent class, games)

The document semantic classification rule engine 302 defines the following semantic classification rule: if multiple triples share a common object and their predicate is "parent class", a new document class is added to the document semantic classification table. The class name is the name of that object, and its file ID sequence is the union, that is, the bitwise OR, of the file ID sequences of the subjects of those triples. Table 14 shows this semantic classification rule:

Table 14

Trigger condition: Multiple triples (?X1, parent class, ?Y), (?X2, parent class, ?Y), ... exist.
Action: Add a record to the document semantic classification table. The semantic ID of the record is ?Y, and its file ID sequence is the union of the file ID sequences of ?X1, ?X2, ...

...    ...

Semantic reasoning, followed by extension and integration according to the semantic classification rules, yields the extended document semantic classification table shown in Table 15:

Table 15

Semantic ID        File ID sequence
Pop music          01011010110000000000
Computer games     10000001000100001011
Classical music    00010000001010000001
Novel              10000000000000000010
Music              01011010111010000001
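The classification rule of Table 14 can be sketched as a grouping of triples by their shared object followed by a bitwise OR. The function name `extend_classification` is hypothetical, triples are modeled as plain tuples, and "multiple" is read as "two or more"; all three are assumptions made for illustration. The input is the classification table of Table 10 and the four inferred triples.

```python
from collections import defaultdict

CLASSIFICATION = {  # document semantic classification table (Table 10)
    "Pop music": "01011010110000000000",
    "Computer games": "10000001000100001011",
    "Classical music": "00010000001010000001",
    "Novel": "10000000000000000010",
}
TRIPLES = [  # output of the reasoning step
    ("Pop music", "parent class", "Music"),
    ("Classical music", "parent class", "Music"),
    ("Novel", "parent class", "Literature"),
    ("Computer games", "parent class", "Games"),
]


def extend_classification(table, triples):
    """Add one class per parent shared by two or more classified subjects."""
    subjects_by_parent = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate == "parent class" and subject in table:
            subjects_by_parent[obj].append(subject)

    extended = dict(table)
    for parent, subjects in subjects_by_parent.items():
        if len(subjects) < 2:
            continue  # the rule needs multiple triples with a common object
        width = len(table[subjects[0]])
        union = 0
        for subject in subjects:
            union |= int(table[subject], 2)  # bitwise OR = union of file sets
        extended[parent] = format(union, "0{}b".format(width))
    return extended


extended_table = extend_classification(CLASSIFICATION, TRIPLES)
print(extended_table["Music"])  # 01011010111010000001
```

Literature and Games each have only one classified child in this example, so the rule does not fire for them, which is consistent with Table 15 gaining only the Music row.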

Step 508: The sorting processing unit 312 calculates the relevance and importance of the files in the document semantic classification table obtained after semantic reasoning, sorts the files according to the results, and sends the sorted results together with the document semantic classification information to the search interface module 301.

Step 509: The search interface module 301 returns the received sorting results and semantic classification information to the user as the search results.

The above describes only preferred embodiments of the present invention and is not intended to limit the scope of protection of the present invention.

Claims

1. A retrieval system based on semantic ontology, characterized in that the system comprises:

a semantic ontology index database, configured to store a semantic ontology index; and

a semantic ontology search processing unit, configured to obtain a text hit file list and to match the text hit file list against the semantic ontology index in the semantic ontology index database to obtain a document semantic classification table.

2. The system according to claim 1, characterized in that the system further comprises a semantic ontology index processing unit, configured to obtain files for which text indexes have been established and to establish semantic ontology indexes for the obtained files.

3. The system of claim 2, further comprising:

a text index processing unit, configured to establish a text index for a file;

a text index database, configured to store text indexes; and

a text search processing unit, configured to match the user's text query information against the text index in the text index database to obtain the text hit file list.

4. The system of claim 1, 2 or 3, further comprising:

a semantic ontology reasoning engine, configured to perform semantic reasoning on the semantic ontology vocabulary set in the document semantic classification table, according to the semantic ontology vocabulary set and the relationships between semantic ontology vocabulary stored in an annotation ontology library, to obtain an extended semantic ontology vocabulary set;

the annotation ontology library, configured to store the semantic ontology vocabulary set and the relationships between semantic ontology vocabulary; and

a document semantic classification rule engine, configured to store semantic classification rules, to trigger the corresponding semantic classification rules according to the extended semantic ontology vocabulary set derived by the semantic ontology reasoning engine, and to extend and integrate the document semantic classification table to obtain an extended document semantic classification table.

5. The system according to claim 4, further comprising a sorting processing unit, configured to sort the files in the extended document semantic classification table.

6. The system according to claim 5, further comprising a search interface module, configured to send the user's text query information to the text search processing unit and to feed the search results back to the user.

7. The system according to claim 3, wherein the system further comprises a file database, configured to store the files that the semantic ontology index processing unit uses to establish the semantic ontology index and that the text index processing unit uses to establish the text index.

8. The system according to claim 7, further comprising a network file capture module, configured to capture network files from the Internet and store them in the file database.

9. The system according to claim 3, wherein the text index includes a text forward index and a text inverted index; and the semantic ontology index includes a semantic ontology forward index and a semantic ontology inverted index.

10. The system according to claim 1, wherein the semantic ontology search processing unit performs semantic ontology forward index matching processing or semantic ontology inverted index matching processing on the text hit file list.

11. The system according to claim 10, wherein the semantic ontology search processing unit performs semantic ontology inverted index matching processing when the number of text hit files in the text hit file list is greater than a threshold, and otherwise performs semantic ontology forward index matching processing.

12. The system according to claim 11, wherein the threshold is a predefined fixed value or a dynamically adjustable value.

13. The system according to claim 3, wherein the text search processing unit and the semantic ontology search processing unit are integrated in a search processing module; the text index processing unit and the semantic ontology index processing unit are integrated in an index processing module; and the text index database and the semantic ontology index database are integrated in an index database.

14. The system according to claim 5, wherein the text search processing unit, semantic ontology search processing unit and sorting processing unit are integrated into one search processing module.

15. A search method based on semantic ontology, characterized in that the method comprises the following steps:

A. Obtain the files with established text indexes, and establish semantic ontology indexes for the obtained files;

B. Obtain a list of text hit files, perform semantic ontology index matching processing on the list of text hit files, and obtain a document semantic classification table.

16. The method of claim 15, wherein:

before step A, the method further comprises a step of establishing a text index for the files in a file database; and

before step B, the method further comprises a step of performing text index matching processing on the user's text query information to obtain the text hit file list.

17. The method according to claim 15 or 16, further comprising the steps of:

C. Perform semantic reasoning on the semantic ontology vocabulary set in the document semantic classification table to obtain an extended semantic ontology vocabulary set;

D. According to the deduced extended semantic ontology vocabulary set, perform an extended integration operation on the document semantic classification table to obtain an extended document semantic classification table.

18. The method according to claim 17, further comprising: a step of sorting the files in the extended document semantic classification table.

19. The method according to claim 15, characterized in that establishing the semantic ontology index in step A comprises establishing a semantic ontology forward index and a semantic ontology inverted index; and performing semantic ontology index matching processing on the text hit file list in step B comprises performing semantic ontology inverted index matching processing or semantic ontology forward index matching processing.

20. The method according to claim 15, further comprising, before step B, a step of judging whether semantic ontology inverted index matching processing or semantic ontology forward index matching processing is to be performed on the text hit file list in step B.

21. The method according to claim 20, characterized in that the judging step is: when the number of text hit files in the text hit file list is greater than a threshold, semantic ontology inverted index matching processing is performed in step B; otherwise, semantic ontology forward index matching processing is performed in step B.

22. The method according to claim 21, wherein the threshold value is a predefined fixed value or a dynamically adjustable value.

23. The method according to claim 16, wherein said establishing a text index comprises establishing a text forward index and a text inverted index; and said performing text index matching processing on the user's text query information comprises performing text inverted index matching processing or text forward index matching processing.

24. The method according to claim 16, further comprising: a step of establishing a file database before said establishing a text index.