WO2010089248A1 - Method and system for semantic searching - Google Patents
- Publication number
- WO2010089248A1 (application PCT/EP2010/051055)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- semantic
- topic
- annotations
- associating
- key
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
Definitions
- the present invention relates to the field of semantic searching, and more particularly to the field of processing source documents for semantic search.
- Typical search engines like Google and Yahoo use a keyword-based approach for information search and retrieval, where queries consist of keywords, and search results are represented as a ranked, flat list of documents, with textual sections containing spans of text with mentions of keywords. Incorporating semantic analysis based on different knowledge representation sources (such as ontologies, topic maps, semantic nets ... etc.) and text analytics can improve search results over simple keyword-based search.
- Some search engines such as "OmniFind" allow for the inclusion of semantic information in the indexing phase through integrating Text Analytics Engines via UIMA (Unstructured Information Management Architecture, as described in for example: http://domino.research.ibm.com/comm/research_projects.nsf/pages/uima.index.html ).
- Each bubble is a short string of the value of the facet, and a count of how many documents this bubble contains.
- This visualization lacks discovery capabilities, because each bubble is shown as a black box that gives no clue about which concepts and relations its documents contain (i.e. it does not show how documents in the bubble are related to other facets). The user must therefore go inside the bubble and start reading its documents to find out whether it holds the information he needs.
- KartOO http://www.kartoo.com/: a term-based visualization of search results that allows the user to visualize web pages-terms relationships. It depicts a map of links and high ranked terms, and when the user hovers over a web page, it highlights terms that occurred in this page; and when the user hovers over a term, it highlights web pages where this term occurred. When the user clicks on a term, this term is added to the query to refine it, thus narrowing down the search.
- This visualization helps a user understand term co-occurrences, and to get a grasp of what a web page is all about without the need to read it.
- KartOO doesn't use any semantic information, and relies just on terms mentioned in web pages. Also, it does not show relations between terms; except the "co-occurrence" relation.
- Grokker http://www.grokker.com/: views search results categorized based on their Semantic Web metadata. It visualizes categorization in two forms: a) "Outline view", which shows categories in a hierarchy tree; b) "Map view", which shows categories as a cluster map. When the user clicks on a category, search results belonging to this category are shown on the right side as a classical flat ranked list of hits, with textual summarization. The user can also filter content. Organizing search results into a tree structure is helpful to the user, allowing him/her to navigate through the tree hierarchy until reaching relevant information, but semantic relations are richer than just a tree.
- Aduna AutoFocus creates an index for a source (a folder for example) to enable multi-faceted searching and discovery.
- A user starts the query with a specific value of a facet (for example, documents containing a specific keyword); documents matching this facet value are shown as a bubble; when this bubble is selected, the "Navigation" view on the left is updated to show how documents in this bubble can be categorized using other facets (suggested keywords found in documents belonging to this bubble, or types of documents found in this bubble). This allows the user to refine the query by adding other bubbles.
- This visualization allows both narrowing down of retrieved documents and expansion to include other documents using static metadata associated with documents but no focus on semantic concepts.
- WO 2006/116273 A2 describes a system for categorizing terms, phrases, documents and/or clustering term co-occurrence with respect to a taxonomy. It provides an automated means for assigning objects, such as websites, to appropriate categories of a taxonomy.
- WO 2006/128123 A2 describes a search engine that uses Natural Language Processing (NLP)
- WO 2004/075466 A2 describes an integrated implementation framework and resulting medium for knowledge retrieval, management, delivery and presentation.
- the framework is based on two servers that work together to provide context and time-sensitive semantic information retrieval services.
- WO 2007/059287 A1 describes a system for searching for information in a data set by syntactically indexing and performing syntactic searching of data sets using relationship queries in order to improve the search result accuracy.
- Figure 1 is a block diagram showing components of a system common to certain embodiments described hereafter;
- Figure 2 shows details of the complementary indices 150 in accordance with a first embodiment;
- Figure 3 shows a flowchart describing certain steps implementing the first embodiment
- Figure 4 shows details of the complementary indices 150 in accordance with a second embodiment
- Figure 5 shows details of the complementary indices 150 in accordance with a third embodiment
- Figure 6 shows a block diagram of a system in accordance with a fourth embodiment
- Figure 7 shows a flowchart for the modified parsing & annotation linking processes according to the fourth embodiment
- Figure 8 shows a flowchart for the behaviour of the modified Topic ID based annotation linker according to the fourth embodiment
- Figure 9 shows a subset of the information provided in a Topic Map representation of a semantic model 103
- Figure 10 shows the mapped Semantic Model 165 from the previous Topic Map shown in figure 9
- Figure 11 shows an example for the annotations created by the domain specific annotator 1222.
- The method comprises the steps of: associating type annotations with elements of one or more source artefacts, where said type annotations indicate adherence to one of a set of predefined types; for each element thus annotated, identifying semantic keys corresponding to the annotations associated therewith in a semantic model, said semantic model comprising a plurality of said semantic keys each being associated with a plurality of semantic descriptors; and associating the semantic descriptors associated with each semantic key thus identified as corresponding to the annotations associated with a respective element, with that element.
- FIG. 1 is a block diagram showing components of a system common to certain embodiments described hereafter.
- the search engine 100 comprises a crawler 110, which retrieves source artefacts 101 such as documents, from a specific source (e.g. Web content, Database, File system ... etc.).
- the search engine 100 further comprises a Parser 120, which may perform various Natural Language Processing (NLP) tasks on the artefacts retrieved by the crawler 110.
- Common NLP tasks carried out by the parser may include language identification, tokenization, stemming, part-of-speech tagging, and normalization.
- the search engine 100 further comprises an Indexer 130, which stores the data created by the parser 120 in the main index 140 that facilitates fast and accurate information retrieval.
- a Search Runtime (not shown), which provides an interface to the Main index 140 that is used by the Search application to issue search queries and retrieve the relevant documents for that query.
- the Search runtime should provide facade API 180 for search applications to submit queries and retrieve the results (Search Engine APIs) from the main index 140 and the complementary indices 150.
- Traditional searches are augmented and improved by leveraging semantic data to disambiguate semantic search queries and web text in order to increase relevancy of results. According to the present embodiment, this is done by enabling different semantically aware processing components to be plugged into the parser 120 to perform semantic analysis on the crawled artefacts 101.
- the analysis results are then written to one or more indices using a mapped semantic model 165 derived from a
- The parser component uses UIMA (Unstructured Information Management Architecture, http://sourceforge.net/projects/uima-framework) to provide the infrastructure needed to manage and compose multiple analysis components.
- UIMA refers to an analysis component as an annotator; a software component that implements the UIMA annotator interface to produce and record annotations over regions of an artefact (a text document in this case).
- An annotation is the association of a metadata, such as a label, with a region of text. For example, the label "City" associated with a region of text "Cairo" constitutes an annotation.
- an annotation is represented as a special type in a UIMA type system.
- Annotations are recorded by an annotator in a data structure named the CAS (Common Analysis Structure).
- a UIMA Type system is a collection of "types".
- a type is a specification of an object in the CAS used to store the results of analysis.
- Types usually contain Features, which are attributes, or properties of the type.
- a UIMA type system is used to represent the semantic model, where a UIMA type represents a topic type which is the class of topics that a particular topic belongs to (e.g. Country, City, Organization ... etc), with UIMA type features as attributes of the topic type.
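- The notion of an annotation as a typed label over a region of text can be sketched as follows. This is a plain Python stand-in and not the UIMA API: the `Annotation` class, the `annotate` helper, the sample sentence and the list playing the role of the CAS are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A label attached to a region of text, plus optional features."""
    type_name: str            # plays the role of the UIMA type, e.g. "City"
    begin: int                # offset of the first covered character
    end: int                  # offset one past the last covered character
    features: dict = field(default_factory=dict)

    def covered_text(self, document: str) -> str:
        """Return the region of the document this annotation covers."""
        return document[self.begin:self.end]


def annotate(document: str, surface: str, type_name: str, **features) -> list:
    """Create one Annotation for every occurrence of `surface` in `document`."""
    found = []
    start = document.find(surface)
    while start != -1:
        found.append(Annotation(type_name, start, start + len(surface), dict(features)))
        start = document.find(surface, start + len(surface))
    return found


# Hypothetical sample text; the list below stands in for the CAS.
text = "Cairo is a city. Many people visit Cairo."
cas = annotate(text, "Cairo", "City")
```

- Each element of `cas` records the label "City" together with the offsets of one mention of "Cairo", which mirrors the label/region pairing described above.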
- a lexical analysis annotator 1221 and a plurality of domain specific annotators 1222 are examples of the topic type.
- UIMA provides a component to wrap the aggregated annotators in a single unit.
- This unit is named the Aggregate analysis engine component 122.
- this component is composed from a Lexical analysis annotator 1221, which performs the Natural Language Processing (NLP) tasks for the traditional search, and Domain Specific Annotators 1222 (either statistical, rule based or dictionary based) that analyze the documents based on semantic information for a specific domain.
- the resulting annotations are stored in the CAS 126.
- UIMA defines another component - the CAS consumer - that receives the CAS after processing is done by an Analysis Engine. It is responsible for taking the results from the CAS and using them for some purpose. The CAS Consumer may also perform collection-level analysis, saving these results in an application-specific, aggregate data structure.
- a Complementary Indices Creator component 124 which is responsible for building the complementary indices.
- the Complementary Indices Creator component 124 is implemented as a CAS consumer. Thus it runs in the UIMA analysis pipeline.
- the Complementary Indices Creator 124 scans the CAS 126 for the annotations created by the domain specific annotators 1222. It uses an Annotation Linker component 170 to link an annotation to a topic in a semantic model 103.
- a semantic model mapper 160 is needed to map from a semantic model representation 103 (e.g. ontology, topic maps ... etc) to the topic model used by the system 100.
- the semantic model mapper 160 may be seen as providing a mapped semantic model 165, which is a translation of the semantic model 103 into a format consistent with the structure expected by the complementary indices creator 124. It will be appreciated that this mapped semantic model 165 need not exist as a permanent entity, but rather represent the capacity of the semantic model mapper 160 to provide translations from the semantic model 103 in the required format on demand.
- the complementary indices creator 124 uses the semantic model mapper 160 to retrieve the information needed from the semantic model 103 and add it to one or more complementary indices 150.
- In order for search applications to access and use the information stored in the complementary indices 150, there is provided an interface 180.
- This interface 180 complements the Search runtime providing an API to submit queries and retrieve results from the complementary indices and possibly also the main index.
- This API is the Complementary indices' API component.
- the system described herein stores semantic information in multiple indices. These indices are referred to collectively as the "complementary indices" 150.
- a search application (not shown), running for example at a client side, to provide the end-user with a GUI that contains controls and views to support semantic search visualization.
- semantic information needs to be stored in a semantic model representation 103 that enables different types of semantic information to be represented.
- Exemplary semantic model elements include:
- Topic: e.g. Egypt, Cairo, Arab League ... etc. A topic could have multiple names (e.g. IBM and International Business Machines) and a topic name may not be unique.
- Topic Type: the class of topics that a particular topic belongs to (e.g. Country, City, Organization ... etc).
- Attribute: any information that is specified by a topic type as being relevant to its topic instances, e.g. Population, GDP ... etc.
- Topic identifier: a token that provides an unambiguous indication of the identity of a topic.
- a semantic model sub-system provides the link between the annotation space, e.g. the CAS 126 and the topic space, e.g. the semantic model representation 103. It allows the annotations produced by the domain specific annotators 1222 to be linked to topics in a semantic model 103. Then the relevant information is stored in the indices 150 using a uniform representation that conforms to the semantic model elements explained above.
- the semantic model sub-system contains two components:
- Semantic Model Mapper 160- Semantic information can be represented using different model representations (e.g. topic maps, RDF, Ontolingua ... etc.).
- The "Semantic model mapper" component 160 maps the schema of the underlying semantic model representation 103 into the schema understood by the complementary indices creator 124 and the annotation linker 170 (i.e. Topics, Attributes, Relations ... etc).
- The implementation of a Semantic model mapper component 160 is dependent on the semantic model 103 to be mapped. There is provided an interface to allow using different implementations for different semantic model representations.
- The idea of the semantic model mapper 160 is to enable using any semantic model representation in the system 100.
- the semantic data is transformed and represented using the classes: Topic, Attribute, and Relation.
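- A minimal sketch of the uniform representation and of a mapper interface is given below. The source names only the classes Topic, Attribute and Relation; the field names, the `SemanticModelMapper` protocol, its two methods and the `InMemoryMapper` implementation are hypothetical and chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class Attribute:
    name: str
    value: str


@dataclass(frozen=True)
class Relation:
    name: str        # e.g. "is capital of"
    target_id: str   # topic identifier of the related topic


@dataclass
class Topic:
    topic_id: str
    names: list[str]
    topic_type: str
    attributes: list[Attribute] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)


class SemanticModelMapper(Protocol):
    """Hypothetical interface; one implementation per model representation."""

    def get_topic(self, topic_id: str) -> Topic | None: ...

    def find_by_name(self, name: str) -> list[Topic]: ...


class InMemoryMapper:
    """Toy implementation backed by a list of already-mapped topics."""

    def __init__(self, topics: list[Topic]) -> None:
        self._by_id = {t.topic_id: t for t in topics}

    def get_topic(self, topic_id: str) -> Topic | None:
        return self._by_id.get(topic_id)

    def find_by_name(self, name: str) -> list[Topic]:
        # Case-insensitive name matching is an assumption of this sketch.
        wanted = name.casefold()
        return [t for t in self._by_id.values()
                if any(n.casefold() == wanted for n in t.names)]
```

- A Topic Map, RDF or ontology backend would each provide its own class with the same two methods, so the rest of the pipeline only ever sees `Topic`, `Attribute` and `Relation` objects.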
- the "Annotation linker" component 170 attempts to link an annotation stored in the CAS 126 to a topic in a semantic model 103 of a specific domain.
- the annotation linker 170 communicates with the model through the Semantic model Mapper.
- the annotation linker 170 analyzes the information in the annotation (i.e. annotated text (surface form), UIMA type, and annotation features) and the semantic model 103 in order to collect sufficient evidence that this annotation represents a certain topic in the mapped semantic model 165.
- Because the annotation linker 170 analyzes the information provided in the annotation and attempts to find a matching topic, it should understand how the information is structured in both the UIMA type system, which defines the annotation structure, and the semantic model 103, which defines the topic structure.
- annotation linker 170 should consider such differences.
- annotation linker code is developed for a specific UIMA type system definition and a specific semantic model 103.
- the linker code analyzes the annotations created by the domain specific annotator(s) 1222 using the UIMA type system definition and attempts to match it to a topic in a semantic model 103 for the same domain.
- Multiple implementations can be developed for the different domains for which annotators 1222 and semantic models 103 exist.
- First embodiment- Topic and Mention indices Figure 2 shows details of the complementary indices 150 in accordance with a first embodiment.
- In this first embodiment there are defined two complementary indices, each comprising entries defined as a key/values pair:
- Topic index 252 whose keys are topic identifiers, each key's associated values being the type, names, attributes, and topic identifiers for related topics.
- Mention index 254 whose keys are topic names, the value of each key being the topic identifier(s) for topic(s) this name refers to.
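- The two key/values structures can be sketched as ordinary dictionaries. The topic records below are hypothetical sample data (only "Top1" for Cairo and the IBM name pair come from the text), and `build_indices` is an illustrative helper, not part of the described system.

```python
from collections import defaultdict


def build_indices(topics):
    """Return (topic_index, mention_index) for a list of topic records."""
    topic_index = {}
    mention_index = defaultdict(list)
    for topic in topics:
        # Topic index: topic identifier -> type, names, attributes, related ids.
        topic_index[topic["id"]] = {
            "type": topic["type"],
            "names": list(topic["names"]),
            "attributes": dict(topic["attributes"]),
            "related": list(topic["related"]),
        }
        # Mention index: topic name -> identifiers of the topics it refers to.
        for name in topic["names"]:
            mention_index[name].append(topic["id"])
    return topic_index, dict(mention_index)


# Hypothetical sample topics.
topics = [
    {"id": "Top1", "type": "City", "names": ["Cairo"],
     "attributes": {}, "related": ["Top2"]},
    {"id": "Top2", "type": "Country", "names": ["Egypt"],
     "attributes": {"Continent": "Africa"}, "related": ["Top1"]},
    {"id": "Top3", "type": "Organization",
     "names": ["IBM", "International Business Machines"],
     "attributes": {}, "related": []},
]
topic_index, mention_index = build_indices(topics)
```

- Because a topic may have several names, both "IBM" and "International Business Machines" resolve to the same identifier in the mention index; a non-unique name would simply map to a list with more than one identifier.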
- the complementary indices 252, 254 need to be built and provided to the search applications by the search engine.
- Figure 3 shows a flowchart describing certain steps implementing the first embodiment.
- the Search Engine Crawler 110 runs periodically to retrieve new and updated artefacts
- the Parser processes the artefacts to retrieve the data needed to build the indices 140, 150. This is performed in two stages:
- the standard lexical analysis annotator 1221 performs the NLP tasks performed by a typical search engine at step 210.
- The results are stored in an intermediate form that the search engine indexer can operate on to create the main index 140.
- the Domain specific annotators 1222 process the documents and annotate the topic mentions at step 215.
- the annotations contain the attribute values for the specific topics being annotated.
- The annotations are stored in the CAS 126. This step thus constitutes associating type annotations with elements of one or more source artefacts, where the type annotations indicate adherence to one of a set of predefined types.
- the complementary indices creator 124 analyzes the domain specific annotations in the CAS 126 and builds the complementary indices 150. This is done in three stages:
- the Annotation Linker component 170 links an annotation to a topic at step 220.
- the Annotation Linker maps the semantic model 103 to the topics model representation used by the system.
- the Annotation linker serves to identify in a semantic model 103, semantic keys corresponding to the annotations associated with each element annotated at step 215, semantic model 103 comprising a plurality of said semantic keys each being associated with a plurality of semantic descriptors.
- The annotation linker 170 uses a Semantic Model Mapper 160 as described herein to map the semantic model 103 to the topics model representation used by the system 100.
- the complementary indices creator retrieves the semantic information associated with a topic and maps it into semantic descriptors at step 225, thereby associating the semantic descriptors associated with each semantic key thus identified as corresponding to the annotations associated with a respective element, with that element.
- The complementary indices creator uses the Semantic Model Mapper component to retrieve the semantic information associated with a topic and map it into semantic descriptors.
- step 225 of associating the semantic descriptors associated with each semantic key identified as corresponding to the annotations associated with a respective element, with that element is implemented by compiling a complementary Topic index 152 belonging to the complementary indices 150, storing a reference to each said element and said associated semantic descriptor.
- At step 230, the search engine indexer 130 creates the main index 140.
- Step 215 of associating type annotations with elements of one or more said source artefacts is carried out by means of a UIMA annotator or a UIMA aggregate analysis engine, where the resulting annotations are stored in a UIMA Common Analysis Structure, and where said steps 220 of identifying semantic keys and 225 of associating the semantic descriptors are carried out in the role of CAS consumers.
- the key of step 225 comprises a topic identifier and the semantic descriptors include at least one of the type, names, attributes, and topic identifiers for related topics.
- the key associated with a semantic descriptor in said complementary index is a topic name and the semantic descriptors include the topic identifier for topics this name refers to.
- the complementary indices 150 thus comprises the topic index 152.
- FIG. 4 shows details of the complementary indices 150 in accordance with a second embodiment.
- The system description is similar to that of the first embodiment described with respect to figures 2 and 3.
- An additional complementary index is added to relate topics together if they frequently co-occurred in the collection of documents. This allows discovering the relation between the topics even if they are not related in the semantic model.
- This further complementary index is referred to as the "Co-occurrence index" 456.
- An index entry may take the form of a key/value pair, wherein:
- the key is a topic identifier.
- the associated value is a list of topic identifiers for topics that co-occurred at least once in a document with the entry topic.
- The topic identifiers are ranked based on how frequently they co-occurred in documents, given a selected criterion of how close a co-occurring topic was to the entry topic (i.e. same sentence, same paragraph or same document). Topics in this list may not have a relation with the entry topic in a semantic model.
- Complementary indices creator 124: this component creates the Co-occurrence index by tracking the topics that co-occurred within a specified boundary. The boundaries are defined during lexical analysis by the lexical annotator.
- Interface APIs 180: a new API is added to the interface APIs that receives a topic identifier and a boundary (sentence, paragraph or document) and returns the topic identifiers of the co-occurring topics. It adds to the visualization functionality by enabling the search application to link a topic to its co-occurring topics, which allows a richer graph representation.
- the key is a topic identifier and the semantic descriptors include a list of topic identifiers for topics that co-occurred at least once in an artefact.
- The user can change the scope of the co-occurrences by selecting the proximity of the co-occurring word to the search term, for example same sentence, same paragraph, same document, or a given number of words or characters. This factor may be referred to as the scope of the co-occurrence search.
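- The co-occurrence index and its scope parameter can be sketched as follows. The input layout (a document as a list of paragraphs, a paragraph as a list of sentences, a sentence as a list of already-linked topic identifiers), the function name and the alphabetical tie-break are assumptions of this sketch; word- or character-distance scopes are not covered.

```python
from collections import Counter, defaultdict
from itertools import permutations


def cooccurrence_index(docs, scope="sentence"):
    """Map each topic id to co-occurring topic ids, most frequent first."""
    counts = defaultdict(Counter)
    for doc in docs:
        # Flatten the document into units matching the requested boundary.
        if scope == "document":
            units = [[t for para in doc for sent in para for t in sent]]
        elif scope == "paragraph":
            units = [[t for sent in para for t in sent] for para in doc]
        elif scope == "sentence":
            units = [sent for para in doc for sent in para]
        else:
            raise ValueError(f"unknown scope: {scope}")
        for unit in units:
            # Count each ordered pair of distinct topics once per unit.
            for entry, other in permutations(set(unit), 2):
                counts[entry][other] += 1
    # Rank by descending frequency; ties broken by topic id (assumption).
    return {
        entry: [t for t, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))]
        for entry, c in counts.items()
    }


# Hypothetical linked documents.
docs = [
    [[["Top1", "Top2"], ["Top3"]]],               # one paragraph, two sentences
    [[["Top1", "Top2"]], [["Top1", "Top3"]]],     # two paragraphs, one sentence each
]
by_sentence = cooccurrence_index(docs, scope="sentence")
by_paragraph = cooccurrence_index(docs, scope="paragraph")
```

- Widening the scope from sentence to paragraph brings in pairs such as Top2/Top3 that never share a sentence, which is the effect the user-selectable scope is meant to provide.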
- FIG. 5 shows details of the complementary indices 150 in accordance with a third embodiment.
- The system description is similar to that of the first embodiment described with respect to figures 2 and 3.
- An additional complementary index can be added to allow the search application to populate the search fields with values for topic attributes. This helps to simplify building the search query.
- This further complementary index is referred to as an "Attribute index".
- An entry in this index may take the form of a key/value pair, wherein
- the key is an attribute name for a specific topic type.
- The associated value is a list of this attribute's values in all topics of the given topic type.
- Complementary indices creator 124: this component creates the index while it is looping over the annotations linked to the topics.
- Facade APIs 180: an additional API is added to the "Facade APIs" that receives an attribute and returns all possible values for this attribute. This allows the GUI query fields to be populated with attribute values, which simplifies building the search query, so that a user can search by attribute name and value.
- The key is an attribute name for a specific topic type and the semantic descriptors include a list of this attribute's values in all topics of the given topic type.
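- A minimal sketch of the Attribute index follows. Representing the key as a `(topic type, attribute name)` tuple, the "Continent" attribute and the sample topics are assumptions made for illustration.

```python
def build_attribute_index(topics):
    """Map (topic type, attribute name) to the sorted list of distinct values."""
    collected = {}
    for topic in topics:
        for name, value in topic["attributes"].items():
            collected.setdefault((topic["type"], name), set()).add(value)
    return {key: sorted(values) for key, values in collected.items()}


def values_for(attribute_index, topic_type, attribute):
    """Facade-style lookup: all known values for an attribute of a topic type."""
    return attribute_index.get((topic_type, attribute), [])


# Hypothetical sample topics.
topics = [
    {"id": "Top2", "type": "Country", "attributes": {"Continent": "Africa"}},
    {"id": "Top5", "type": "Country", "attributes": {"Continent": "Asia"}},
    {"id": "Top1", "type": "City", "attributes": {}},
]
attribute_index = build_attribute_index(topics)
```

- A search GUI could call `values_for(attribute_index, "Country", "Continent")` to fill a drop-down list, which is the query-simplification use described above.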
- Figure 6 shows a block diagram of a system in accordance with a fourth embodiment.
- The system description is similar to that of the first embodiment described with respect to figures 2 and 3.
- certain annotators 1221, 1222 create annotations based on dictionary lookup to annotate topic mentions in text, and use dictionaries 690 built from the semantic models integrated with the system, instead of a pre-built dictionary by a dictionary builder 695.
- An entry in a dictionary 695 may contain:
- Key: a sequence of characters representing a topic name.
- Value: a list of topic ids and topic types for the topics with the name in the key.
- the "Keys" are gathered from the topics' names of the semantic model 103 and the values are gathered from the topics' identifiers and types.
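- Building such a dictionary from the semantic model, and using it for token lookup, might look like the sketch below. The record layout, the function names and the restriction to single-token names are assumptions; a real annotator would also need to handle multi-word topic names.

```python
def build_dictionary(topics):
    """Map each topic name to a list of (topic id, topic type) pairs."""
    dictionary = {}
    for topic in topics:
        for name in topic["names"]:
            dictionary.setdefault(name, []).append((topic["id"], topic["type"]))
    return dictionary


def lookup(tokens, dictionary):
    """Create one annotation per dictionary hit, carrying a TopicID feature."""
    annotations = []
    for position, token in enumerate(tokens):
        for topic_id, topic_type in dictionary.get(token, []):
            annotations.append(
                {"type": topic_type, "token": position, "TopicID": topic_id}
            )
    return annotations


# Hypothetical sample topics and tokenized text.
topics = [
    {"id": "Top1", "type": "City", "names": ["Cairo"]},
    {"id": "Top2", "type": "Country", "names": ["Egypt"]},
]
dictionary = build_dictionary(topics)
annotations = lookup(["Cairo", "is", "in", "Egypt"], dictionary)
```

- Because the dictionary is generated from the model itself, every annotation it produces already carries the identifier of the topic it refers to.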
- Annotation Linker 170: an annotation created by the dictionary based annotator contains a feature that represents a topic identifier.
- The "Annotation Linker" matches the "Topic ID" of a topic in the semantic representation with the "Topic Identifier" feature of the annotation.
- Figure 7 shows a flowchart for the modified parsing & annotation linking processes according to the fourth embodiment.
- The method starts at step 710, and proceeds to step 720 at which the Lexical analysis annotator 1221 analyzes document text and produces boundary annotations (i.e. paragraph, sentence and token annotations).
- the method next proceeds to step 730 at which the dictionary based annotator 1222 uses the token boundaries to perform dictionary lookups in the semantic model dictionaries 690.
- the dictionary based annotator 1222 creates an annotation per identified topic type.
- the created annotations contain a feature that represents the topic ID.
- the Topic ID annotation linker 170 links the annotations to topics in the semantic model 103 at step 740.
- The method then terminates at step 750.
- Figure 8 shows a flowchart for the behaviour of the modified Topic ID based annotation linker according to the fourth embodiment.
- the method starts at step 810, and proceeds to step 820 at which it is determined whether there remain any annotations to be linked. In a case where no annotations remain to be linked, the method terminates at step 830. Otherwise, the method proceeds to step 840 at which the next annotation to be linked is obtained.
- The method proceeds to step 850 at which it is determined whether the annotation to be linked has a "TopicID" feature, representing a particular Topic ID. If the feature exists, the method proceeds to step 860 at which the Annotation Linker 170 directly links the annotation to a Topic in the semantic model 103 with the matching ID before returning to step 820. Otherwise the method proceeds directly to step 820.
- the method thus checks each annotation to be linked in turn until none remain.
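- The loop of figure 8 can be sketched in a few lines. The dictionary-based annotation and topic records are hypothetical stand-ins, and silently skipping an identifier that is absent from the model is an assumption, since the flowchart does not describe that case.

```python
def link_by_topic_id(annotations, topics_by_id):
    """Link each annotation that carries a TopicID feature to its topic."""
    links = []
    for annotation in annotations:              # steps 820 and 840
        topic_id = annotation.get("TopicID")    # step 850
        if topic_id is None:
            continue                            # no feature: back to step 820
        topic = topics_by_id.get(topic_id)
        if topic is not None:                   # step 860
            links.append((annotation, topic))
    return links                                # step 830: nothing left to link


# Hypothetical sample data.
topics_by_id = {"Top1": {"id": "Top1", "type": "City", "names": ["Cairo"]}}
annotations = [
    {"type": "City", "TopicID": "Top1"},
    {"type": "Token"},                          # has no TopicID feature
]
links = link_by_topic_id(annotations, topics_by_id)
```

- No name or type matching is needed here, because the identifier was attached when the dictionary based annotator created the annotation.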
- Figure 9 shows a subset of the information provided in a Topic Map representation of a semantic model 103.
- the semantic model 103 contains information about countries - cities - political, financial and social organizations - religions - geographical information (i.e. rivers, lakes, seas, oceans, islands, continents ... etc).
- the Semantic Model Mapper 160 is used to convert this Topic Map to a semantic model representation 165 compatible with the Complementary Indices Creator 124.
- cartouches 911, 912, 913, 914, 915, 916, 917, 918, 919 etc. each representing a Topic, such as "India", "Contained in", "Country", "Egypt", "is capital of", "Capital", "Cairo", "Is seat of", "Population" respectively.
- circles 921, 922, 923, etc. each indicating an association role. Boxes 931, 932, 933 etc. indicate occurrences as described hereafter.
- Diamonds such as 941, 942 etc define associations, by connecting different topics with reference to an association role.
- FIG. 10 shows the mapped Semantic Model 165 from the previous Topic Map 103 shown in figure 9. As shown there are defined a plurality of Topics 1010, 1020, 1030, 1040, 1050, 1060.
- Each topic in this representation contains respectively a unique identifier 1011, 1021, 1031, 1041, 1051, 1061 such as Top1 for 1011, Top2 for 1021, ... etc, which is stored in the Topic Map XML file, a list of names 1012, 1022, 1032, 1042, 1052, 1062, its topic type 1013, 1023, 1033, 1043, 1053, 1063, a list of attributes 1014, 1024, 1034, 1044, 1054, 1064, and a list of relations with the other topics 1015, 1025, 1035, 1045, 1055, 1065.
- domain specific annotators 1222 are used during the parsing phase to process the artefacts 101 and create annotations over semantic entities in the text in the artefact.
- an annotator 1222 that creates annotations that identify "Countries" and "Cities" in the text.
- GDC Global Delivery Center
- A domain specific annotator 1222 creates "City" annotations with respect to mentions of "Pune" and "Cairo". It also creates a "Country" annotation with respect to the mention of "Egypt".
- Figure 11 shows an example for the annotations created by the domain specific annotator 1222.
- the Complementary Indices Creator 124 uses the Annotation Linker 170 to link annotations to matching topics from the mapped semantic model 165.
- Annotation Linker processing may be provided according to different approaches. The type system acts as the contract between the Annotation Linker 170 and the domain specific annotator 1222. Also, the Annotation Linker has access to the Mapped Semantic Model 165 through the Semantic Model Mapper 160.
- Annotation Linker 170 may use some rules to make the link. Rules can be fired based on logical expression which combines covered text, annotation types, features and their values, along with topic identifiers, relations, names and attributes.
- a simplified example set of rules may contain - among other rules - the following:
- 1- IF is conditional branching: IF checks some conditions and, if all these conditions pass, then it does the statements in
- 2- Annotation and Topic mean an instance of an Annotation and of a Topic respectively, so their attributes can be accessed using the dot operator.
- 3- EQUALS checks whether the right hand side's value equals the left hand side's value.
- The rule-based annotation linker can link annotations (figure 11) and topics (figure 10) in the following manner: • "Annotation 1" to "Top1" representing the city of "Cairo" (1110 in fig. 11)
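- A rule-based linker in the spirit of the IF / EQUALS notation can be sketched as below. The rule set of the original example is not reproduced in this text, so the single rule shown (same type and covered text equal to one of the topic's names) is hypothetical, as are the record layouts and function names.

```python
def rule_same_type_and_name(annotation, topic):
    """IF Annotation.type EQUALS Topic.type AND Annotation.text EQUALS a Topic name."""
    return (annotation["type"] == topic["type"]
            and annotation["text"] in topic["names"])


RULES = [rule_same_type_and_name]


def link(annotations, topics, rules=RULES):
    """Return {annotation id: topic id} for the first topic any rule accepts."""
    links = {}
    for annotation in annotations:
        for topic in topics:
            if any(rule(annotation, topic) for rule in rules):
                links[annotation["id"]] = topic["id"]
                break
    return links


# Hypothetical sample data.
topics = [
    {"id": "Top1", "type": "City", "names": ["Cairo"]},
    {"id": "Top2", "type": "Country", "names": ["Egypt"]},
]
annotations = [
    {"id": "Annotation 1", "type": "City", "text": "Cairo"},
    {"id": "Annotation 2", "type": "Country", "text": "Egypt"},
    {"id": "Annotation 3", "type": "City", "text": "Pune"},   # no such topic here
]
links = link(annotations, topics)
```

- Further rules, for example ones that compare annotation features with topic attributes or relations, could be appended to `RULES` without changing the linking loop.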
- The Complementary Indices Creator 124 fills the four indices: Co-occurrence Index, Topic Index, Mention Index, and Attribute Index.
- Topic Index The key is set to matching topic identifiers and value is filled with matching topic names, type, attributes and relations from the Mapped Semantic Model 165. Table 1 below shows how the information that would be added to the topic index 252 on the basis of the present example.
- Attribute Index Key is set to attributes names contained in mentioned topics, and value is filled with all possible attribute values. Table 3 below shows how the information that would be added to the Attribute index 558 on the basis of the present example.
- Key is set to mentioned topics identifier, and value if filled with topic identifiers for topic which co-occurred with key's mentioned topic.
- Table 4 shows how the information that would be added to the cooccurrence index 456 on the basis of the present example.
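As a rough illustration of how three of these indices could be filled from linked mentions, consider the sketch below. The semantic model contents, the attribute name `continent`, the relation names, and the choice of "same document" as the co-occurrence scope are assumptions for the example. The Mention Index is left out because its key and value layout is not spelled out in this passage.

```python
from collections import defaultdict

# Hypothetical mapped semantic model: topic id -> names, type, attributes, relations.
SEMANTIC_MODEL = {
    "t-cairo": {"names": ["Cairo"], "type": "City",
                "attributes": {"continent": "Africa"},
                "relations": {"capital_of": ["t-egypt"]}},
    "t-egypt": {"names": ["Egypt"], "type": "Country",
                "attributes": {"continent": "Africa"},
                "relations": {"has_capital": ["t-cairo"]}},
    "t-pune": {"names": ["Pune"], "type": "City",
               "attributes": {"continent": "Asia"},
               "relations": {}},
}


def build_indices(linked_mentions, model):
    """linked_mentions: iterable of (document id, topic id) pairs from the linker."""
    topic_index = {}
    attribute_index = defaultdict(set)
    cooccurrence_index = defaultdict(set)
    topics_per_document = defaultdict(set)

    for doc_id, topic_id in linked_mentions:
        topic = model[topic_id]
        # Topic Index: topic id -> names, type, attributes and relations.
        topic_index[topic_id] = topic
        # Attribute Index: attribute name -> every value seen for mentioned topics.
        for name, value in topic["attributes"].items():
            attribute_index[name].add(value)
        topics_per_document[doc_id].add(topic_id)

    # Co-occurrence Index: topic id -> ids of topics mentioned in the same document
    # (document-level scope is an assumption of this sketch).
    for topic_ids in topics_per_document.values():
        for topic_id in topic_ids:
            cooccurrence_index[topic_id] |= topic_ids - {topic_id}

    return topic_index, dict(attribute_index), dict(cooccurrence_index)


linked = [("doc-1", "t-cairo"), ("doc-1", "t-egypt"), ("doc-1", "t-pune")]
topic_index, attribute_index, cooccurrence_index = build_indices(linked, SEMANTIC_MODEL)
```

Sets are used for the values so that repeated mentions of the same topic do not produce duplicate entries.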
- a semantic search application can make queries to these indices via the interface 180 at runtime to help the end-user in visualizing the relations among topics and further knowledge discovery as explained in the disclosure.
- Keyword search: a list of documents matching the search query is returned from the main index.
- Semantic search: topic ids are retrieved for topics matching the search query. A list of documents that contain mentions of these topics is also returned. Moreover, upon request, topic information is retrieved based on topic ids. This functionality combines information from the Main index
- Topic detection: in keyword search, a method is provided to determine whether the keyword matches a topic.
- a search is performed in the Mention index and a list of topic ids is returned for matching topics.
- Topic index: given the topic id of the topic to be visualized, a graph is returned that connects this topic to all topics directly related to it in the semantic model. The information in this graph enables the search application to provide search refinement and discovery capabilities through semantic visualization. The information is retrieved from the Topic index.
- Clustering: this functionality allows classification of topics into groups based on a specific attribute value. Given the topics to be classified and the attribute by which they are to be classified, the value of this attribute is retrieved from the Topic index for each topic, and the topics having the same value for the given attribute are grouped.
- Topic information retrieval this functionality allows for retrieval of names, attributes and relations for a specific topic. This helps when visualizing a topic.
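The topic detection, topic relations and clustering functions listed above can be sketched over in-memory dictionaries. The index contents, the function names and the lower-cased mention keys are hypothetical; the patent does not prescribe this layout.

```python
from collections import defaultdict

# Hypothetical Topic index: topic id -> names, type, attributes, relations.
TOPIC_INDEX = {
    "t-cairo": {"names": ["Cairo"], "type": "City",
                "attributes": {"continent": "Africa"},
                "relations": {"capital_of": ["t-egypt"]}},
    "t-egypt": {"names": ["Egypt"], "type": "Country",
                "attributes": {"continent": "Africa"},
                "relations": {"has_capital": ["t-cairo"]}},
    "t-pune": {"names": ["Pune"], "type": "City",
               "attributes": {"continent": "Asia"},
               "relations": {}},
}

# Hypothetical Mention index: lower-cased mention text -> matching topic ids.
MENTION_INDEX = {"cairo": ["t-cairo"], "egypt": ["t-egypt"], "pune": ["t-pune"]}


def detect_topics(keyword, mention_index):
    """Topic detection: return the topic ids whose mentions match the keyword."""
    return mention_index.get(keyword.casefold(), [])


def topic_relations(topic_id, topic_index):
    """Return (source, relation, target) edges for directly related topics."""
    relations = topic_index[topic_id]["relations"]
    return [(topic_id, name, target)
            for name, targets in relations.items()
            for target in targets]


def cluster_topics(topic_ids, attribute, topic_index):
    """Clustering: group topic ids by their value for the given attribute."""
    groups = defaultdict(list)
    for topic_id in topic_ids:
        # Topics lacking the attribute fall into the None group.
        value = topic_index[topic_id]["attributes"].get(attribute)
        groups[value].append(topic_id)
    return dict(groups)


matched = detect_topics("Cairo", MENTION_INDEX)
edges = topic_relations("t-egypt", TOPIC_INDEX)
clusters = cluster_topics(["t-cairo", "t-pune", "t-egypt"], "continent", TOPIC_INDEX)
```

The edge list returned by `topic_relations` is one possible data structure for drawing the relation graph; a search application could equally use an adjacency map.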
- a search sequence may be envisaged along the following lines:
- a user submits a search query "countries contained in Africa”.
- the search application builds a query string (that follows the search engine query syntax) and submits it to the search engine using the interface 180 (Topics retrieval API and Document retrieval).
- search results are then returned to the search Application.
- a list of topic names is displayed to the user. Also documents matching the search query are displayed to the user.
- the user can see a graph of the matched topic relative to a grouping attribute.
- the user selects the grouping attribute.
- the search application uses the "Topics clustering" function and presents a graph to the user with the matched topics in relation to the grouping attribute.
- the user can select one of the topics in the matched topics graph or list.
- the search application uses the "Topic relations" function from the Facade APIs to retrieve a data structure with the selected topic and its related topics.
- the search application presents a graph with the topic relations.
- the search application also uses the "Document retrieval" function to return the documents matching the selected topic.
- the user can hover on the selected topic.
- the search application uses the "Topic information retrieval" function and presents a tooltip with the topic information (topic names, type and attributes).
- the user can select any topic on the graph and then its information is presented to the user using the same methodology.
- a method of processing source documents to provide complementary semantic databases that enable improved semantic search is provided. Source documents are parsed, and type annotations indicating adherence to one of a set of predefined types are associated with particular words in the source documents. For each annotated word, semantic keys are identified that correspond to the annotations associated with it. These semantic keys correspond to semantic keys defined in a semantic model, which associates each semantic key with a plurality of semantic descriptors. On this basis a complementary database is compiled, associating each word with the semantic descriptors associated with each semantic key thus identified as corresponding to the annotations associated with the respective word. A number of different complementary databases are proposed.
- the invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements.
- the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
- the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system.
- a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
- the medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium.
- Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), compact disk - read only memory (CD-ROM) and compact disk - read/write (CD-R/W).
- a data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus.
- the memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
- I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.
- Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks.
- Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a method of processing source documents to provide complementary semantic databases that enable improved semantic search. Source documents are parsed, and type annotations indicating adherence to one of a set of predefined types are associated with particular words in the source documents. For each annotated word, semantic keys are identified that correspond to the annotations associated with it. These semantic keys correspond to semantic keys defined in a semantic model, which associates each semantic key with a plurality of semantic descriptors. On this basis, a complementary database is compiled, associating each word with the semantic descriptors associated with each semantic key thus identified as corresponding to the annotations associated with the respective word. A number of different complementary databases are proposed.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP09151971 | 2009-02-03 | ||
| EP09151971.0 | 2009-02-03 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2010089248A1 (fr) | 2010-08-12 |
Family
ID=42115969
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/EP2010/051055 Ceased WO2010089248A1 (fr) | Procédé et système de recherche sémantique | 2009-02-03 | 2010-01-29 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2010089248A1 (fr) |
Cited By (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103678277A (zh) * | 2013-12-04 | 2014-03-26 | 东软集团股份有限公司 | 基于文档分段的构建主题-词汇分布的方法及系统 |
| US9280340B2 (en) | 2014-04-01 | 2016-03-08 | International Business Machines Corporation | Dynamically building an unstructured information management architecture (UIMA) pipeline |
| US9418066B2 (en) | 2013-06-27 | 2016-08-16 | International Business Machines Corporation | Enhanced document input parsing |
| US9734046B2 (en) | 2014-04-01 | 2017-08-15 | International Business Machines Corporation | Recording, replaying and modifying an unstructured information management architecture (UIMA) pipeline |
| CN111460169A (zh) * | 2020-03-27 | 2020-07-28 | 科大讯飞股份有限公司 | 语义表达式生成方法、装置及设备 |
| CN111666370A (zh) * | 2020-07-28 | 2020-09-15 | 中国人民解放军国防科技大学 | 面向多源异构航天数据的语义索引方法和装置 |
| US10878190B2 (en) | 2016-04-26 | 2020-12-29 | International Business Machines Corporation | Structured dictionary population utilizing text analytics of unstructured language dictionary text |
| US11720346B2 (en) | 2020-10-02 | 2023-08-08 | International Business Machines Corporation | Semantic code retrieval using graph matching |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2004075466A2 (fr) | 2003-02-14 | 2004-09-02 | Nervana, Inc. | Systeme et procede pour une extraction, une gestion, une capture, un partage, une decouverte, une distribution et une presentation de connaissances semantiques |
| WO2006116273A2 (fr) | 2005-04-22 | 2006-11-02 | Google, Inc. | Categorisation d'objets, de type documents et/ou groupes, par rapport a une taxinomie et a des structures de donnees derivees de ladite categorisation |
| WO2006128123A2 (fr) | 2005-05-27 | 2006-11-30 | Hakia, Inc. | Systeme et procede de traitement de langage naturel utilisant des recherches ontologiques |
| US20070038608A1 (en) | 2005-08-10 | 2007-02-15 | Anjun Chen | Computer search system for improved web page ranking and presentation |
| WO2007059287A1 (fr) | 2005-11-16 | 2007-05-24 | Evri Inc. | Extension de recherche par mot cle a des donnees annotees syntaxiquement et semantiquement |
Non-Patent Citations (2)
| Title |
|---|
| FERRUCCI D ET AL: "UIMA: an architectural approach to unstructured information processing in the corporate research environment", NATURAL LANGUAGE ENGINEERING, CAMBRIDGE UNIVERSITY PRESS, CAMBRIDGE, GB, vol. 10, no. 3-4, 1 September 2004 (2004-09-01), pages 327 - 348, XP009133296, ISSN: 1351-3249 * |
| GREG SMITH; MARY CZERWINSKI; BRIAN MEYERS; DANIEL ROBBINS; GEORGE ROBERTSON; DESNEY S. TAN: "FacetMap: A Scalable Search and Browse Visualization", IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, vol. 12, no. 5, 2006, XP011150880, DOI: doi:10.1109/TVCG.2006.142 |
Cited By (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10430469B2 (en) | 2013-06-27 | 2019-10-01 | International Business Machines Corporation | Enhanced document input parsing |
| US9418066B2 (en) | 2013-06-27 | 2016-08-16 | International Business Machines Corporation | Enhanced document input parsing |
| US9558187B2 (en) | 2013-06-27 | 2017-01-31 | International Business Machines Corporation | Enhanced document input parsing |
| US10437890B2 (en) | 2013-06-27 | 2019-10-08 | International Business Machines Corporation | Enhanced document input parsing |
| CN103678277A (zh) * | 2013-12-04 | 2014-03-26 | 东软集团股份有限公司 | 基于文档分段的构建主题-词汇分布的方法及系统 |
| US9280340B2 (en) | 2014-04-01 | 2016-03-08 | International Business Machines Corporation | Dynamically building an unstructured information management architecture (UIMA) pipeline |
| US9734046B2 (en) | 2014-04-01 | 2017-08-15 | International Business Machines Corporation | Recording, replaying and modifying an unstructured information management architecture (UIMA) pipeline |
| US10268573B2 (en) | 2014-04-01 | 2019-04-23 | International Business Machines Corporation | Recording, replaying and modifying an unstructured information management architecture (UIMA) pipeline |
| US10878190B2 (en) | 2016-04-26 | 2020-12-29 | International Business Machines Corporation | Structured dictionary population utilizing text analytics of unstructured language dictionary text |
| CN111460169A (zh) * | 2020-03-27 | 2020-07-28 | 科大讯飞股份有限公司 | 语义表达式生成方法、装置及设备 |
| CN111666370A (zh) * | 2020-07-28 | 2020-09-15 | 中国人民解放军国防科技大学 | 面向多源异构航天数据的语义索引方法和装置 |
| CN111666370B (zh) * | 2020-07-28 | 2022-04-22 | 中国人民解放军国防科技大学 | 面向多源异构航天数据的语义索引方法和装置 |
| US11720346B2 (en) | 2020-10-02 | 2023-08-08 | International Business Machines Corporation | Semantic code retrieval using graph matching |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| Wong et al. | Ontology learning from text: A look back and into the future | |
| US7974984B2 (en) | Method and system for managing single and multiple taxonomies | |
| US20100174739A1 (en) | System and Method for Wikifying Content for Knowledge Navigation and Discovery | |
| US20090217179A1 (en) | System and method for knowledge navigation and discovery utilizing a graphical user interface | |
| US20090070322A1 (en) | Browsing knowledge on the basis of semantic relations | |
| US20080306918A1 (en) | System and method for wikifying content for knowledge navigation and discovery | |
| EP3039578A1 (fr) | Procédé et système d'identification et d'évaluation de motifs sémantiques dans un langage écrit | |
| Dong et al. | A survey in semantic search technologies | |
| WO2010089248A1 (fr) | Procédé et système de recherche sémantique | |
| Hinze et al. | Improving access to large-scale digital libraries through semantic-enhanced search and disambiguation | |
| Bontcheva et al. | Semantic annotations and retrieval: Manual, semiautomatic, and automatic generation | |
| Pyshkin et al. | Approaches for web search user interfaces | |
| Navigli et al. | BabelNetXplorer: a platform for multilingual lexical knowledge base access and exploration | |
| Zhang | Ontology and the semantic web | |
| WO2009035871A1 (fr) | Connaissances de navigation sur la base de relations sémantiques | |
| Mirizzi et al. | From exploratory search to web search and back | |
| Klan et al. | Integrated Semantic Search on Structured and Unstructured Data in the ADOnIS System. | |
| Hinze et al. | Capisco: low-cost concept-based access to digital libraries | |
| Andrews et al. | Semantic disambiguation in folksonomy: a case study | |
| Neri et al. | Mining the Web to monitor the Political Consensus | |
| Schiessl et al. | Ontology lexicalization: Relationship between content and meaning in the context of Information Retrieval | |
| Fogarolli | Wikipedia as a source of ontological knowledge: state of the art and application | |
| Cameron et al. | Semantics-empowered text exploration for knowledge discovery | |
| US20240354318A1 (en) | System and method for searching tree based organizational hierarchies, including topic hierarchies, and generating and presenting search interfaces for same | |
| Zaman et al. | Towards Summarization of Aggregated Multimedia Verticals Web Search Results |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 10701541; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 10701541; Country of ref document: EP; Kind code of ref document: A1 |