WO2019172849A1 - Method and system for generating a structured knowledge data for a text - Google Patents
Method and system for generating a structured knowledge data for a text
- Publication number
- WO2019172849A1 (PCT/SG2019/050126)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- subject
- basic components
- node
- additional
- sentence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
Definitions
- the present invention generally relates to a method of generating a structured knowledge data for a text, and a system thereof.
- various conventional knowledge representation techniques only extract triples (i.e., sets of subject, predicate and object) from sentences. Accordingly, such conventional knowledge representation techniques only take into account the basic components (e.g., major meaning components) of the sentence, namely, subject, predicate and object, but ignore or disregard other sentence components, such as modifiers (or adjuncts).
- an example sentence is provided below:
- a staff with at least one Singapore Citizen child between 7 and 12 years old is eligible to take 2 days of Childcare Leave per year unconditionally, without the need to produce a medical certificate.
- the resulting knowledge representation for the above example sentence may be: (1) “staff” (subject), “is eligible” (predicate) and “Childcare Leave” (object) or (2) “staff” (subject), “is eligible to take” (predicate) and “Childcare Leave” (object).
- knowledge representation (or meaning representation)
- the prepositional phrases and noun phrases in the above example sentence provide additional information on the condition and time, which are helpful for conveying the original meaning of the example sentence above, but are not captured (or represented) in the above-mentioned conventional knowledge representation technique.
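- The loss described above can be made concrete with a minimal sketch. The triple and the list of modifier phrases below are written by hand from the example sentence (the segmentation is an assumption, not the output of any parser), and `lost_phrases` is a hypothetical helper name.

```python
# Illustrative only: the triple and the phrase list are hand-written from the
# example sentence; no parser is involved.
sentence = ("A staff with at least one Singapore Citizen child between 7 and 12 "
            "years old is eligible to take 2 days of Childcare Leave per year "
            "unconditionally, without the need to produce a medical certificate.")

# Conventional representation (2) from the text: subject, predicate, object.
conventional_triple = ("staff", "is eligible to take", "Childcare Leave")

# Assumed segmentation of the phrases that carry condition, quantity and time.
modifier_phrases = [
    "with at least one Singapore Citizen child between 7 and 12 years old",
    "2 days of",
    "per year",
    "unconditionally",
    "without the need to produce a medical certificate",
]


def lost_phrases(triple, phrases):
    """Return the phrases that appear in no element of the triple."""
    joined = " ".join(triple)
    return [phrase for phrase in phrases if phrase not in joined]


missing = lost_phrases(conventional_triple, modifier_phrases)
```

- Under these assumptions every listed phrase ends up in `missing`, which is the information a triple-only representation discards.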
- a method of generating a structured knowledge data for a text comprising at least one sentence, using at least one processor, the method comprising:
- extracting a first set of basic components and one or more first modifiers associated with the first set of basic components from a first sentence of the text, the first set of basic components comprising a subject and a predicate associated with the subject; and forming a first data graph for the first sentence based on the first set of basic components and the one or more first modifiers associated with the first set of basic components, the first data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
- the subject node and the edge of the first data graph are configured to represent the subject and the predicate of the first set of basic components, respectively, and
- the above-mentioned forming the first data graph comprises assigning, for each of the one or more first modifiers corresponding to one of the subject and the predicate, the first modifier as an attribute to the corresponding one of the subject node and the edge.
- the first set of basic components further comprises an object
- the predicate indicates a semantic relationship between the subject and the object
- the object node is configured to represent the object
- the above-mentioned forming the first data graph further comprises assigning, for each of the one or more first modifiers corresponding to the object, the first modifier as an attribute to the object node.
- the above-mentioned extracting the first set of basic components and the one or more first modifiers comprises:
- the first set of basic components and the one or more first modifiers are extracted from the plurality of chunk components.
- the method further comprises:
- the one or more first modifiers are one or more adjuncts of the first sentence.
- the first data graph is a directed data graph.
- the text is an unstructured text.
- the method further comprises: extracting one or more first additional sets of basic components from the first sentence;
- an additional data graph for the first sentence based on the first additional set of basic components and the one or more additional modifiers associated with the first additional set of basic components, the additional data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
- the subject node and the edge of the additional data graph are configured to represent the subject and the predicate of the first additional set of basic components, respectively, and
- the above-mentioned forming the additional data graph comprises assigning, for each of the one or more additional modifiers corresponding to one of the subject and the predicate, the additional modifier as an attribute to the corresponding one of the subject node and the edge.
- the method further comprises merging the subject node of the first data graph and the subject node of the additional data graph of at least one of the one or more first additional sets of basic components from the first sentence as a common subject node if the subject represented by the subject node of the first data graph and the subject represented by the subject node of the additional data graph of the at least one of the one or more first additional sets of basic components correspond to each other.
- the text comprises a plurality of sentences and the method further comprises, for each additional sentence of the plurality of sentences: extracting a second set of basic components and one or more second modifiers associated with the second set of basic components from the additional sentence, the second set of basic components comprising a subject and a predicate associated with the subject; and forming a second data graph for the additional sentence based on the second set of basic components and the one or more second modifiers associated with the second set of basic components, the second data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
- the subject node and the edge of the second data graph are configured to represent the subject and the predicate of the second set of basic components, respectively, and the above-mentioned forming the second data graph comprises assigning, for each of the one or more second modifiers corresponding to one of the subject and the predicate, the second modifier as an attribute to the corresponding one of the subject node and the edge.
- a system for generating a structured knowledge data for a text comprising at least one sentence comprising:
- a memory; and at least one processor communicatively coupled to the memory and configured to: extract a first set of basic components and one or more first modifiers associated with the first set of basic components from a first sentence of the text, the first set of basic components comprising a subject and a predicate associated with the subject; and
- form a first data graph for the first sentence based on the first set of basic components and the one or more first modifiers associated with the first set of basic components, the first data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
- the subject node and the edge of the first data graph are configured to represent the subject and the predicate of the first set of basic components, respectively, and
- the first data graph comprises assigning, for each of the one or more first modifiers corresponding to one of the subject and the predicate, the first modifier as an attribute to the corresponding one of the subject node and the edge.
- the first set of basic components further comprises an object
- the predicate indicates a semantic relationship between the subject and the object
- the object node is configured to represent the object
- the above-mentioned form the first data graph further comprises assigning, for each of the one or more first modifiers corresponding to the object, the first modifier as an attribute to the object node.
- the above-mentioned extract the first set of basic components and the one or more first modifiers comprises:
- the first set of basic components and the one or more first modifiers are extracted from the plurality of chunk components.
- the at least one processor is further configured to: identify one or more of the plurality of chunk components as a named entity; and label each of the one or more chunk components identified with a corresponding named entity class label.
- the one or more first modifiers are one or more adjuncts of the first sentence, and the text is an unstructured text.
- the first data graph is a directed data graph.
- the at least one processor is further configured to: extract one or more first additional sets of basic components from the first sentence;
- an additional data graph for the first sentence based on the first additional set of basic components and the one or more additional modifiers associated with the first additional set of basic components, the additional data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein the subject node and the edge of the additional data graph are configured to represent the subject and the predicate of the first additional set of basic components, respectively, and
- the additional data graph comprises assigning, for each of the one or more additional modifiers corresponding to one of the subject and the predicate, the additional modifier as an attribute to the corresponding one of the subject node and the edge.
- the at least one processor is further configured to merge the subject node of the first data graph and the subject node of the additional data graph of at least one of the one or more first additional sets of basic components from the first sentence as a common subject node if the subject represented by the subject node of the first data graph and the subject represented by the subject node of the additional data graph of the at least one of the one or more first additional sets of basic components correspond to each other.
- the text comprises a plurality of sentences and the at least one processor is further configured to, for each additional sentence of the plurality of sentences:
- the second set of basic components comprising a subject and a predicate associated with the subject;
- the second data graph for the additional sentence based on the second set of basic components and the one or more second modifiers associated with the second set of basic components, the second data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
- the subject node and the edge of the second data graph are configured to represent the subject and the predicate of the second set of basic components, respectively, and the above-mentioned form the second data graph comprises assigning, for each of the one or more second modifiers corresponding to one of the subject and the predicate, the second modifier as an attribute to the corresponding one of the subject node and the edge.
- a computer program product embodied in one or more non-transitory computer-readable storage mediums, comprising instructions executable by at least one processor to perform a method of generating a structured knowledge data for a text comprising at least one sentence, the method comprising:
- forming a first data graph for the first sentence based on the first set of basic components and the one or more first modifiers associated with the first set of basic components, the first data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
- the subject node and the edge of the first data graph are configured to represent the subject and the predicate of the first set of basic components, respectively, and
- the above-mentioned forming the first data graph comprises assigning, for each of the one or more first modifiers corresponding to one of the subject node and the edge, the first modifier as an attribute to the corresponding one of the subject node and the edge.
- FIG. 1 depicts a flow diagram illustrating a method of generating a structured knowledge data for a text, according to various embodiments of the present invention.
- FIG. 2 depicts a schematic block diagram of a system for generating a structured knowledge data for a text, according to various embodiments of the present invention.
- FIG. 3 depicts an example computer system which the system as described with respect to FIG. 2 may be embodied in, by way of an example only.
- FIG. 4A depicts an example sentence including a subject and a predicate, according to various example embodiments of the present invention.
- FIGs. 4B and 4C depict two example sentences, each including a subject, a predicate and an object, according to various example embodiments of the present invention.
- FIG. 5 depicts a schematic drawing of an example system for generating a structured knowledge data for a text, according to various example embodiments of the present invention.
- FIG. 6 depicts a sample of constituent parsing result (constituent parsing tree) for an example sentence, according to various example embodiments of the present invention.
- FIGs. 7A to 7C depict schematic drawings of three data graphs (or graph representations) formed for three example sets of basic components, according to various example embodiments of the present invention.
- FIG. 8 depicts an example graphical user interface (GUI) generated by a computer processor for interaction with a user in an example implementation according to various example embodiments of the present invention.
- FIGs. 9A to 9D depict a schematic drawing of a structured knowledge data (semantic knowledge graph) formed for an example text, according to various example embodiments of the present invention.
- Various embodiments of the present invention provide a method of generating a structured knowledge data for a text (text data) comprising at least one sentence, and a system thereof, and more particularly, for an unstructured text (e.g., free text).
- various conventional knowledge representation techniques for a text do not provide a full or sufficient knowledge representation of the original meaning of a text (e.g., sentences in an unstructured text), resulting in the loss of knowledge (or meaning) of various sentences in the text, which may be important or useful information relating to various basic components (e.g., subject, predicate and/or object) in such sentences.
- various embodiments of the present invention provide a method of generating a structured knowledge data for a text, and a system thereof, that seek to overcome, or at least ameliorate, one or more of the deficiencies in conventional knowledge representation techniques for a text, such as but not limited to, improving or enhancing knowledge representation of a text.
- FIG. 1 depicts a flow diagram illustrating a method 100 of generating a structured knowledge data for a text (e.g., for an input text data) comprising at least one sentence, using at least one processor, according to various embodiments of the present invention.
- the method comprises extracting (at 102) a first set of basic components and one or more first modifiers associated with the first set of basic components from a first sentence of the text, the first set of basic components comprising a subject and a predicate associated with the subject; and forming (at 104) a first data graph for the first sentence based on the first set of basic components and the one or more first modifiers associated with the first set of basic components, the first data graph comprising a subject node, an object node and an edge connecting the subject node and the object node.
- the subject node and the edge of the first data graph are configured to represent the subject and the predicate of the first set of basic components, respectively.
- the above-mentioned forming the first data graph comprises assigning, for each of the one or more first modifiers corresponding to one of the subject and the predicate, the first modifier as an attribute to the corresponding one of the subject node and the edge.
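- The forming step (at 104) can be sketched as follows, assuming the extraction step (at 102) has already produced a subject, a predicate and the modifiers for each. The class names `Node`, `Edge` and `DataGraph`, the function `form_data_graph` and the hand-assigned modifiers are illustrative assumptions, not the implementation described in this document.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    label: Optional[str]  # subject or object text; None when nothing is represented
    attributes: List[str] = field(default_factory=list)


@dataclass
class Edge:
    label: str  # predicate text
    attributes: List[str] = field(default_factory=list)


@dataclass
class DataGraph:
    subject_node: Node
    edge: Edge  # connects subject_node and object_node
    object_node: Node


def form_data_graph(subject: str, predicate: str,
                    modifiers: Dict[str, List[str]]) -> DataGraph:
    """Form one data graph; `modifiers` maps "subject"/"predicate" to modifier texts."""
    graph = DataGraph(Node(subject), Edge(predicate), Node(None))
    # Each modifier becomes an attribute of the node or edge it corresponds to.
    graph.subject_node.attributes.extend(modifiers.get("subject", []))
    graph.edge.attributes.extend(modifiers.get("predicate", []))
    return graph


# Modifier assignment below is done by hand for illustration (an assumption).
graph = form_data_graph(
    subject="staff",
    predicate="is eligible",
    modifiers={
        "subject": ["with at least one Singapore Citizen child between 7 and 12 years old"],
        "predicate": ["unconditionally"],
    },
)
```

- Keeping modifiers as attributes on the existing node or edge, rather than as extra nodes, leaves the subject-predicate structure of the graph unchanged while retaining the additional information.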
- a sentence may include four types of components (or elements), namely, a subject, a predicate, an object and a modifier, whereby the modifier is an optional element in the sense that removal of the modifier from the sentence would generally not affect the grammar of the sentence.
- basic components (or basic elements) of a sentence refer to (or are obtained/derived from) components of the sentence which are non-optional, or in other words, selected from a group consisting of subject, predicate and object (which may be referred to herein as a “triple”), or excluding/without the modifiers.
- the structured knowledge data comprises one or more data graphs (e.g., including the above-mentioned first data graph).
- a data graph is a type of data structure, whereby data is organized or represented in a network of data (or data network) comprising one or more nodes and one or more edges. Accordingly, such a data network may be referred to as a data graph (or graph representation).
- a data graph comprises a subject node, an object node and an edge connecting the subject node and the object node, whereby the subject node and the edge of the data graph are configured to represent the subject and the predicate of a set of basic components.
- the object node may then be configured to represent the object.
- configuring a node or an edge of a data graph to represent a basic component may include configuring such a node or such an edge to include the basic component (e.g., include or be defined by data corresponding to the basic component).
- the edge of a data graph connecting the subject node and the object node may be a link established between the subject node and the object node, such that, for example, when the edge is configured to represent a predicate indicating a semantic relationship between a subject represented by the subject node and an object represented by the object node, the relationship of the subject node and object node of the data graph is then defined by the edge.
- one data graph may be formed (or generated) for each set of basic components (e.g., for each triple) of a sentence.
- a plurality of sets of basic components (e.g., corresponding to a plurality of triples)
- two or more data graphs may share one or more common nodes (e.g., common subject node and/or common object node).
- the two or more data graphs sharing one or more common nodes may be derived from the same sentence in the text.
- the two or more data graphs sharing one or more common nodes may be derived from different sentences in the text.
- a collection or set of data graphs (e.g., all data graphs) formed for a text may be collectively referred to herein as a data knowledge graph, a domain knowledge graph or a semantic knowledge graph.
- the structured knowledge data for a text may be a domain knowledge graph, including a collection of all data graphs formed with respect to the text.
- Each data graph or the structured knowledge data may be stored in a database in a memory and is accessible for various applications, such as for providing information in a search (e.g., question answering and question generation), for representing information (e.g., text summarization), and so on.
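- One possible way to store data graphs in a database and retrieve them for a simple lookup is sketched below using Python's built-in `sqlite3` module with an in-memory database. The table layout, the JSON encoding of attributes and the helper names `store_graph` and `facts_about` are assumptions made for illustration; the document does not specify a storage schema.

```python
import json
import sqlite3

# In-memory database; the single-table schema is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_graph ("
    " subject TEXT, predicate TEXT, object TEXT,"
    " subject_attrs TEXT, edge_attrs TEXT, object_attrs TEXT)"
)


def store_graph(subject, predicate, obj,
                subject_attrs=(), edge_attrs=(), object_attrs=()):
    """Store one data graph; attribute lists are serialized as JSON text."""
    conn.execute(
        "INSERT INTO data_graph VALUES (?, ?, ?, ?, ?, ?)",
        (subject, predicate, obj,
         json.dumps(list(subject_attrs)),
         json.dumps(list(edge_attrs)),
         json.dumps(list(object_attrs))),
    )


def facts_about(subject):
    """Return (predicate, object, edge attributes) for every graph of a subject."""
    rows = conn.execute(
        "SELECT predicate, object, edge_attrs FROM data_graph WHERE subject = ?",
        (subject,),
    ).fetchall()
    return [(predicate, obj, json.loads(attrs)) for predicate, obj, attrs in rows]


store_graph("staff", "is eligible to take", "Childcare Leave",
            edge_attrs=["unconditionally"],
            object_attrs=["2 days of", "per year"])

answers = facts_about("staff")
```

- A question-answering application could, for example, call `facts_about("staff")` and use the returned attributes to phrase conditions that a triple-only store would not contain.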
- assigning a modifier as an attribute (or property) to a corresponding node or edge may include linking (e.g., tagging or attaching) the corresponding node or edge with the modifier as an attribute (or property) thereof.
- a corresponding node or edge of a modifier is the node or edge that the modifier modifies (e.g., adds or changes meaning to). Accordingly, in various embodiments, each first modifier that corresponds to the subject or the predicate is assigned to the corresponding subject node or edge as an attribute thereof.
- extracting a set of basic components and one or more modifiers associated with the set of basic components from a sentence may refer to extracting at least a set of basic components including at least a subject and a predicate associated with the subject, and one or more modifiers (e.g., all modifiers) which modify any one or more of the basic components in the set.
- various embodiments of the present invention provide a method of generating a structured knowledge data for a text, and a system thereof, that advantageously improves or enhances knowledge representation of a text, such that the original meaning of various sentences in the text is better captured (or represented) and not lost.
- a set of basic components may only include a subject and a predicate (e.g., in the case where the sentence does not have an object), or may only include a triple, namely a subject, a predicate and an object (e.g., in the case where the sentence has the triple).
- the above-mentioned first set of basic components including a subject and a predicate may further include an object.
- the predicate indicates a semantic relationship between the subject and the object
- the object node is configured to represent the object.
- the above-mentioned forming the first data graph further comprises assigning, for each of the one or more first modifiers corresponding to the object, the first modifier as an attribute to the object node.
- each first modifier that corresponds to the object (e.g., that modifies the object, such as by changing or adding meaning to the object)
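- The two cases above (a set with only a subject and a predicate, and a full triple whose object has its own modifiers) can be sketched as follows. The names `BasicComponents`, `Element` and `build` are hypothetical, the short sentence "The office closes at noon" is made up, the modifier assignments are done by hand, and leaving the object node empty when there is no object is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BasicComponents:
    subject: str
    predicate: str
    object: Optional[str] = None  # absent when the sentence has no object


@dataclass
class Element:
    """A node or an edge: a label plus the modifiers assigned to it."""
    label: Optional[str]
    attributes: List[str] = field(default_factory=list)


def build(components, modifiers):
    """Return (subject_node, edge, object_node) with modifiers as attributes.

    `modifiers` is a list of (target, text) pairs, where target is one of
    "subject", "predicate" or "object".
    """
    elements = {
        "subject": Element(components.subject),
        "predicate": Element(components.predicate),
        "object": Element(components.object),
    }
    for target, text in modifiers:
        if target == "object" and components.object is None:
            raise ValueError("object modifier given but the set has no object")
        elements[target].attributes.append(text)
    return elements["subject"], elements["predicate"], elements["object"]


# Subject and predicate only (made-up sentence: "The office closes at noon").
_, edge, empty_object = build(BasicComponents("The office", "closes"),
                              [("predicate", "at noon")])

# Full triple, with hand-assigned modifiers that correspond to the object.
_, _, leave = build(
    BasicComponents("staff", "is eligible to take", "Childcare Leave"),
    [("object", "2 days of"), ("object", "per year")],
)
```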
- the above-mentioned extracting the first set of basic components and the one or more first modifiers comprises analyzing the text to identify constituents of the first sentence; and chunking the identified constituents of the first sentence to produce a plurality of chunk components.
- the above-mentioned first set of basic components and the one or more first modifiers are extracted from the plurality of chunk components.
- the method 100 further comprises: identifying one or more of the plurality of chunk components as a named entity; and labelling each of the one or more chunk components identified with a corresponding named entity class label.
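- A toy sketch of the chunking and named entity labelling steps is given below. It assumes the constituent analysis has already tagged each token with a phrase type; the hand-written tags, the grouping rule (consecutive tokens with the same tag form one chunk), the gazetteer and the class label `LEAVE_TYPE` are all illustrative assumptions rather than the technique used in this document.

```python
from itertools import groupby

# Hand-written (token, phrase tag) pairs standing in for a parser's output.
tagged = [
    ("A", "NP"), ("staff", "NP"),
    ("is", "VP"), ("eligible", "VP"), ("to", "VP"), ("take", "VP"),
    ("2", "NP"), ("days", "NP"),
    ("of", "PP"),
    ("Childcare", "NP"), ("Leave", "NP"),
    ("per", "PP"), ("year", "PP"),
]

# Hypothetical gazetteer mapping chunk text to a named entity class label.
GAZETTEER = {"Childcare Leave": "LEAVE_TYPE"}


def chunk(tokens):
    """Group consecutive tokens that share a phrase tag into one chunk component."""
    return [(" ".join(token for token, _ in group), tag)
            for tag, group in groupby(tokens, key=lambda pair: pair[1])]


def label_entities(chunks):
    """Attach a named entity class label (or None) to each chunk component."""
    return [(text, tag, GAZETTEER.get(text)) for text, tag in chunks]


chunks = label_entities(chunk(tagged))
```

- The basic components and modifiers would then be selected from `chunks`; that selection is not shown here.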
- the one or more first modifiers are one or more adjuncts of the first sentence.
- each modifier (e.g., first modifier) in a sentence is or refers to an adjunct in the sentence.
- the first data graph is a directed data graph.
- the edge of the first data graph may be directed, and more specifically, directed from the subject node to the object node.
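- A directed edge can be sketched with an adjacency structure that records only the subject-to-object direction. The class `DirectedGraph` and its method names are hypothetical.

```python
from collections import defaultdict


class DirectedGraph:
    """Edges are stored only in the subject -> object direction."""

    def __init__(self):
        self._out = defaultdict(list)  # subject -> [(predicate, object), ...]

    def add_edge(self, subject, predicate, obj):
        self._out[subject].append((predicate, obj))

    def successors(self, node):
        """Objects reachable from `node` by following one edge forwards."""
        return [obj for _, obj in self._out.get(node, [])]


g = DirectedGraph()
g.add_edge("staff", "is eligible to take", "Childcare Leave")
```

- Because the edge is directed, `g.successors("staff")` yields the object, while `g.successors("Childcare Leave")` yields nothing.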
- the text is an unstructured text (e.g., free text).
- an unstructured text may refer to a text that is not organized in a pre-defined manner (e.g., based on a pre-defined data model), such as a text in a natural language.
- one or more additional sets (which, in the context of such embodiments, may be referred to as one or more first additional sets) of basic components may be extracted from the first sentence.
- the method 100 further comprises extracting one or more first additional sets of basic components from the first sentence; extracting, for each of the one or more first additional sets of basic components, one or more additional modifiers associated with the first additional set of basic components from the first sentence of the text, the first additional set of basic components comprising a subject and a predicate associated with the subject; and forming, for each of the one or more first additional sets of basic components, an additional data graph for the first sentence based on the first additional set of basic components and the one or more additional modifiers associated with the first additional set of basic components, the additional data graph comprising a subject node, an object node and an edge connecting the subject node and the object node.
- the subject node and the edge of the additional data graph are configured to represent the subject and the predicate of the first additional set of basic components, respectively.
- the above-mentioned forming the additional data graph comprises assigning, for each of the one or more additional modifiers corresponding to one of the subject and the predicate, the additional modifier as an attribute to the corresponding one of the subject node and the edge. Accordingly, a plurality of data graphs may be formed for a sentence, and two or more of the plurality of data graphs may share one or more common nodes.
- the method 100 further comprises merging (or combining) the subject node of the first data graph and the subject node of the additional data graph of at least one of the one or more first additional sets of basic components from the first sentence as a common subject node (i.e., the first data graph and the additional data graph of the at least one of the one or more first additional sets of basic components share the common subject node) if the subject represented by the subject node of the first data graph and the subject represented by the subject node of the additional data graph of the at least one of the one or more first additional sets of basic components correspond to each other (i.e., belong to the same entity (subject)).
- the object nodes of multiple data graphs may also be merged in the same or similar manner if they belong to the same entity (object).
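- Merging nodes that belong to the same entity can be sketched with a registry keyed by a normalized label, so that a second data graph reuses the node created by the first. The matching rule used here (equality after trimming and lower-casing) is a deliberately simple assumption, the second triple is a made-up extraction from the example sentence, and `KnowledgeGraph` is a hypothetical name.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    label: str
    attributes: List[str] = field(default_factory=list)


class KnowledgeGraph:
    """Collection of data graphs whose nodes are shared when entities match."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.edges: List[Tuple[Node, str, Node]] = []

    def _node(self, label: str) -> Node:
        # Assumption: two labels denote the same entity if they are equal
        # after trimming and lower-casing. A real system would likely need
        # a stronger test for correspondence.
        key = label.strip().lower()
        return self.nodes.setdefault(key, Node(label))

    def add(self, subject: str, predicate: str, obj: str) -> None:
        """Add one data graph, reusing existing nodes where entities match."""
        self.edges.append((self._node(subject), predicate, self._node(obj)))


kg = KnowledgeGraph()
kg.add("staff", "is eligible to take", "Childcare Leave")
kg.add("Staff", "need not produce", "medical certificate")  # made-up second set

first_subject = kg.edges[0][0]
second_subject = kg.edges[1][0]
```

- Both edges now start from the same node object, so attributes assigned to the common subject node are visible from either data graph.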
- the text comprises a plurality of sentences and one or more additional sets (which, in the context of such embodiments, may be referred to as one or more second sets) of basic components may be extracted from additional sentence(s) in the plurality of sentences.
- the method 100 further comprises, for each additional sentence of the plurality of sentences: extracting a second set of basic components and one or more second modifiers associated with the second set of basic components from the additional sentence, the second set of basic components comprising a subject and a predicate associated with the subject; and forming a second data graph for the additional sentence based on the second set of basic components and the one or more second modifiers associated with the second set of basic components, the second data graph comprising a subject node, an object node and an edge connecting the subject node and the object node.
- the subject node and the edge of the second data graph are configured to represent the subject and the predicate of the second set of basic components, respectively.
- the above-mentioned forming the second data graph comprises assigning, for each of the one or more second modifiers corresponding to one of the subject and the predicate, the second modifier as an attribute to the corresponding one of the subject node and the edge.
- one or more data graphs may be formed for each sentence, resulting in a plurality of data graphs for the plurality of sentences. For example, two or more of the plurality of data graphs may share one or more common nodes.
- one or more additional sets (which, in the context of such embodiments, may be referred to as one or more second additional sets) of basic components may be extracted from any additional sentence in the same or similar manner as described hereinbefore, such as in relation to the one or more first additional sets of basic components.
- FIG. 2 depicts a schematic block diagram of a system 200 for generating a structured knowledge data for a text (e.g., for an input text data) comprising at least one sentence, according to various embodiments of the present invention, such as corresponding to the method 100 of generating a structured knowledge data as described hereinbefore according to various embodiments of the present invention.
- the system 200 comprises a memory 202, and at least one processor 204 communicatively coupled to the memory 202 and configured to: extract a first set of basic components and one or more first modifiers associated with the first set of basic components from a first sentence of the text, the first set of basic components comprising a subject and a predicate associated with the subject; and form a first data graph for the first sentence based on the first set of basic components and the one or more first modifiers associated with the first set of basic components, the first data graph comprising a subject node, an object node and an edge connecting the subject node and the object node.
- the subject node and the edge of the first data graph are configured to represent the subject and the predicate of the first set of basic components, respectively.
- the above-mentioned forming the first data graph comprises assigning, for each of the one or more first modifiers corresponding to one of the subject and the predicate, the first modifier as an attribute to the corresponding one of the subject node and the edge.
- the system 200 may be embodied as a device or an apparatus.
- the at least one processor 204 may be configured to perform the required functions or operations through set(s) of instructions (e.g., software modules) executable by the at least one processor 204 to perform the required functions or operations.
- the system 200 may further comprise a component extractor (or a component extracting module or circuit) 206 configured to perform the above-mentioned extracting (at 102) a first set of basic components and one or more first modifiers, and a data graph generator (or a data graph generating module or circuit) 208 configured to perform the above-mentioned forming (at 104) a first data graph for the first sentence.
- modules are not necessarily separate modules, and one or more modules may be realized by or implemented as one functional module (e.g., a circuit or a software program) as desired or as appropriate without deviating from the scope of the present invention.
- the component extractor 206 and the data graph generator 208 may be realized (e.g., compiled together) as one executable software program (e.g., software application or simply referred to as an “app”), which for example may be stored in the memory 202 and executable by the at least one processor 204 to perform the functions/operations as described herein according to various embodiments.
- the system 200 corresponds to the method 100 as described hereinbefore with reference to FIG. 1; therefore, various functions or operations configured to be performed by the at least one processor 204 may correspond to various steps of the method 100 described hereinbefore according to various embodiments, and thus need not be repeated with respect to the system 200 for clarity and conciseness.
- various embodiments described herein in context of the methods are analogously valid for the respective systems (e.g., which may also be embodied as devices), and vice versa.
- the memory 202 may have stored therein the component extractor 206 and/or the data graph generator 208, which respectively correspond to various steps of the method 100 as described hereinbefore according to various embodiments, which are executable by the at least one processor 204 to perform the corresponding functions/operations as described herein.
- a computing system, a controller, a microcontroller or any other system providing a processing capability may be provided according to various embodiments in the present disclosure.
- Such a system may be taken to include one or more processors and one or more computer-readable storage mediums.
- the system 200 described hereinbefore may include a processor (or controller) 204 and a computer-readable storage medium (or memory) 202 which are for example used in various processing carried out therein as described herein.
- a memory or computer-readable storage medium used in various embodiments may be a volatile memory, for example a DRAM (Dynamic Random Access Memory) or a non-volatile memory, for example a PROM (Programmable Read Only Memory), an EPROM (Erasable PROM), EEPROM (Electrically Erasable PROM), or a flash memory, e.g., a floating gate memory, a charge trapping memory, an MRAM (Magnetoresistive Random Access Memory) or a PCRAM (Phase Change Random Access Memory).
- a “circuit” may be understood as any kind of a logic implementing entity, which may be special purpose circuitry or a processor executing software stored in a memory, firmware, or any combination thereof.
- a “circuit” may be a hard-wired logic circuit or a programmable logic circuit such as a programmable processor, e.g., a microprocessor (e.g., a Complex Instruction Set Computer (CISC) processor or a Reduced Instruction Set Computer (RISC) processor).
- A “circuit” may also be a processor executing software, e.g., any kind of computer program, e.g., a computer program using a virtual machine code, e.g., Java.
- a “module” may be a portion of a system according to various embodiments in the present invention and may encompass a “circuit” as above, or may be understood to be any kind of a logic-implementing entity therefrom.
- An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result.
- the steps are those requiring physical manipulations of physical quantities, such as electrical, magnetic or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated.
- the present specification also discloses a system (e.g., which may also be embodied as a device or an apparatus) for performing the operations/functions of the methods described herein.
- a system may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer.
- the algorithms presented herein are not inherently related to any particular computer or other apparatus.
- Various general-purpose machines may be used with computer programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate.
- the present specification also at least implicitly discloses a computer program or software/functional module, in that it would be apparent to the person skilled in the art that the individual steps of the methods described herein may be put into effect by computer code.
- the computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein.
- the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the spirit or scope of the invention.
- modules described herein may be software module(s) realized by computer program(s) or set(s) of instructions executable by a computer processor to perform the required functions, or may be hardware module(s) being functional hardware unit(s) designed to perform the required functions. It will also be appreciated that a combination of hardware and software modules may be implemented.
- a computer program/module or method described herein may be performed in parallel rather than sequentially.
- Such a computer program may be stored on any computer readable medium.
- the computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer.
- the computer program when loaded and executed on such a general-purpose computer effectively results in an apparatus that implements the steps of the methods described herein.
- a computer program product embodied in one or more computer-readable storage mediums (non-transitory computer-readable storage medium), comprising instructions (e.g., the component extractor 206 and/or the data graph generator 208) executable by one or more computer processors to perform a method 100 of generating a structured knowledge data as described hereinbefore with reference to FIG. 1.
- various computer programs or modules described herein may be stored in a computer program product receivable by a system therein, such as the system 200 as shown in FIG. 2, for execution by at least one processor 204 of the system 200 to perform the required or desired functions.
- the software or functional modules described herein may also be implemented as hardware modules. More particularly, in the hardware sense, a module is a functional hardware unit designed for use with other components or modules. For example, a module may be implemented using discrete electronic components, or it can form a portion of an entire electronic circuit such as an Application Specific Integrated Circuit (ASIC). Numerous other possibilities exist. Those skilled in the art will appreciate that the software or functional module(s) described herein can also be implemented as a combination of hardware and software modules.
- the system 200 may be realized by any computer system (e.g., portable or desktop computer system, such as tablet computers, laptop computers, mobile communications devices (e.g., smart phones), and so on) including at least one processor and a memory, such as a computer system 300 as schematically shown in FIG. 3 as an example only and without limitation.
- Various methods/steps or functional modules e.g., the component extractor 206 and/or the data graph generator 208, may be implemented as software, such as a computer program being executed within the computer system 300, and instructing the computer system 300 (in particular, one or more processors therein) to conduct the methods/functions of various embodiments described herein.
- the computer system 300 may comprise a computer module 302, input modules, such as a keyboard 304 and a mouse 306, and a plurality of output devices such as a display 308, and a printer 310.
- the computer module 302 may be connected to a computer network 312 via a suitable transceiver device 314, to enable access to e.g., the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN).
- the computer module 302 in the example may include a processor 318 for executing various instructions, a Random Access Memory (RAM) 320 and a Read Only Memory (ROM) 322.
- the computer module 302 may also include a number of Input/Output (I/O) interfaces, for example I/O interface 324 to the display 308, and I/O interface 326 to the keyboard 304.
- the components of the computer module 302 typically communicate via an interconnected bus 328 and in a manner known to the person skilled in the relevant art.
- Various example embodiments relate to text analytics and provide a method and a product (e.g., a computer program product, embodied in one or more computer-readable storage mediums (non-transitory computer-readable storage medium)) for processing free text (unstructured text data) from an unstructured text document (e.g., an unstructured text input) to generate a structured knowledge data.
- various example embodiments seek to represent a free text document in a unified semantic model which is easy for a computer to understand and process.
- various example embodiments provide a complete knowledge representation model for a free text document, which leverages on the state-of-the-art natural language processing and semantic knowledge graph methods.
- knowledge information can be dynamically acquired from a free text document, an internet web page and so on, and automatically organized in a structured knowledge data (e.g., a data knowledge graph, which may be simply referred to herein as “knowledge graph”) for easy and robust information query.
- Various application areas include, but are not limited to, open information extraction, question answering, document summarization and question generation.
- the structured knowledge data generated according to various example embodiments has been found to be especially efficient and useful for question answering over unstructured text (free text) documents, such as internet web pages.
- Various example embodiments provide a method for flexible knowledge representation and extraction which bridges a gap between open information extraction and semantic knowledge graph.
- the method (and corresponding system) seeks to capture (or represent) the meaning (e.g., sufficiently or fully) of the original text document.
- the structured knowledge data generated may be stored in a database (e.g., a graph database system) for better scalability and management.
- various example embodiments provide a unified and scalable knowledge representation method or model which makes use of the state-of-the-art natural language processing and semantic knowledge graph methods.
- the knowledge representation method seeks to fully (or at least sufficiently or as much as possible or desired) represent the meaning of the original text document by integrating the knowledge information therein into a structured knowledge data and storing the structured knowledge data in a database for better scalability and data management.
- the structured knowledge data is in the form of a data knowledge graph (e.g., semantic graph network) that is organized in a way that the original syntactical information is retained as much as possible.
- the data knowledge graph is advantageously query oriented and capable of knowledge inference.
- the knowledge representation method is able to represent all the modifiers (adjuncts) of a sentence in addition to the basic components (main meaning components), namely, subject, predicate and object.
- the knowledge data are fully integrated in a data knowledge graph and stored in a graph database that is accessible for various applications, such as question answering.
- a sentence is a group of words that are put together to mean something.
- a sentence is the basic unit of language which expresses a complete thought.
- the complete thought is about someone or something (subject) and what the subject is about or doing (predicate).
- for example, the sentence “Bird chirps” is composed of a subject (“Bird”) and a predicate (“chirps”), as shown in FIG. 4A.
- FIGs. 4B and 4C show examples of sentences including a subject, a predicate and an object, namely, “Sam likes chocolate” and “John sent me a present”.
- modifiers are used in a sentence to describe other words, for instance, adjective modifiers and adverb modifiers.
- Adjective modifiers describe nouns and pronouns (e.g., answering Which one? What kind? How many? Whose?).
- Adverb modifiers describe verbs, adjectives and other adverbs (e.g., answering How? When? Where? Why? To what extent?).
- these adjuncts are optional and structurally dispensable. However, they give additional information (knowledge information) about sentence functionaries (subject, predicate, object, etc.) and make the sentence meaning more complete and accurate.
- adjuncts may each be a word, a phrase (e.g., noun phrase, adverb phrase and prepositional phrase) or a clause.
- prepositional phrases as adjuncts usually describe when or where something happens (e.g., referring to a time or a place). The following are some non-limiting examples:
- various example embodiments of the present invention seek to completely represent a full sentence by taking into account (e.g., capturing or representing) both basic components of a sentence (sentence fundamental elements, namely, subject, predicate and object) and other various modifiers (adjuncts).
- various conventional knowledge representation techniques only extract triples (i.e., subject, predicate and object) from sentences. Accordingly, such conventional knowledge representation techniques only take into account the basic components of the sentence (namely, the triple), but ignore or disregard other sentence components, such as modifiers (adjuncts). Therefore, important or useful information relating to various basic components in a sentence is not captured (or represented), resulting in an incomplete or insufficient knowledge representation of the original meaning of the sentence. In other words, the original meaning of the sentence may be lost.
- FIG. 5 depicts a schematic drawing of an example system 500 for generating a structured knowledge data for a text according to various example embodiments of the present invention.
- the system 500 includes an analysis module 504 configured to analyze a sentence of a text (e.g., an input sentence) for identifying constituents of the sentence; a chunking module 508 configured to perform chunking (sentence chunking) on the identified constituents of the sentence to produce a plurality of chunk components (which may also be referred to herein simply as “chunks”); an extraction module 512 configured to extract a set of basic components and one or more modifiers (adjuncts) associated with the set of basic components from the plurality of chunk components; and a graph creation module 516 configured to form a data graph for the sentence based on the set of basic components and the one or more modifiers extracted.
- the sentence may be analyzed by being parsed into its constituents according to various parsing techniques known in the art, such as but not limited to, Top-Down parsing or Bottom-Up parsing, and thus need not be described in detail herein for clarity and conciseness.
- the analysis module 504 may output a constituent parsing tree as shown in FIG. 6, whereby the tags (e.g., as shown in FIG.
- the sentence may further be analyzed based on coreference resolution as known in the art, and thus need not be described in detail herein for clarity and conciseness.
- coreference resolution is a technique for finding all expressions or sentence components that refer to the same entity in a sentence or text.
- coreference resolution may make every extracted triple refer to its actual entity, instead of using pronouns such as “he”, “she”, “him”, and so on.
- the extracted triples for two example sentences “Peter is an engineer. He likes programming” may be: (“Peter”, “is”, “an engineer”), (“He”, “likes”, “programming”).
- when the data graphs formed based on the above two triples are saved into a graph database, the context between the above two example sentences no longer exists and hence it will not be known what “He” is referring to.
- the subject “He” in the latter example sentence above may be replaced by “Peter” and according to various example embodiments, the data graphs formed based on the above two triples may share the same subject node “Peter” in a graph database.
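The pronoun replacement described above can be sketched as follows. This is a minimal illustration, assuming a coreference resolver has already produced a pronoun-to-entity mapping; the `coref_map` dictionary and the `resolve` helper are hypothetical names and do not come from the specification.

```python
# Hypothetical output of a coreference resolver: pronoun -> entity it refers to.
coref_map = {"He": "Peter"}

# Triples extracted from "Peter is an engineer. He likes programming".
triples = [("Peter", "is", "an engineer"), ("He", "likes", "programming")]


def resolve(triple, mapping):
    """Replace the subject and object of a triple by the entity they refer to, if known."""
    subject, predicate, obj = triple
    return (mapping.get(subject, subject), predicate, mapping.get(obj, obj))


resolved = [resolve(t, coref_map) for t in triples]
subjects = {subject for subject, _, _ in resolved}

print(resolved)
print(subjects)
```

After resolution both triples carry the subject “Peter”, so their data graphs can share one subject node when stored.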
- the chunking module 508 is provided to perform this function or operation according to various example embodiments.
- the chunking module 508 may produce an output “A staff/NP with/IN at least/ADVP one Singapore Citizen child/NP between 7/PP and/CC 12 years old/AD is/VB eligible to/ADJP take/VB 2 days/NP of Childcare/PP Leave/VB per year/PP unconditionally/AD ,/, without the need to/PP produce/VB a medical certificate/NP ./.”, whereby “NP”, “IN”, “ADVP” and “PP” are as defined hereinbefore, “CC” denotes coordinating conjunction, “AD” denotes adverb, “VB” denotes verb base form and “ADJP” denotes adjective phrase.
- the chunking and extraction are based on dependency parsing and constituent parsing of the sentences along with a set of rules of grammar and/or syntax.
- An example constituent parsing tree output from the analysis module 504 is shown in FIG. 6 as described hereinbefore, which embeds a range of useful information for chunking and component identification, such as subject, predicate, object and adjuncts (including the specific types of adjuncts).
- chunking may be a kind of shallow parsing which adds more structure to a sentence after part-of-speech (POS) tagging, which for example may be implemented using regular expression rules based on POS tags.
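Chunking with regular expression rules over POS tags can be sketched as below. This is only an illustration under stated assumptions: the input is assumed to be already POS-tagged with Penn Treebank style tags, and the single noun phrase rule, the `NP_RULE` pattern and the `chunk_noun_phrases` function are hypothetical and not taken from the specification.

```python
import re

# Hypothetical rule: optional determiner, any adjectives, then one or more nouns.
NP_RULE = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NNP?S?>)+")


def chunk_noun_phrases(tagged):
    """Return noun phrase chunks from a list of (word, POS tag) pairs."""
    # Encode the tag sequence as "<DT><JJ><NNS>..." and record where each tag starts.
    starts, parts, position = [], [], 0
    for _, tag in tagged:
        starts.append(position)
        part = f"<{tag}>"
        parts.append(part)
        position += len(part)
    tag_string = "".join(parts)

    chunks = []
    for match in NP_RULE.finditer(tag_string):
        first = starts.index(match.start())          # index of the first token in the chunk
        last = first + match.group().count("<")      # one "<" per matched tag
        chunks.append(" ".join(word for word, _ in tagged[first:last]))
    return chunks


tagged_sentence = [
    ("The", "DT"), ("blue", "JJ"), ("birds", "NNS"),
    ("sing", "VBP"), ("in", "IN"), ("the", "DT"), ("garden", "NN"),
]
noun_chunks = chunk_noun_phrases(tagged_sentence)
print(noun_chunks)
```

A fuller rule set would add patterns for verb, adverb and prepositional phrases; the same tag-string encoding can be reused for each rule.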
- the chunking may be deduced from these parsing results.
- the constituent parsing tree as shown in FIG. 6 includes some trunk information in the upper level tree nodes.
- the words “The blue birds” share the same parent tree node “NP” which constitutes a chunk (chunk component) of a subject in the sentence.
- This trunk can be further confirmed by the dependency parsing result in which the words “The” and “blue” are the modifiers of “birds”.
- the triple extraction may be performed at the trunk level which has fewer components compared to the original sentence.
- the chunking module 508 may be configured to produce a plurality of chunk components using a deep neural network (DNN) model.
- the graph creation module 516 is configured to form a data graph for each set of basic components based on the set of basic components and the one or more modifiers associated with the set of basic components extracted by the extraction module 512.
- a data graph is formed for each set of basic components, whereby the set of basic components (e.g., subject, predicate and object) and their relationship is modelled as a data graph, whereby the subject and the object are represented as nodes (subject and object nodes) and the predicate is represented as the edge of the data graph.
- the data graph is a directed data graph.
- the graph creation module 516 is configured to assign, for each of the one or more modifiers corresponding to one of the basic components, the modifier as an attribute (or property) to the corresponding one of the subject node, object node and edge.
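The node, edge and attribute model described above can be sketched with plain data classes. This is a minimal sketch and not the implementation of the graph creation module 516: the `Node` and `Edge` classes, the `build_data_graph` function and the attribute names (`manner`, `location`, `time`) are hypothetical, and the sample values are loosely based on the dog-and-rat example used elsewhere in the description.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str
    source: Node   # subject node
    target: Node   # object node
    attributes: dict = field(default_factory=dict)


def build_data_graph(subject, predicate, obj, modifiers):
    """Build one directed data graph for a triple.

    modifiers is a list of (component, name, value) tuples, where component is
    "subject", "predicate" or "object" and says which element the modifier describes.
    """
    subject_node, object_node = Node(subject), Node(obj)
    edge = Edge(predicate, subject_node, object_node)
    holders = {"subject": subject_node, "predicate": edge, "object": object_node}
    for component, name, value in modifiers:
        # Each modifier becomes an attribute of the node or edge it modifies.
        holders[component].attributes[name] = value
    return edge


graph = build_data_graph(
    "dog", "caught", "rat",
    [
        ("predicate", "manner", "successfully"),
        ("predicate", "location", "in the field"),
        ("predicate", "time", "last month"),
    ],
)
print(graph.source.label, graph.label, graph.target.label, graph.attributes)
```

Returning the edge is enough to reach both nodes, since the edge is directed from the subject node to the object node.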
- the relationship (or association) between the modifiers and the triple elements may be deduced from the dependency parsing result.
- the example sentence may be parsed to produce a dependency parsing result with the following relations:
- the modifiers of the predicate “caught”, for instance, may be determined from the above relations (as shown in bold), which include “successfully-4”, “field-11” and “month-13”.
- a data graph may be formed for each set of basic components (each triple), and thus, multiple data graphs may be formed for multiple sets of basic components (multiple triples) extracted from a given text and an overall network of nodes and edges may thus be formed for a given text, such as illustrated in FIGs. 9A to 9D for an example text (described later below).
- the overall network of nodes and edges may be formed by processing the multiple data graphs formed for the multiple triples to share common nodes where appropriate, such as by combining/merging all instances of the same nodes (e.g., representing the same entity) together (e.g., all subject nodes of different data graphs representing the same subject are configured as a common subject node).
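Merging the per-triple data graphs so that identical nodes are shared can be sketched as below. This is an assumption-laden illustration: nodes are treated as the same entity when their labels are equal, and the `merge_triples` function and the dictionary-based node records are hypothetical.

```python
def merge_triples(triples):
    """Combine (subject, predicate, object) triples into one network with shared nodes."""
    nodes = {}   # label -> node record; one record per distinct entity
    edges = []   # (subject node, predicate label, object node)
    for subject, predicate, obj in triples:
        # setdefault returns the existing node if the label was seen before.
        subject_node = nodes.setdefault(subject, {"label": subject})
        object_node = nodes.setdefault(obj, {"label": obj})
        edges.append((subject_node, predicate, object_node))
    return nodes, edges


nodes, edges = merge_triples([
    ("Peter", "is", "an engineer"),
    ("Peter", "likes", "programming"),
])
print(len(nodes), len(edges))
```

Both edges start from the very same `Peter` record, which mirrors the common subject node described above.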
- the input text may be preprocessed before being sent for analysis.
- preprocessing may include text normalization (e.g., converting all letters into lower case, removing white spaces, and so on), tokenization, stemming, lemmatization, and so on. Such preprocessing helps nodes that appear in different forms to be present (exist) as one common node in the graph database.
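A small sketch of the normalization step is given below. It covers only lower-casing and white space cleanup; stemming and lemmatization are left out, and the `normalize_label` function is a hypothetical name.

```python
import re


def normalize_label(text):
    """Collapse runs of white space, trim the ends and convert to lower case."""
    return re.sub(r"\s+", " ", text).strip().lower()


variants = ["The Dog", "  the   dog ", "THE DOG"]
normalized = {normalize_label(v) for v in variants}
print(normalized)
```

All three surface forms map to one key, so they would be stored as a single node.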
- subject, predicate, object and various modifiers with knowledge graph and related useful attributes are provided to describe (or represent) the sentence components (e.g., chunk components).
- the basic components (most meaningful elements, e.g., the subject, predicate, object) and their relationship are modelled as a directed graph whereby subject and object are represented as the nodes of the directed graph and predicate is represented as the edge of the directed graph.
- the data graph creation module 516 may be configured to form a data graph 704 as illustrated in FIG. 7A to represent the example sentence.
- an advantage of such a triple representation is that the basic components (main meaning components) of the sentence may then be easily queried with the existing query languages of graph databases.
- the triple representation also makes reasoning possible, similar to the Resource Description Framework (RDF) triple store.
- adjuncts are modelled as attributes (or properties) of the nodes or edge, depending on their role in the sentences or relationship to the nodes or edge.
- the adjuncts are modelled as attributes and are each assigned (e.g., attached) to the corresponding graph node or edge according to various example embodiments.
- the complete graph representation (data graph) with attributes for the example sentence is illustrated in FIG. 7B.
- chunk elements are carried out with the state-of-the-art natural language processing technology such as named entity recognition, coreference resolution, and so on.
- the prepositional phrase “in the field” may be identified and tagged with a named entity class label (or tag) “location” while the noun phrase “last month” may be identified and tagged with a named entity class label (or tag) “time”.
- type questions such as “Where did the dog catch the rat?” and “When did the dog catch the rat?”.
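How named entity class labels on attributes can support such “where” and “when” questions is sketched below. The attribute records, the `QUESTION_WORD_TO_CLASS` mapping and the `answer` function are hypothetical, and a real system would query the graph database rather than a Python list.

```python
# Hypothetical attributes attached to the edge "caught", each with an entity class label.
edge_attributes = [
    {"text": "successfully", "entity_class": None},
    {"text": "in the field", "entity_class": "location"},
    {"text": "last month", "entity_class": "time"},
]

# Assumed mapping from the leading question word to the entity class it asks about.
QUESTION_WORD_TO_CLASS = {"where": "location", "when": "time"}


def answer(question, attributes):
    """Return the attribute texts whose entity class matches the question word."""
    words = question.strip().split()
    wanted = QUESTION_WORD_TO_CLASS.get(words[0].lower()) if words else None
    if wanted is None:
        return []
    return [a["text"] for a in attributes if a["entity_class"] == wanted]


print(answer("Where did the dog catch the rat?", edge_attributes))
print(answer("When did the dog catch the rat?", edge_attributes))
```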
- sentences may have all basic components (subject, predicate and object) present or may have only part of the basic components present (subject and predicate, without object).
- the data graphs shown in FIGs. 7A and 7B are formed for the above-mentioned example sentences having all basic components present.
- various example embodiments represent the missing object as a blank object node in the data graph formed.
- the data graph creation module 516 may be configured to form a data graph 712 having an empty object node as illustrated in FIG. 7C to represent the example sentence.
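The blank object node can be sketched as follows, using the subject-and-predicate sentence “Bird chirps” of FIG. 4A as sample input. The `triple_to_graph` function and the use of an empty string as the blank label are assumptions made for illustration only.

```python
def triple_to_graph(subject, predicate, obj=None):
    """Form a data graph for a triple; a missing object becomes a blank object node."""
    return {
        "subject": {"label": subject},
        "edge": {"label": predicate},
        # The edge still needs a target node, so a blank node stands in for the object.
        "object": {"label": obj if obj is not None else ""},
    }


graph_without_object = triple_to_graph("Bird", "chirps")
graph_with_object = triple_to_graph("Sam", "likes", "chocolate")
print(graph_without_object)
```

Keeping the object node even when it is blank gives every data graph the same shape, which keeps queries uniform.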
- a method for producing semantic information from free text (or unstructured text) sources is provided.
- the method provides a complete knowledge representation model that gives a meaning representation to a free text document and extracts knowledge based on natural language processing and rules, to generate a query oriented semantic graph for various applications, such as answering or addressing presented questions.
- the method comprises steps of analysing the text to extract linguistic components or elements such as basic components (fundamental elements, namely subject, predicate and object) and various modifiers (adjuncts); dividing a sentence of free text into non-overlapping segments based on the extracted linguistic elements and semantic rules (in an example, a constituent parsing tree) (e.g., extracting triples based on the analysis and chunking results); representing semantics in the form of a graphical representation such as a network of nodes and edges, whereby the graphical representation comprises attributes of the nodes and/or edges, which are modeled from adjuncts; generating a structured knowledge data comprising a combination or network of the data graphs (including nodes and edges, along with associated attributes); and storing the structured knowledge data in a database to be retrievable for various applications, such as queries or questions.
- text analysis by the analysis module 504 may be performed by various conventional natural language processing methods.
- sentences may be divided or segmented using text chunking, such as a rule-based text chunking.
- named entity recognition may help to tag or label one or more modifiers with a corresponding named entity class tag or label (e.g., tagging a modifier “last month” with a “time” entity class tag as shown in FIG. 7B).
- a computer program product embodied in one or more computer-readable storage mediums (non-transitory computer-readable storage medium), comprising instructions (e.g., the component extractor 206 and/or the data graph generator 208) executable by one or more computer processors to perform a method of generating a structured knowledge data as described hereinbefore according to various embodiments.
- the computer program product may further comprise instructions executable by one or more computer processors to generate a graphical user interface (GUI) for receiving various inputs (e.g., text data for which a data knowledge graph is to be generated and stored in a graph database, question(s), and so on) and providing various outputs (e.g., displaying an answer to a question inputted by a user).
- a computer program product configured to generate or produce semantic information (e.g., providing an answer to a query) from free text (unstructured text) sources based on the structured knowledge data (data knowledge graph) generated using a method according to various embodiments of the present invention (which may be referred to hereinafter as “the present method”), such as described hereinbefore with reference to FIG. 1.
- FIG. 8 depicts an example GUI 800 generated by a computer processor for interaction with a user in an example implementation according to an example embodiment of the present invention.
- a publicly available graph database may be used as a backend database for better scale-up and handling of a possibly large number of triple relations.
- the GUI 800 may directly extract knowledge information from a web page, a text file or text manually input by a user.
- a web link may be provided via the GUI 800 to the system for knowledge information extraction, as well as question generating and answering.
- FIGs. 9A to 9D depict a schematic drawing/illustration of the semantic knowledge graph 900 generated, including a plurality of merged data graphs. It will be appreciated that FIGs. 9C and 9D are partial views of a merged data graph and may be joined at corresponding sides to show the complete merged data graph. As can be seen from FIGs. 9A to 9D, subject nodes of data graphs identified to be the same (e.g., represent the same entity) may be merged into one common subject node so that such data graphs share the common subject node.
- object nodes of data graphs identified to be the same may be merged into one common object node so that such data graphs share the common object node.
- data graphs sharing one or more common nodes may collectively be referred to as a merged data graph.
- questions may then be immediately generated about the content of the web page as shown in the output text box of FIG. 8.
- the present method (and the corresponding system) of generating a structured knowledge data has been found to be efficient and effective in answering various questions on the extracted web content.
- the present method advantageously improves or enhances knowledge representation of a text, such that the original meaning of various sentences in the text is better captured (or represented) and not lost.
Abstract
There is provided a method of generating a structured knowledge data for a text including at least one sentence, using at least one processor, the method including: extracting a first set of basic components and one or more first modifiers associated with the first set of basic components from a first sentence of the text, the first set of basic components comprising a subject and a predicate associated with the subject; and forming a first data graph for the first sentence based on the first set of basic components and the one or more first modifiers associated with the first set of basic components, the first data graph including a subject node, an object node and an edge connecting the subject node and the object node. The subject node and the edge of the first data graph are configured to represent the subject and the predicate of the first set of basic components, respectively. In particular, the above-mentioned forming the first data graph includes assigning, for each of the one or more first modifiers corresponding to one of the subject and the predicate, the first modifier as an attribute to the corresponding one of the subject node and the edge. There is also provided a corresponding system for generating a structured knowledge data for a text.
Description
METHOD AND SYSTEM FOR GENERATING A STRUCTURED KNOWLEDGE DATA FOR A TEXT
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application claims the benefit of priority of Singapore Patent Application No. 10201801825V, filed 6 March 2018, the content of which is hereby incorporated by reference in its entirety for all purposes.
TECHNICAL FIELD
[0002] The present invention generally relates to a method of generating a structured knowledge data for a text, and a system thereof.
BACKGROUND
[0003] A large amount of information is in the form of unstructured text (e.g., free text) in, for example, internet web pages and text documents. The rapidly increasing information generated in the World Wide Web, email communications and social media postings makes it a challenge to turn such information into a structured data format (data structure) for easy access and digestion. Various conventional search engines are capable of web content crawling, indexing and searching, but are unable to perform comprehensive web document understanding and/or representation. For example, various conventional information extraction systems only focus on how to extract triple knowledge (i.e., a set of subject, predicate and object) using various conventional algorithms that generate raw triples in text format. Accordingly, such conventional information extraction systems (or conventional knowledge representation techniques) are not capable of providing a full or sufficient representation of the original meaning of the unstructured text (e.g., the retrieved sentences in the unstructured text).
[0004] For example, various conventional knowledge representation techniques only extract triples (i.e., sets of subject, predicate and object) from sentences. Accordingly, such conventional knowledge representation techniques only take into account the basic components (e.g., major meaning components) of the sentence, namely, subject, predicate and object, but ignore or disregard other sentence components, such as modifiers (or adjuncts). By way of an example, an example sentence is provided below:
A staff with at least one Singapore Citizen child between 7 and 12 years old is eligible to take 2 days of Childcare Leave per year unconditionally, without the need to produce a medical certificate.
[0005] Based on the above-mentioned conventional knowledge representation technique, the resulting knowledge representation for the above example sentence may be: (1) “staff” (subject), “is eligible” (predicate) and “Childcare Leave” (object) or (2) “staff” (subject), “is eligible to take” (predicate) and “Childcare Leave” (object). It can be seen that neither knowledge representation (or meaning representation) for the above example sentence provides any meaningful additional information in relation to the triple. For example, the prepositional phrases and noun phrases in the above example sentence provide additional information on the condition and time, which are helpful for conveying the original meaning of the example sentence above, but are not captured (or represented) in the above-mentioned conventional knowledge representation technique.
[0006] A need therefore exists to provide a method of generating a structured knowledge data for a text, and a system thereof, that seek to overcome, or at least ameliorate, one or more of the deficiencies in conventional knowledge representation techniques for a text, such as but not limited to, improving or enhancing knowledge representation of a text. It is against this background that the present invention has been developed.
SUMMARY
[0007] According to a first aspect of the present invention, there is provided a method of generating a structured knowledge data for a text comprising at least one sentence, using at least one processor, the method comprising:
extracting a first set of basic components and one or more first modifiers associated with the first set of basic components from a first sentence of the text, the first set of basic components comprising a subject and a predicate associated with the subject; and
forming a first data graph for the first sentence based on the first set of basic components and the one or more first modifiers associated with the first set of basic components, the first data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
the subject node and the edge of the first data graph are configured to represent the subject and the predicate of the first set of basic components, respectively, and
the above-mentioned forming the first data graph comprises assigning, for each of the one or more first modifiers corresponding to one of the subject and the predicate, the first modifier as an attribute to the corresponding one of the subject node and the edge.
[0008] In various embodiments, the first set of basic components further comprises an object, the predicate indicates a semantic relationship between the subject and the object, the object node is configured to represent the object, and the above-mentioned forming the first data graph further comprises assigning, for each of the one or more first modifiers corresponding to the object, the first modifier as an attribute to the object node.
[0009] In various embodiments, the above-mentioned extracting the first set of basic components and the one or more first modifiers comprises:
analyzing the text to identify constituents of the first sentence; and
chunking the identified constituents of the first sentence to produce a plurality of chunk components,
wherein the first set of basic components and the one or more first modifiers are extracted from the plurality of chunk components.
[0010] In various embodiments, the method further comprises:
identifying one or more of the plurality of chunk components as a named entity; and
labelling each of the one or more chunk components identified with a corresponding named entity class label.
[0011] In various embodiments, the one or more first modifiers are one or more adjuncts of the first sentence.
[0012] In various embodiments, the first data graph is a directed data graph.
[0013] In various embodiments, the text is an unstructured text.
[0014] In various embodiments, the method further comprises:
extracting one or more first additional sets of basic components from the first sentence;
extracting, for each of the one or more first additional sets of basic components, one or more additional modifiers associated with the first additional set of basic components from the first sentence of the text, the first additional set of basic components comprising a subject and a predicate associated with the subject; and
forming, for each of the one or more first additional sets of basic components, an additional data graph for the first sentence based on the first additional set of basic components and the one or more additional modifiers associated with the first additional set of basic components, the additional data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
the subject node and the edge of the additional data graph are configured to represent the subject and the predicate of the first additional set of basic components, respectively, and
the above-mentioned forming the additional data graph comprises assigning, for each of the one or more additional modifiers corresponding to one of the subject and the predicate, the additional modifier as an attribute to the corresponding one of the subject node and the edge.
[0015] In various embodiments, the method further comprises merging the subject node of the first data graph and the subject node of the additional data graph of at least one of the one or more first additional sets of basic components from the first sentence as a common subject node if the subject represented by the subject node of the first data graph and the subject represented by the subject node of the additional data graph of the at least one of the one or more first additional sets of basic components correspond to each other.
[0016] In various embodiments, the text comprises a plurality of sentences and the method further comprises, for each additional sentence of the plurality of sentences: extracting a second set of basic components and one or more second modifiers associated with the second set of basic components from the additional sentence, the second set of basic components comprising a subject and a predicate associated with the subject; and
forming a second data graph for the additional sentence based on the second set of basic components and the one or more second modifiers associated with the second set of basic components, the second data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
the subject node and the edge of the second data graph are configured to represent the subject and the predicate of the second set of basic components, respectively, and the above-mentioned forming the second data graph comprises assigning, for each of the one or more second modifiers corresponding to one of the subject and the predicate, the second modifier as an attribute to the corresponding one of the subject node and the edge.
[0017] According to a second aspect of the present invention, there is provided a system for generating a structured knowledge data for a text comprising at least one sentence, the system comprising:
a memory; and
at least one processor communicatively coupled to the memory and configured to: extract a first set of basic components and one or more first modifiers associated with the first set of basic components from a first sentence of the text, the first set of basic components comprising a subject and a predicate associated with the subject; and
form a first data graph for the first sentence based on the first set of basic components and the one or more first modifiers associated with the first set of basic components, the first data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
the subject node and the edge of the first data graph are configured to represent the subject and the predicate of the first set of basic components, respectively, and
the above-mentioned form the first data graph comprises assigning, for each of the one or more first modifiers corresponding to one of the subject and the predicate, the first modifier as an attribute to the corresponding one of the subject node and the edge.
[0018] In various embodiments, the first set of basic components further comprises an object, the predicate indicates a semantic relationship between the subject and the object, the object node is configured to represent the object, and the above-mentioned form the first data graph further comprises assigning, for each of the one or more first modifiers corresponding to the object, the first modifier as an attribute to the object node.
[0019] In various embodiments, the above-mentioned extract the first set of basic components and the one or more first modifiers comprises:
analyzing the text to identify constituents of the first sentence; and
chunking the identified constituents of the first sentence to produce a plurality of chunk components,
wherein the first set of basic components and the one or more first modifiers are extracted from the plurality of chunk components.
[0020] In various embodiments, the at least one processor is further configured to: identify one or more of the plurality of chunk components as a named entity; and label each of the one or more chunk components identified with a corresponding named entity class label.
[0021] In various embodiments, the one or more first modifiers are one or more adjuncts of the first sentence, and the text is an unstructured text.
[0022] In various embodiments, the first data graph is a directed data graph.
[0023] In various embodiments, the at least one processor is further configured to: extract one or more first additional sets of basic components from the first sentence;
extract, for each of the one or more first additional sets of basic components, one or more additional modifiers associated with the first additional set of basic components from the first sentence of the text, the first additional set of basic components comprising a subject and a predicate associated with the subject; and
form, for each of the one or more first additional sets of basic components, an additional data graph for the first sentence based on the first additional set of basic components and the one or more additional modifiers associated with the first additional set of basic components, the additional data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
the subject node and the edge of the additional data graph are configured to represent the subject and the predicate of the first additional set of basic components, respectively, and
the above-mentioned form the additional data graph comprises assigning, for each of the one or more additional modifiers corresponding to one of the subject and the predicate, the additional modifier as an attribute to the corresponding one of the subject node and the edge.
[0024] In various embodiments, the at least one processor is further configured to merge the subject node of the first data graph and the subject node of the additional data graph of at least one of the one or more first additional sets of basic components from the first sentence as a common subject node if the subject represented by the subject node of the first data graph and the subject represented by the subject node of the additional data graph of the at least one of the one or more first additional sets of basic components correspond to each other.
[0025] In various embodiments, the text comprises a plurality of sentences and the at least one processor is further configured to, for each additional sentence of the plurality of sentences:
extract a second set of basic components and one or more second modifiers associated with the second set of basic components from the additional sentence, the second set of basic components comprising a subject and a predicate associated with the subject; and
form a second data graph for the additional sentence based on the second set of basic components and the one or more second modifiers associated with the second set of basic components, the second data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
the subject node and the edge of the second data graph are configured to represent the subject and the predicate of the second set of basic components, respectively, and the above-mentioned form the second data graph comprises assigning, for each of the one or more second modifiers corresponding to one of the subject and the predicate, the second modifier as an attribute to the corresponding one of the subject node and the edge.
[0026] According to a third aspect of the present invention, there is provided a computer program product, embodied in one or more non-transitory computer-readable storage mediums, comprising instructions executable by at least one processor to perform a method of generating a structured knowledge data for a text comprising at least one sentence, the method comprising:
extracting a first set of basic components and one or more first modifiers associated with the first set of basic components from a first sentence of the text, the first set of basic components comprising a subject and a predicate associated with the subject; and
forming a first data graph for the first sentence based on the first set of basic components and the one or more first modifiers associated with the first set of basic components, the first data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
the subject node and the edge of the first data graph are configured to represent the subject and the predicate of the first set of basic components, respectively, and
the above-mentioned forming the first data graph comprises assigning, for each of the one or more first modifiers corresponding to one of the subject node and the edge, the first modifier as an attribute to the corresponding one of the subject node and the edge.
BRIEF DESCRIPTION OF THE DRAWINGS
[0027] Embodiments of the present invention will be better understood and readily apparent to one of ordinary skill in the art from the following written description, by way of example only, and in conjunction with the drawings, in which:
FIG. 1 depicts a flow diagram illustrating a method of generating a structured knowledge data for a text, according to various embodiments of the present invention;
FIG. 2 depicts a schematic block diagram of a system for generating a structured knowledge data for a text, according to various embodiments of the present invention;
FIG. 3 depicts an example computer system which the system as described with respect to FIG. 2 may be embodied in, by way of an example only;
FIG. 4A depicts an example sentence including a subject and a predicate, according to various example embodiments of the present invention;
FIGs. 4B and 4C depict two example sentences, each including a subject, a predicate and an object, according to various example embodiments of the present invention;
FIG. 5 depicts a schematic drawing of an example system for generating a structured knowledge data for a text, according to various example embodiments of the present invention;
FIG. 6 depicts a sample of constituent parsing result (constituent parsing tree) for an example sentence, according to various example embodiments of the present invention;
FIGs. 7A to 7C depict schematic drawings of three data graphs (or graph representations) formed for three example sets of basic components, according to various example embodiments of the present invention;
FIG. 8 depicts an example graphical user interface (GUI) generated by a computer processor for interaction with a user in an example implementation according to various example embodiments of the present invention; and
FIGs. 9A to 9D depict a schematic drawing of a structured knowledge data (semantic knowledge graph) formed for an example text, according to various example embodiments of the present invention.
DETAILED DESCRIPTION
[0028] Various embodiments of the present invention provide a method of generating a structured knowledge data for a text (text data) comprising at least one sentence, and a system thereof, and more particularly, for an unstructured text (e.g., free text).
[0029] As described in the background, various conventional knowledge representation techniques for a text do not provide a full or sufficient knowledge representation of the original meaning of a text (e.g., sentences in an unstructured text), resulting in the loss of knowledge (or meaning) of various sentences in the text, which may be important or useful information relating to various basic components (e.g., subject, predicate and/or object) in such sentences. Accordingly, various embodiments of the present invention provide a method of generating a structured knowledge data for a text, and a system thereof, that seek to overcome, or at least ameliorate, one or more of the deficiencies in conventional knowledge representation techniques for a text, such as but not limited to, improving or enhancing knowledge representation of a text.
[0030] FIG. 1 depicts a flow diagram illustrating a method 100 of generating a structured knowledge data for a text (e.g., for an input text data) comprising at least one sentence, using at least one processor, according to various embodiments of the present invention. The method comprises extracting (at 102) a first set of basic components and one or more first modifiers associated with the first set of basic components from a first sentence of the text, the first set of basic components comprising a subject and a predicate associated with the subject; and forming (at 104) a first data graph for the first sentence based on the first set of basic components and the one or more first modifiers associated with the first set of basic components, the first data graph comprising a subject node, an object node and an edge connecting the subject node and the object node. In this regard, the subject node and the edge of the first data graph are configured to represent the subject and the predicate of the first set of basic components, respectively. In particular, the above-mentioned forming the first data graph comprises assigning, for each of the one or more first modifiers corresponding to one of the subject and the predicate, the first modifier as an attribute to the corresponding one of the subject node and the edge.
[0031] It will be appreciated by a person skilled in the art that a sentence may include four types of components (or elements), namely, a subject, a predicate, an object and a modifier, whereby the modifier is an optional element in the sense that removal of the modifier from the sentence would generally not affect the grammar of the sentence. In various embodiments, basic components (or basic elements) of a sentence refer to (or are obtained/derived from) components of the sentence which are non-optional, or in other words, selected from a group consisting of subject, predicate and object (which may be referred to herein as a“triple”), or excluding/without the modifiers.
[0032] In various embodiments, the structured knowledge data comprises one or more data graphs (e.g., including the above-mentioned first data graph). In various embodiments, a data graph is a type of data structure, whereby data is organized or represented in a network of data (or data network) comprising one or more nodes and one or more edges. Accordingly, such a data network may be referred to as a data graph (or graph representation). In various embodiments, a data graph comprises a subject node, an object node and an edge connecting the subject node and the object node, whereby the subject node and the edge of the data graph are configured to represent the subject and the predicate of a set of basic components. In various embodiments, if the set of basic components further includes an object, the object node may then be configured to represent the object. In various embodiments, configuring a node or an edge of a data graph to represent a basic component may include configuring such a node or such an edge to include the basic component (e.g., include or be defined by data corresponding to the basic component).
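By way of illustration only, the data graph described above may be sketched as a small Python data structure. The class and field names (Node, Edge, DataGraph, attributes) are assumptions introduced for this sketch and are not prescribed by the present description; the example values are taken from the example sentence in the background.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A subject node or an object node (hypothetical name)."""
    text: Optional[str]  # None models an object node with no object to represent
    attributes: List[str] = field(default_factory=list)  # modifiers assigned to the node


@dataclass
class Edge:
    """The edge connecting subject node and object node; it carries the predicate."""
    predicate: str
    attributes: List[str] = field(default_factory=list)  # modifiers assigned to the edge


@dataclass
class DataGraph:
    """One data graph for one set of basic components."""
    subject: Node
    edge: Edge
    obj: Node


# Basic components of the example sentence, without any modifiers yet.
graph = DataGraph(
    subject=Node("staff"),
    edge=Edge("is eligible to take"),
    obj=Node("Childcare Leave"),
)

# A set of basic components with no object still yields an object node.
no_object = DataGraph(subject=Node("staff"), edge=Edge("is eligible"), obj=Node(None))
```

Keeping an attribute list on both nodes and on the edge is one possible design choice; it leaves room for modifiers to be attached to whichever basic component they modify.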
[0033] In various embodiments, the edge of a data graph connecting the subject node and the object node may be a link established between the subject node and the object node, such that, for example, when the edge is configured to represent a predicate indicating a semantic relationship between a subject represented by the subject node and an object represented by the object node, the relationship of the subject node and object node of the data graph is then defined by the edge.
[0034] In various embodiments, one data graph may be formed (or generated) for each set of basic components (e.g., for each triple) of a sentence. In various embodiments, a plurality of sets of basic components (e.g., corresponding to a plurality of triples) may be extracted from a sentence. In various embodiments, two or more data graphs may share one or more common nodes (e.g., common subject node and/or common object node). In various embodiments, the two or more data graphs sharing one or more common nodes may be derived from the same sentence in the text. In various embodiments, the two or more data graphs sharing one or more common nodes may be derived from different sentences in the text. In various embodiments, a collection or set of data graphs (e.g., all data graphs) formed for a text may be collectively referred to herein as a data knowledge graph, a domain knowledge graph or a semantic knowledge graph. Accordingly, the structured knowledge data for a text may be a domain knowledge graph, including a collection of all data graphs formed with respect to the text. Each data graph or the structured knowledge data may be stored in a database in a memory and is accessible for various applications, such as for providing information in a search (e.g., question answering and question generation), for representing information (e.g., text summarization), and so on.
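As a minimal sketch of how a collection of data graphs might be held and queried, the following Python example indexes data graphs by subject. The KnowledgeGraph class, its methods and the second triple are hypothetical and introduced only for illustration; modifiers are omitted for brevity, and an in-memory dictionary stands in for the database mentioned above.

```python
from collections import defaultdict
from typing import DefaultDict, List, NamedTuple, Optional


class DataGraph(NamedTuple):
    """One data graph reduced to its basic components (modifiers omitted)."""
    subject: str
    predicate: str
    obj: Optional[str]


class KnowledgeGraph:
    """Hypothetical container for every data graph formed for a text."""

    def __init__(self) -> None:
        self.graphs: List[DataGraph] = []
        self._by_subject: DefaultDict[str, List[DataGraph]] = defaultdict(list)

    def add(self, graph: DataGraph) -> None:
        self.graphs.append(graph)
        self._by_subject[graph.subject].append(graph)

    def about(self, subject: str) -> List[DataGraph]:
        """Return all data graphs that share the given subject node."""
        return list(self._by_subject.get(subject, []))


kg = KnowledgeGraph()
kg.add(DataGraph("staff", "is eligible to take", "Childcare Leave"))
# Hypothetical second triple, added only to show two graphs sharing a subject.
kg.add(DataGraph("staff", "produce", "medical certificate"))

for g in kg.about("staff"):
    print(g.subject, "->", g.predicate, "->", g.obj)
```

A lookup of this kind is one simple way a question answering application could retrieve every fact recorded about a given subject.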
[0035] In various embodiments, assigning a modifier as an attribute (or property) to a corresponding node or edge may include linking (e.g., tagging or attaching) the corresponding node or edge with the modifier as an attribute (or property) thereof. In various embodiments, a corresponding node or edge of a modifier is the node or edge that the modifier modifies (e.g., adds or changes meaning to). Accordingly, in various embodiments, each first modifier that corresponds to the subject or the predicate is assigned to the corresponding subject node or edge as an attribute thereof. In addition, extracting a set of basic components and one or more modifiers associated with the set of basic components from a sentence may refer to extracting at least a set of basic components including at least a subject and a predicate associated with the subject, and one or more modifiers (e.g., all modifiers) which modify any one or more of the basic components in the set.
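The assignment of modifiers as attributes may be sketched as follows. This is a minimal illustration under stated assumptions: the function form_data_graph, the Element class and the (target, modifier) input format are hypothetical, and the pairing of each modifier with a basic component below is an illustrative reading of the example sentence rather than an assignment stated in this description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Element:
    """A node or an edge of the data graph together with its attributes."""
    text: Optional[str]
    attributes: List[str] = field(default_factory=list)


def form_data_graph(
    components: Dict[str, str],
    modifiers: List[Tuple[str, str]],
) -> Dict[str, Element]:
    """Build one data graph and attach each modifier to the element it modifies.

    `modifiers` holds (target, modifier) pairs, where target is one of
    "subject", "predicate" or "object" (an assumed input format).
    """
    graph = {
        "subject": Element(components["subject"]),      # subject node
        "predicate": Element(components["predicate"]),  # edge
        "object": Element(components.get("object")),    # object node, may be empty
    }
    for target, modifier in modifiers:
        graph[target].attributes.append(modifier)
    return graph


graph = form_data_graph(
    {"subject": "staff", "predicate": "is eligible to take", "object": "Childcare Leave"},
    [
        # Illustrative pairing only; not taken from the description.
        ("subject", "with at least one Singapore Citizen child between 7 and 12 years old"),
        ("predicate", "per year"),
        ("predicate", "unconditionally"),
        ("object", "2 days of"),
    ],
)
```

With this arrangement the condition on the subject and the time information on the predicate stay attached to the triple instead of being discarded.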
[0036] Accordingly, by forming a data graph for each set of basic components that takes into account or includes the one or more modifiers associated with the set of basic components, various embodiments of the present invention provide a method of generating a structured knowledge data for a text, and a system thereof, that advantageously improves or enhances knowledge representation of a text, such that the original meaning of various sentences in the text is better captured (or represented) and not lost.
[0037] In various embodiments, a set of basic components may only include a subject and a predicate (e.g., in the case where the sentence does not have an object), or may only include a triple, namely a subject, a predicate and an object (e.g., in the case where the sentence has the triple).
[0038] Accordingly, in various embodiments, the above-mentioned first set of basic components including a subject and a predicate may further include an object. In this regard, the predicate indicates a semantic relationship between the subject and the object, and the object node is configured to represent the object. In addition, the above-mentioned forming the first data graph further comprises assigning, for each of the one or more first modifiers corresponding to the object, the first modifier as an attribute to the object node. In other words, each first modifier that corresponds to the object (e.g., that modifies the object, such as changes or adds meaning to the object) is assigned to the corresponding object node as an attribute thereof.
[0039] In various embodiments, the above-mentioned extracting the first set of basic components and the one or more first modifiers comprises analyzing the text to identify constituents of the first sentence; and chunking the identified constituents of the first sentence to produce a plurality of chunk components. In this regard, the above-mentioned first set of basic components and the one or more first modifiers are extracted from the plurality of chunk components.
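The chunking step may be illustrated with a minimal sketch that operates on a hand-written constituent tree. Several assumptions apply: the shortened sentence "A staff takes Childcare Leave per year", its parse, the nested-tuple tree format and the rule of emitting each NP, PP or ADVP subtree as one chunk are all introduced here for illustration. In practice the tree would come from a constituency parser, and the chunking rule used by a given embodiment may differ.

```python
from typing import List, Tuple

# A phrase node is (label, child, child, ...); a leaf is (pos_tag, word).
TREE = (
    "S",
    ("NP", ("DT", "A"), ("NN", "staff")),
    ("VP",
     ("VBZ", "takes"),
     ("NP", ("NNP", "Childcare"), ("NNP", "Leave")),
     ("PP", ("IN", "per"), ("NP", ("NN", "year")))),
)


def leaves(tree: tuple) -> List[str]:
    """Collect the words under a tree node, left to right."""
    if isinstance(tree[1], str):  # leaf: (pos_tag, word)
        return [tree[1]]
    words: List[str] = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


def chunk(tree: tuple, chunk_labels=("NP", "PP", "ADVP")) -> List[Tuple[str, str]]:
    """Emit the outermost phrases with a label in chunk_labels as chunks.

    Words outside such phrases (e.g., the verb) become single-word chunks.
    """
    label = tree[0]
    if isinstance(tree[1], str):
        return [(label, tree[1])]
    if label in chunk_labels:
        return [(label, " ".join(leaves(tree)))]
    chunks: List[Tuple[str, str]] = []
    for child in tree[1:]:
        chunks.extend(chunk(child, chunk_labels))
    return chunks


chunks = chunk(TREE)
# [('NP', 'A staff'), ('VBZ', 'takes'), ('NP', 'Childcare Leave'), ('PP', 'per year')]
```

From such chunk components, the subject, predicate and object would be selected as the set of basic components, while a chunk such as "per year" would be a candidate modifier.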
[0040] In various embodiments, the method 100 further comprises: identifying one or more of the plurality of chunk components as a named entity; and labelling each of the one or more chunk components identified with a corresponding named entity class label.
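As a rough sketch of labelling chunk components with named entity class labels, the following example uses a small lookup table. The table entries, the class labels (LEAVE_TYPE, LOCATION) and the function name are hypothetical; a practical system would more likely rely on a trained named entity recognizer, which this description does not tie to any particular implementation.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical gazetteer mapping lower-cased chunk text to a class label.
GAZETTEER: Dict[str, str] = {
    "childcare leave": "LEAVE_TYPE",
    "singapore": "LOCATION",
}


def label_chunks(chunks: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair every chunk with its named entity class label, or None if not an entity."""
    return [(chunk, GAZETTEER.get(chunk.lower())) for chunk in chunks]


labelled = label_chunks(["A staff", "Childcare Leave", "per year"])
# [('A staff', None), ('Childcare Leave', 'LEAVE_TYPE'), ('per year', None)]
```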
[0041] In various embodiments, the one or more first modifiers are one or more adjuncts of the first sentence. In other words, each modifier (e.g., first modifier) in the sentence is or refers to an adjunct in the sentence.
[0042] In various embodiments, the first data graph is a directed data graph. In this regard, the edge of the first data graph may be directed, and more specifically, directed from the subject node to the object node.
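The direction of the edge may be illustrated with a minimal adjacency-list sketch, in which an edge is stored only under its subject node. The DirectedGraph class and its method names are assumptions made for this example.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class DirectedGraph:
    """Stores each edge under its subject node, so it points subject -> object."""

    def __init__(self) -> None:
        self._out: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_edge(self, subject: str, predicate: str, obj: str) -> None:
        self._out[subject].append((predicate, obj))

    def outgoing(self, node: str) -> List[Tuple[str, str]]:
        """Return (predicate, object) pairs for edges leaving `node`."""
        return list(self._out.get(node, []))


g = DirectedGraph()
g.add_edge("staff", "is eligible to take", "Childcare Leave")

print(g.outgoing("staff"))            # edge leaves the subject node
print(g.outgoing("Childcare Leave"))  # no edge leaves the object node
```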
[0043] In various embodiments, the text is an unstructured text (e.g., free text). For example, an unstructured text may refer to a text that is not organized in a pre-defined manner (e.g., based on a pre-defined data model), such as a text in a natural language.
[0044] In various embodiments, one or more additional sets (which, in the context of such embodiments, may be referred to as one or more first additional sets) of basic components may be extracted from the first sentence. In this regard, the method 100 further comprises extracting one or more first additional sets of basic components from the first sentence; extracting, for each of the one or more first additional sets of basic components, one or more additional modifiers associated with the first additional set of basic components from the first sentence of the text, the first additional set of basic components comprising a subject and a predicate associated with the subject; and forming, for each of the one or more first additional sets of basic components, an additional data graph for the first sentence based on the first additional set of basic components and the one or more additional modifiers associated with the first additional set of basic components, the additional data graph comprising a subject node, an object node and an edge connecting the subject node and the object node. In this regard, the subject node and the edge of the additional data graph are configured to represent the subject and the predicate of the first additional set of basic components, respectively. In particular, the above-mentioned forming the additional data graph comprises assigning, for each of the one or more additional modifiers corresponding to one of the subject and the predicate, the additional modifier as an attribute to the corresponding one of the subject node and the edge. Accordingly, a plurality of data graphs may be formed for a sentence, and two or more of the plurality of data graphs may share one or more common nodes.
[0045] In various embodiments, the method 100 further comprises merging (or combining) the subject node of the first data graph and the subject node of the additional data graph of at least one of the one or more first additional sets of basic components from the first sentence as a common subject node (i.e., the first data graph and the additional data graph of the at least one of the one or more first additional sets of basic components share the common subject node) if the subject represented by the subject node of the first data graph and the subject represented by the subject node of the additional data graph of the at least one of the one or more first additional sets of basic components correspond to each other (i.e., belong to the same entity (subject)). In various embodiments, the object nodes of multiple data graphs may also be merged in the same or similar manner if they belong to the same entity (object).
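The merging of subject nodes may be sketched as below. The test for whether two subjects correspond to each other is an assumption of this example (a case-insensitive comparison after dropping a leading article); the description does not fix a particular matching rule, and the second triple is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    text: str
    attributes: List[str] = field(default_factory=list)


def canonical(text: str) -> str:
    """Assumed 'same entity' key: lower-cased text without a leading article."""
    words = text.lower().split()
    if words and words[0] in {"a", "an", "the"}:
        words = words[1:]
    return " ".join(words)


def merge_subjects(
    triples: List[Tuple[str, str, str]],
) -> Tuple[List[Tuple[Node, str, str]], Dict[str, Node]]:
    """Form one graph per triple, reusing a common subject node when subjects match."""
    nodes: Dict[str, Node] = {}
    graphs: List[Tuple[Node, str, str]] = []
    for subject, predicate, obj in triples:
        key = canonical(subject)
        if key not in nodes:
            nodes[key] = Node(subject)
        graphs.append((nodes[key], predicate, obj))
    return graphs, nodes


graphs, nodes = merge_subjects([
    ("A staff", "is eligible to take", "Childcare Leave"),
    ("the staff", "produce", "medical certificate"),  # hypothetical second triple
])
# Both graphs now refer to the same Node object as their subject node.
```

Because both graphs hold a reference to one Node object, an attribute assigned to the common subject node is visible from either graph. Object nodes could be merged in the same manner.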
[0046] In various embodiments, the text comprises a plurality of sentences and one or more additional sets (which, in the context of such embodiments, may be referred to as one or more second sets) of basic components may be extracted from additional sentence(s) in the plurality of sentences. In this regard, the method 100 further comprises, for each additional sentence of the plurality of sentences: extracting a second set of basic components and one or more second modifiers associated with the second set of basic components from the additional sentence, the second set of basic components comprising a subject and a predicate associated with the subject; and forming a second data graph for the additional sentence based on the second set of basic components and the one or more second modifiers associated with the second set of basic components, the second data
graph comprising a subject node, an object node and an edge connecting the subject node and the object node. In this regard, the subject node and the edge of the second data graph are configured to represent the subject and the predicate of the second set of basic components, respectively. In particular, the above-mentioned forming the second data graph comprises assigning, for each of the one or more second modifiers corresponding to one of the subject and the predicate, the second modifier as an attribute to the corresponding one of the subject node and the edge. Accordingly, one or more data graphs may be formed for each sentence, resulting in a plurality of data graphs for the plurality of sentences. For example, two or more of the plurality of data graphs may share one or more common nodes.
[0047] In various embodiments, one or more additional sets (which, in the context of such embodiments, may be referred to as one or more second additional sets) of basic components may be extracted from any additional sentence in the same or similar manner as described hereinbefore, such as in relation to the one or more first additional sets of basic components.
[0048] FIG. 2 depicts a schematic block diagram of a system 200 for generating a structured knowledge data for a text (e.g., for an input text data) comprising at least one sentence, according to various embodiments of the present invention, such as corresponding to the method 100 of generating a structured knowledge data as described hereinbefore according to various embodiments of the present invention. The system 200 comprises a memory 202, and at least one processor 204 communicatively coupled to the memory 202 and configured to: extract a first set of basic components and one or more first modifiers associated with the first set of basic components from a first sentence of the text, the first set of basic components comprising a subject and a predicate associated with the subject; and form a first data graph for the first sentence based on the first set of basic components and the one or more first modifiers associated with the first set of basic components, the first data graph comprising a subject node, an object node and an edge connecting the subject node and the object node. In this regard, the subject node and the edge of the first data graph are configured to represent the subject and the predicate of the first set of basic components, respectively. In particular, the above-mentioned forming the first data graph comprises assigning, for each of the one or more first modifiers
corresponding to one of the subject and the predicate, the first modifier as an attribute to the corresponding one of the subject node and the edge. It will be appreciated by a person skilled in the art that the system 200 may be embodied as a device or an apparatus.
[0049] It will be appreciated by a person skilled in the art that the at least one processor 204 may be configured to perform the required functions or operations through set(s) of instructions (e.g., software modules) executable by the at least one processor 204 to perform the required functions or operations. Accordingly, as shown in FIG. 2, the system 200 may further comprise a component extractor (or a component extracting module or circuit) 206 configured to perform the above-mentioned extracting (at 102) a first set of basic components and one or more first modifiers, and a data graph generator (or a data graph generating module or circuit) 208 configured to perform the above-mentioned forming (at 104) a first data graph for the first sentence.
[0050] It will be appreciated by a person skilled in the art that the above-mentioned modules are not necessarily separate modules, and one or more modules may be realized by or implemented as one functional module (e.g., a circuit or a software program) as desired or as appropriate without deviating from the scope of the present invention. For example, the component extractor 206 and the data graph generator 208 may be realized (e.g., compiled together) as one executable software program (e.g., software application or simply referred to as an “app”), which for example may be stored in the memory 202 and executable by the at least one processor 204 to perform the functions/operations as described herein according to various embodiments.
[0051] In various embodiments, the system 200 corresponds to the method 100 as described hereinbefore with reference to FIG. 1; therefore, various functions or operations configured to be performed by the at least one processor 204 may correspond to various steps of the method 100 described hereinbefore according to various embodiments, and thus need not be repeated with respect to the system 200 for clarity and conciseness. In other words, various embodiments described herein in context of the methods are analogously valid for the respective systems (e.g., which may also be embodied as devices), and vice versa.
[0052] For example, in various embodiments, the memory 202 may have stored therein the component extractor 206 and/or the data graph generator 208, which
respectively correspond to various steps of the method 100 as described hereinbefore according to various embodiments, which are executable by the at least one processor 204 to perform the corresponding functions/operations as described herein.
[0053] A computing system, a controller, a microcontroller or any other system providing a processing capability may be provided according to various embodiments in the present disclosure. Such a system may be taken to include one or more processors and one or more computer-readable storage mediums. For example, the system 200 described hereinbefore may include a processor (or controller) 204 and a computer-readable storage medium (or memory) 202 which are for example used in various processing carried out therein as described herein. A memory or computer-readable storage medium used in various embodiments may be a volatile memory, for example a DRAM (Dynamic Random Access Memory) or a non-volatile memory, for example a PROM (Programmable Read Only Memory), an EPROM (Erasable PROM), EEPROM (Electrically Erasable PROM), or a flash memory, e.g., a floating gate memory, a charge trapping memory, an MRAM (Magnetoresistive Random Access Memory) or a PCRAM (Phase Change Random Access Memory).
[0054] In various embodiments, a “circuit” may be understood as any kind of a logic implementing entity, which may be special purpose circuitry or a processor executing software stored in a memory, firmware, or any combination thereof. Thus, in an embodiment, a “circuit” may be a hard-wired logic circuit or a programmable logic circuit such as a programmable processor, e.g., a microprocessor (e.g., a Complex Instruction Set Computer (CISC) processor or a Reduced Instruction Set Computer (RISC) processor). A “circuit” may also be a processor executing software, e.g., any kind of computer program, e.g., a computer program using a virtual machine code, e.g., Java. Any other kind of implementation of the respective functions which will be described in more detail below may also be understood as a “circuit” in accordance with various alternative embodiments. Similarly, a “module” may be a portion of a system according to various embodiments in the present invention and may encompass a “circuit” as above, or may be understood to be any kind of a logic-implementing entity therefrom.
[0055] Some portions of the present disclosure are explicitly or implicitly presented in terms of algorithms and functional or symbolic representations of operations on data within a computer memory. These algorithmic descriptions and functional or symbolic representations are the means used by those skilled in the data processing arts to convey most effectively the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities, such as electrical, magnetic or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated.
[0056] Unless specifically stated otherwise, and as apparent from the following, it will be appreciated that throughout the present specification, discussions utilizing terms such as “extracting”, “forming”, “generating”, “analyzing”, “chunking”, “identifying”, “labelling”, “linking”, “configuring”, “processing”, “performing” or the like, refer to the actions and processes of a computer system, or similar electronic device, that manipulates and transforms data represented as physical quantities within the computer system into other data similarly represented as physical quantities within the computer system or other information storage, transmission or display devices.
[0057] The present specification also discloses a system (e.g., which may also be embodied as a device or an apparatus) for performing the operations/functions of the methods described herein. Such a system may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose machines may be used with computer programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate.
[0058] In addition, the present specification also at least implicitly discloses a computer program or software/functional module, in that it would be apparent to the person skilled in the art that the individual steps of the methods described herein may be put into effect by computer code. The computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated
that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein. Moreover, the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the spirit or scope of the invention. It will be appreciated by a person skilled in the art that various modules described herein (e.g., the component extractor 206 and/or the data graph generator 208) may be software module(s) realized by computer program(s) or set(s) of instructions executable by a computer processor to perform the required functions, or may be hardware module(s) being functional hardware unit(s) designed to perform the required functions. It will also be appreciated that a combination of hardware and software modules may be implemented.
[0059] Furthermore, one or more of the steps of a computer program/module or method described herein may be performed in parallel rather than sequentially. Such a computer program may be stored on any computer readable medium. The computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer. The computer program when loaded and executed on such a general-purpose computer effectively results in an apparatus that implements the steps of the methods described herein.
[0060] In various embodiments, there is provided a computer program product, embodied in one or more computer-readable storage mediums (non-transitory computer-readable storage medium), comprising instructions (e.g., the component extractor 206 and/or the data graph generator 208) executable by one or more computer processors to perform a method 100 of generating a structured knowledge data as described hereinbefore with reference to FIG. 1. Accordingly, various computer programs or modules described herein may be stored in a computer program product receivable by a system therein, such as the system 200 as shown in FIG. 2, for execution by at least one processor 204 of the system 200 to perform the required or desired functions.
[0061] The software or functional modules described herein may also be implemented as hardware modules. More particularly, in the hardware sense, a module is a functional hardware unit designed for use with other components or modules. For
example, a module may be implemented using discrete electronic components, or it can form a portion of an entire electronic circuit such as an Application Specific Integrated Circuit (ASIC). Numerous other possibilities exist. Those skilled in the art will appreciate that the software or functional module(s) described herein can also be implemented as a combination of hardware and software modules.
[0062] In various embodiments, the system 200 may be realized by any computer system (e.g., portable or desktop computer system, such as tablet computers, laptop computers, mobile communications devices (e.g., smart phones), and so on) including at least one processor and a memory, such as a computer system 300 as schematically shown in FIG. 3 as an example only and without limitation. Various methods/steps or functional modules (e.g., the component extractor 206 and/or the data graph generator 208) may be implemented as software, such as a computer program being executed within the computer system 300, and instructing the computer system 300 (in particular, one or more processors therein) to conduct the methods/functions of various embodiments described herein. The computer system 300 may comprise a computer module 302, input modules, such as a keyboard 304 and a mouse 306, and a plurality of output devices such as a display 308, and a printer 310. The computer module 302 may be connected to a computer network 312 via a suitable transceiver device 314, to enable access to e.g., the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN). The computer module 302 in the example may include a processor 318 for executing various instructions, a Random Access Memory (RAM) 320 and a Read Only Memory (ROM) 322. The computer module 302 may also include a number of Input/Output (I/O) interfaces, for example I/O interface 324 to the display 308, and I/O interface 326 to the keyboard 304. The components of the computer module 302 typically communicate via an interconnected bus 328 and in a manner known to the person skilled in the relevant art.
[0063] It will be appreciated by a person skilled in the art that the terminology used herein is for the purpose of describing various embodiments only and is not intended to be limiting of the present invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising,"
when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
[0064] In order that the present invention may be readily understood and put into practical effect, various example embodiments of the present invention will be described hereinafter by way of examples only and not limitations. It will be appreciated by a person skilled in the art that the present invention may, however, be embodied in various different forms or configurations and should not be construed as limited to the example embodiments set forth hereinafter. Rather, these example embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the present invention to those skilled in the art.
[0065] Various example embodiments relate to text analytics and provide a method and a product (e.g., a computer program product, embodied in one or more computer-readable storage mediums (non-transitory computer-readable storage medium)) for processing free text (unstructured text data) from an unstructured text document (e.g., an unstructured text input) to generate a structured knowledge data. For example, various example embodiments seek to represent a free text document in a unified semantic model which is easy for a computer to understand and process. For example, various example embodiments provide a complete knowledge representation model for a free text document, which leverages state-of-the-art natural language processing and semantic knowledge graph methods. For example, knowledge information can be dynamically acquired from a free text document, an internet web page and so on, and automatically organized in a structured knowledge data (e.g., a data knowledge graph, which may be simply referred to herein as “knowledge graph”) for easy and robust information query. Various application areas include, but are not limited to, open information extraction, question answering, document summarization and question generation. For example, the structured knowledge data generated according to various example embodiments has been found to be especially efficient and useful for question answering over unstructured text (free text) documents, such as internet web pages.
[0066] Various example embodiments provide a method for flexible knowledge representation and extraction which bridges a gap between open information extraction and semantic knowledge graph. For example, different from conventional open domain information extraction, the method (and corresponding system) according to various example embodiments seeks to capture (or represent) the meaning (e.g., sufficiently or fully) of the original text document. In various example embodiments, the structured knowledge data generated may be stored in a database (e.g., a graph database system) for better scalability and management.
[0067] Accordingly, various example embodiments provide a unified and scalable knowledge representation method or model which makes use of state-of-the-art natural language processing and semantic knowledge graph methods. The knowledge representation method according to various example embodiments seeks to fully (or at least sufficiently or as much as possible or desired) represent the meaning of the original text document by integrating the knowledge information therein into a structured knowledge data and storing the structured knowledge data in a database for better scalability and data management. In various example embodiments, the structured knowledge data is in the form of a data knowledge graph (e.g., semantic graph network) that is organized in a way that the original syntactical information is retained as much as possible. In this regard, the data knowledge graph is advantageously query oriented and capable of knowledge inference. For example, different from various conventional open knowledge extraction systems (or various conventional knowledge representation techniques), the knowledge representation method according to various example embodiments is able to represent all the modifiers (adjuncts) of a sentence in addition to the basic components (main meaning components), namely, subject, predicate and object. Furthermore, the knowledge data are fully integrated in a data knowledge graph and stored in a graph database that is accessible for various applications, such as question answering.
[0068] For better understanding, the basics of a sentence structure will now be described by way of examples only and without limitations. In general, a sentence is a group of words that are put together to mean something. For example, a sentence is the basic unit of language which expresses a complete thought. For example, the complete
thought is about someone or something (subject) and what the subject is about or doing (predicate). By way of an example of a sentence composition including a subject and a predicate, “Bird chirps” is composed of a subject (“Bird”) and a predicate (“chirps”), as shown in FIG. 4A. FIGs. 4B and 4C show examples of sentences including a subject, a predicate and an object, namely, “Sam likes chocolate” and “John sent me a present”.
[0069] Besides the subject, predicate and object, there may be one or more modifiers (adjuncts) in a sentence to describe other words, for instance, adjective modifiers and adverb modifiers. Adjective modifiers describe nouns and pronouns (e.g., answering Which one? What kind? How many? Whose?). Adverb modifiers describe verbs, adjectives and other adverbs (e.g., answering How? When? Where? Why? To what extent?). Grammatically, these adjuncts are optional and structurally dispensable. However, they give additional information (knowledge information) about sentence functionaries (subject, predicate, object, etc.) and make the sentence meaning more complete and accurate. These adjuncts may each be a word, a phrase (e.g., noun phrase, adverb phrase and prepositional phrase) or a clause. For example, prepositional phrases as adjuncts usually describe when or where something happens (e.g., referring to a time or a place). The following are some non-limiting examples:
• David completed his task quickly (adverb)
• David completed his task so quickly (adverb phrase)
• Last week we went to the library (noun phrase)
• The blue balloon burst (adjective)
• The teacher comes into the classroom (prepositional phrase)
[0070] In this regard, various example embodiments of the present invention seek to completely represent a full sentence by taking into account (e.g., capturing or representing) both basic components of a sentence (sentence fundamental elements, namely, subject, predicate and object) and other various modifiers (adjuncts).
[0071] In contrast, as described in the background, various conventional knowledge representation techniques only extract triples (i.e., subject, predicate and object) from sentences. Accordingly, such conventional knowledge representation techniques only take into account the basic components of the sentence (namely, triple), but ignore or disregard other sentence components, such as modifiers (adjuncts). Therefore, important
or useful information relating to various basic components in a sentence is not captured (or represented), resulting in an incomplete or insufficient knowledge representation of the original meaning of the sentence. In other words, the original meaning of the sentence may be lost.
[0072] FIG. 5 depicts a schematic drawing of an example system 500 for generating a structured knowledge data for a text according to various example embodiments of the present invention. The system 500 includes an analysis module 504 configured to analyze a sentence of a text (e.g., an input sentence) for identifying constituents of the sentence; a chunking module 508 configured to perform chunking (sentence chunking) on the identified constituents of the sentence to produce a plurality of chunk components (which may also be referred to herein simply as “chunks”); an extraction module 512 configured to extract a set of basic components and one or more modifiers (adjuncts) associated with the set of basic components from the plurality of chunk components; and a graph creation module 516 configured to form a data graph for the sentence based on the set of basic components and the one or more modifiers extracted.
[0073] In relation to the analysis module 504, for example, the sentence may be analyzed by being parsed into its constituents according to various parsing techniques known in the art, such as but not limited to, Top-Down parsing or Bottom-Up parsing, and thus need not be described in detail herein for clarity and conciseness. By way of an example and without limitation, given a sentence “The blue birds in the trees above my house were chirping endlessly throughout the whole night last month”, the analysis module 504 may output a constituent parsing tree as shown in FIG. 6, whereby the tags (e.g., as shown in FIG. 6) are Penn Treebank II Tags, namely, “S” denotes simple declarative clause, “NP” denotes noun phrase, “DT” denotes determiner, “JJ” denotes adjective, “PP” denotes prepositional phrase, “VP” denotes verb phrase, “VBD” denotes verb past tense, “NNS” denotes noun plural, “NN” denotes noun singular or mass, “IN” denotes preposition, “VBG” denotes verb gerund or present participle, “PRP$” denotes possessive pronoun, “ADVP” denotes adverb phrase, “RB” denotes adverb, and “TMP” denotes temporal.
[0074] The sentence may further be analyzed based on coreference resolution as known in the art, and thus need not be described in detail herein for clarity and
conciseness. In general, coreference resolution is a technique for finding all expressions or sentence components that refer to the same entity in a sentence or text. For example, coreference resolution may make every extracted triple refer to its actual entity, instead of using pronouns such as “he”, “she”, “him”, and so on. For instance, the extracted triples for two example sentences “Peter is an engineer. He likes programming” may be: (“Peter”, “is”, “an engineer”), (“He”, “likes”, “programming”). For example, without coreference resolution, when the data graphs formed based on the above two triples are saved into a graph database, the context between the above two example sentences no longer exists and hence it will not be known what “He” is referring to. On the other hand, with coreference resolution, the subject “He” in the latter example sentence above may be replaced by “Peter” and according to various example embodiments, the data graphs formed based on the above two triples may share the same subject node “Peter” in a graph database.
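By way of illustration only, the effect of coreference resolution on the two example triples above may be sketched in Python as follows. The function `resolve_subjects` and the hand-written `coreference` mapping are hypothetical; the mapping merely stands in for the output of a coreference resolver, which is not implemented here.

```python
def resolve_subjects(triples, coreference):
    """Replace each subject that has a known antecedent.

    `coreference` maps (sentence_index, mention) -> antecedent. In this
    sketch it is written by hand as a stand-in for a real resolver.
    """
    resolved = []
    for index, (subject, predicate, obj) in enumerate(triples):
        subject = coreference.get((index, subject), subject)
        resolved.append((subject, predicate, obj))
    return resolved


triples = [("Peter", "is", "an engineer"), ("He", "likes", "programming")]
coreference = {(1, "He"): "Peter"}  # assumed resolver output
resolved = resolve_subjects(triples, coreference)

# Group the edges by subject: triples with the same subject share one node.
graph = {}
for subject, predicate, obj in resolved:
    graph.setdefault(subject, []).append((predicate, obj))
```

With the pronoun replaced, both triples are attached to the single subject key `"Peter"`, which mirrors the shared subject node described above.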
[0075] In various example embodiments, in order to extract the subject, predicate, object and all the modifiers (adjuncts), first the sentence is correctly split (e.g., divided or segmented into a plurality of sentence components), and then grammatically identified for every sentence component. In this regard, the chunking module 508 is provided to perform this function or operation according to various example embodiments. By way of an example only and without limitation, given a constituent parsing tree for an example sentence “A staff with at least one Singapore Citizen child between 7 and 12 years old is eligible to take 2 days of Childcare Leave per year unconditionally, without the need to produce a medical certificate”, the chunking module 508 may produce an output “A staff/NP with/IN at least/ADVP one Singapore Citizen child/NP between 7/PP and/CC 12 years old/AD is/VB eligible to/ADJP take/VB 2 days/NP of Childcare/PP Leave/VB per year/PP unconditionally/AD ,/, without the need to/PP produce/VB a medical certificate/NP ./.”, whereby “NP”, “IN”, “ADVP” and “PP” are as defined hereinbefore, “CC” denotes coordinating conjunction, “AD” denotes adverb, “VB” denotes verb base form and “ADJP” denotes adjective phrase.
[0076] In various example embodiments, the chunking and extraction are based on dependency parsing and constituent parsing of the sentences along with a set of rules of grammar and/or syntax. An example constituent parsing tree output from the analysis
module 504 is shown in FIG. 6 as described hereinbefore, which embeds a range of useful information for chunking and component identification, such as subject, predicate, object and adjuncts (including the specific types of adjuncts). For example, chunking may be a kind of shallow parsing which adds more structure to a sentence after part-of-speech (POS) tagging, which for example may be implemented using regular expression rules based on POS tags. For example, since constituent parsing and dependency parsing are already carried out in the analysis module 504, the chunking may be deduced from these parsing results. For instance, the constituent parsing tree as shown in FIG. 6 includes some chunk information in the upper level tree nodes. For example, the words “The blue birds” share the same parent tree node “NP” which constitutes a chunk (chunk component) of a subject in the sentence. This chunk can be further confirmed by the dependency parsing result in which the words “The” and “blue” are the modifiers of “birds”. The triple extraction may be performed at the chunk level, which has fewer components compared to the original sentence. Various techniques or rules of grammar and/or syntax known in the art may be applied for the triple extraction, for instance, the subject may be a noun, a pronoun or a noun phrase that occurs before the verb in the sentence, a predicate may be a verb or a verb phrase and the object. In various example embodiments, the chunking module 508 may be configured to produce a plurality of chunk components using a deep neural network (DNN) model.
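By way of illustration only, chunking by regular expression rules over POS tags may be sketched in Python as follows. The chunk grammar (an optional determiner, any number of adjectives, then one or more nouns), the function `chunk_noun_phrases` and the shortened, hand-tagged example sentence are assumptions of this sketch and are not taken from the present specification.

```python
import re

# Assumed chunk grammar: an optional determiner, any adjectives, then nouns.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NNS?>)+")


def chunk_noun_phrases(tagged):
    """Return noun-phrase chunks from a list of (word, POS tag) pairs."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NNS>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Convert character offsets in the tag string back to token indices.
        start = tag_string[: match.start()].count("<")
        end = start + match.group().count("<")
        chunks.append(" ".join(word for word, _ in tagged[start:end]))
    return chunks


# Hand-tagged, shortened variant of the example sentence (Penn Treebank tags).
tagged = [
    ("The", "DT"), ("blue", "JJ"), ("birds", "NNS"), ("in", "IN"),
    ("the", "DT"), ("trees", "NNS"), ("were", "VBD"), ("chirping", "VBG"),
]
chunks = chunk_noun_phrases(tagged)
```

Here the words “The blue birds” fall into one noun-phrase chunk, consistent with the shared “NP” parent node discussed above. A chunker deduced from constituent and dependency parsing results, or a DNN-based chunker, would replace the single regular expression rule used in this sketch.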
[0077] In various example embodiments, the graph creation module 516 is configured to form a data graph for each set of basic components based on the set of basic components and the one or more modifiers associated with the set of basic components extracted by the extraction module 512. In various example embodiments, a data graph is formed for each set of basic components, whereby the set of basic components (e.g., subject, predicate and object) and their relationship are modelled as a data graph, whereby the subject and the object are represented as nodes (subject and object nodes) and the predicate is represented as the edge of the data graph. In various example embodiments, the data graph is a directed data graph. Moreover, the graph creation module 516 is configured to assign, for each of the one or more modifiers corresponding to one of the basic components, the modifier as an attribute (or property) to the corresponding one of the subject node, object node and edge. In other words, each
modifier that corresponds to a particular basic component (e.g., that modifies the particular basic component) is assigned to the node or edge representing that particular basic component as an attribute thereof.
[0078] In various example embodiments, the relationship (or association) between the modifiers and the triple elements may be deduced from the dependency parsing result. By way of an example only and without limitation, for an example sentence “The lovely dog successfully caught the white rat in the field last month”, the example sentence may be parsed to produce a dependency parsing result with the following relations:
('root', 'ROOT-0', 'caught-5'),
('det', 'dog-3', 'The-1'),
('amod', 'dog-3', 'lovely-2'),
('nsubj', 'caught-5', 'dog-3'),
('advmod', 'caught-5', 'successfully-4'),
('det', 'rat-8', 'the-6'),
('amod', 'rat-8', 'white-7'),
('dobj', 'caught-5', 'rat-8'),
('case', 'field-11', 'in-9'),
('det', 'field-11', 'the-10'),
('nmod:in', 'caught-5', 'field-11'),
('amod', 'month-13', 'last-12'),
('nmod:tmod', 'caught-5', 'month-13')
[0079] In the above example sentence, the modifiers of the predicate “caught”, for instance, may be determined from the above relations (namely, the modifier relations whose head is “caught-5”), which include “successfully-4”, “field-11” and “month-13”.
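By way of illustration only, determining the modifiers of a triple element from the dependency relations may be sketched in Python as follows. The relation labels are written as a Stanford-style parser would typically emit them (“det”, “nmod:in”). Treating the “amod”, “advmod” and “nmod” relations as the modifier relations is an assumption of this sketch, and the helper `modifiers_of` is hypothetical.

```python
relations = [
    ("root", "ROOT-0", "caught-5"),
    ("det", "dog-3", "The-1"),
    ("amod", "dog-3", "lovely-2"),
    ("nsubj", "caught-5", "dog-3"),
    ("advmod", "caught-5", "successfully-4"),
    ("det", "rat-8", "the-6"),
    ("amod", "rat-8", "white-7"),
    ("dobj", "caught-5", "rat-8"),
    ("case", "field-11", "in-9"),
    ("det", "field-11", "the-10"),
    ("nmod:in", "caught-5", "field-11"),
    ("amod", "month-13", "last-12"),
    ("nmod:tmod", "caught-5", "month-13"),
]

# Relations treated as modifier relations in this sketch (an assumption).
MODIFIER_PREFIXES = ("amod", "advmod", "nmod")


def modifiers_of(head, relations):
    """Return the dependents attached to `head` by a modifier relation."""
    return [
        dependent
        for relation, governor, dependent in relations
        if governor == head and relation.startswith(MODIFIER_PREFIXES)
    ]


predicate_modifiers = modifiers_of("caught-5", relations)
subject_modifiers = modifiers_of("dog-3", relations)
object_modifiers = modifiers_of("rat-8", relations)
```

The subject and object themselves are attached to the predicate by “nsubj” and “dobj”, which are not modifier relations, so they are excluded from `predicate_modifiers`.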
[0080] As described hereinbefore, a data graph may be formed for each set of basic components (each triple), and thus, multiple data graphs may be formed for multiple sets of basic components (multiple triples) extracted from a given text and an overall network of nodes and edges may thus be formed for a given text, such as illustrated in FIGs. 9A to 9D for an example text (described later below). In various example embodiments, the overall network of nodes and edges may be formed by processing the multiple data graphs formed for the multiple triples to share common nodes where appropriate, such as
by combining/merging all instances of the same nodes (e.g., representing the same entity) together (e.g., all subject nodes of different data graphs representing the same subject are configured as a common subject node). For example, when the multiple data graphs formed for the multiple triples (e.g., all triples) are populated into a graph database, all nodes which are identified as the same (e.g., represent the same entity) may be merged into one common node, resulting in the associated data graphs sharing the common node (which collectively may be referred to as a merged data graph). In various example embodiments, the input text may be preprocessed before being sent for analysis. For example, preprocessing may include text normalization (e.g., converting all letters into lower case, removing white spaces, and so on), tokenization, stemming, lemmatization, and so on. Such preprocessing helps the same nodes appearing in different forms to be present (exist) as one common node in the graph database.
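By way of illustration only, the role of text normalization in merging nodes may be sketched in Python as follows. The class `GraphStore` is a hypothetical in-memory stand-in for a graph database, and `normalize` performs only lower-casing and white-space clean-up; tokenization, stemming and lemmatization are omitted from this sketch.

```python
def normalize(label: str) -> str:
    """Lower-case the label and collapse surplus white space."""
    return " ".join(label.lower().split())


class GraphStore:
    """Hypothetical in-memory stand-in for a graph database."""

    def __init__(self):
        self.nodes = {}  # normalized label -> node record
        self.edges = []  # (subject key, predicate, object key)

    def get_or_create(self, label):
        key = normalize(label)
        # Labels with the same normalized form map to one common node.
        self.nodes.setdefault(key, {"label": label, "attributes": []})
        return key

    def add_triple(self, subject, predicate, obj):
        subject_key = self.get_or_create(subject)
        object_key = self.get_or_create(obj)
        self.edges.append((subject_key, predicate, object_key))


store = GraphStore()
store.add_triple("Peter", "is", "an engineer")
store.add_triple("peter ", "likes", "programming")
```

Although the subject is written differently in the two triples, both edges start from the same node key, so the two data graphs share one common subject node in the store.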
[0081] Therefore, according to various example embodiments, subject, predicate, object and various modifiers (adjuncts) with knowledge graph and related useful attributes are provided to describe (or represent) the sentence components (e.g., chunk components). For better understanding, example data graphs formed for example sentences according to various example embodiments will now be described below by way of examples only and without limitations.
[0082] In various example embodiments, the basic components (most meaningful elements, e.g., the subject, predicate, object) and their relationship are modelled as a directed graph whereby the subject and object are represented as the nodes of the directed graph and the predicate is represented as the edge of the directed graph. For instance, for an example sentence “Sam likes chocolate”, the data graph creation module 516 may be configured to form a data graph 704 as illustrated in FIG. 7A to represent the example sentence. An advantage of such a triple representation is that the basic components (main meaning components) of the sentence may then be easily queried with the existing query language of a graph database. The triple representation also makes reasoning possible, similar to a Resource Description Framework (RDF) triple store.
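As a rough illustration of why the triple form is easy to query, the sketch below stores the example sentence as a single directed edge and matches it against a pattern in which unspecified fields act as wildcards. The `Edge` type and `match` function are hypothetical stand-ins for a graph database and its query language, which the text does not name.

```python
from typing import List, NamedTuple, Optional


class Edge(NamedTuple):
    subject: str    # source node of the directed edge
    predicate: str  # edge label
    obj: str        # target node of the directed edge


def match(
    edges: List[Edge],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Edge]:
    """Return the edges that agree with every field given (None = wildcard)."""
    return [
        edge
        for edge in edges
        if (subject is None or edge.subject == subject)
        and (predicate is None or edge.predicate == predicate)
        and (obj is None or edge.obj == obj)
    ]


graph = [Edge("Sam", "likes", "chocolate")]

# "What does Sam like?" leaves the object unspecified.
print([edge.obj for edge in match(graph, subject="Sam", predicate="likes")])
# "Who likes chocolate?" leaves the subject unspecified.
print([edge.subject for edge in match(graph, predicate="likes", obj="chocolate")])
```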
[0083] For sentences having various modifiers (adjuncts), in various example embodiments, all the adjuncts are modelled as attributes (or properties) of the nodes or edge, depending on their role in the sentences or relationship to the nodes or edge. For instance, in the example sentence “The lovely dog successfully caught the white rat in the field last month”, the adjective “lovely”, adverb “successfully”, adjective “white”, prepositional phrase “in the field” and noun phrase “last month” are the adjuncts, which are modelled as attributes and are each assigned (e.g., attached) to the corresponding graph node or edge according to various example embodiments. The complete graph representation (data graph) with attributes for the example sentence is illustrated in FIG. 7B.
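One possible in-memory shape for such a data graph is sketched below. The `Node` and `DataGraph` classes and the attribute key names (`adjective`, `adverb`, and so on) are assumptions made for illustration; the actual layout of FIG. 7B may differ.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    label: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class DataGraph:
    subject: Node
    predicate: str  # label of the edge from subject to object
    obj: Node
    edge_attributes: Dict[str, str] = field(default_factory=dict)


# Each adjunct is attached to the node or edge it modifies.
graph = DataGraph(
    subject=Node("dog", {"adjective": "lovely"}),
    predicate="caught",
    obj=Node("rat", {"adjective": "white"}),
    edge_attributes={
        "adverb": "successfully",
        "prepositional_phrase": "in the field",
        "noun_phrase": "last month",
    },
)

print(graph.subject.attributes)
print(graph.edge_attributes)
```

Keeping adjuncts as attributes leaves the subject-predicate-object triple itself unchanged, so a triple query still works while the modifiers remain available.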
[0084] In various example embodiments, enhanced understanding of the chunk elements (chunk components) is carried out with state-of-the-art natural language processing techniques such as named entity recognition, coreference resolution, and so on. In this regard, for instance, in the example of FIG. 7B, the prepositional phrase “in the field” may be identified and tagged with a named entity class label (or tag) “location” while the noun phrase “last month” may be identified and tagged with a named entity class label (or tag) “time”. For example, such information is helpful when answering when- and where-type questions, such as “Where did the dog catch the rat?” and “When did the dog catch the rat?”.
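A minimal sketch of how named entity class labels could route a question word to the right attribute is given below. The mapping `QUESTION_TO_LABEL` and the `answer` helper are hypothetical; the described embodiments do not specify how questions are matched to labels.

```python
from typing import Dict, Optional

# Modifiers of the predicate "caught", each with an assumed entity class label.
tagged_modifiers: Dict[str, str] = {
    "in the field": "location",
    "last month": "time",
}

# Hypothetical mapping from a question word to the label that answers it.
QUESTION_TO_LABEL: Dict[str, str] = {"where": "location", "when": "time"}


def answer(question_word: str, modifiers: Dict[str, str]) -> Optional[str]:
    """Return the first modifier whose label fits the question word, if any."""
    wanted = QUESTION_TO_LABEL.get(question_word.lower())
    if wanted is None:
        return None
    for text, label in modifiers.items():
        if label == wanted:
            return text
    return None


print(answer("Where", tagged_modifiers))
print(answer("When", tagged_modifiers))
```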
[0085] It will be appreciated that sentences may have all basic components (subject, predicate and object) present or may have only part of the basic components present (subject and predicate, without object). For example, the data graphs shown in FIGs. 7A and 7B are formed for the above-mentioned example sentences having all basic components present. On the other hand, for sentences without an object, various example embodiments represent the missing object as a blank object node in the data graph formed. For example, for an example sentence “Jane sadly cry”, the data graph creation module 516 may be configured to form a data graph 712 having an empty object node as illustrated in FIG. 7C to represent the example sentence. In relation to this example sentence, if there is a query on “Who (sadly) cried?”, the correct answer “Jane” can still be retrieved with a standard graph query language. The named entity tag “person” on the subject makes the system more confident in answering the above “who” question.
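The blank object node can be sketched by letting the object field be empty while the subject and predicate remain queryable. In this illustration `None` stands for the blank node, predicates are stored in an assumed lemmatized form, and the `who` helper is a hypothetical stand-in for a graph query.

```python
from typing import List, NamedTuple, Optional


class DataGraph(NamedTuple):
    subject: str
    subject_tag: Optional[str]  # named entity class label of the subject
    predicate: str              # assumed to be stored as a lemma
    obj: Optional[str]          # None represents the blank object node


graphs = [
    DataGraph("Jane", "person", "cry", None),
    DataGraph("Sam", "person", "like", "chocolate"),
]


def who(predicate: str, data_graphs: List[DataGraph]) -> List[str]:
    """Answer a "who" question: subjects tagged "person" for the predicate."""
    return [
        graph.subject
        for graph in data_graphs
        if graph.predicate == predicate and graph.subject_tag == "person"
    ]


print(who("cry", graphs))
```

Because the edge still exists with a placeholder target, an object-less sentence is matched by the same query shape as a full triple.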
[0086] Accordingly, in various example embodiments, a method for producing semantic information from free text (or unstructured text) sources is provided. The method provides a complete knowledge representation model that gives a meaning representation to a free text document and extracts knowledge based on natural language processing and rules, so as to generate a query-oriented semantic graph for various applications, such as answering or addressing presented questions.
[0087] In various example embodiments, the method comprises steps of analysing the text to extract linguistic components or elements such as basic components (fundamental elements, namely subject, predicate and object) and various modifiers (adjuncts); dividing a sentence of free text into non-overlapping segments based on the extracted linguistic elements and semantic rules (in an example, a constituent parsing tree) (e.g., extracting triples based on the analysis and chunking results); representing semantics in the form of a graphical representation such as a network of nodes and edges, whereby the graphical representation comprises attributes of the nodes and/or edges, which are modeled from adjuncts; generating a structured knowledge data comprising a combination or network of the data graphs (including nodes and edges, along with associated attributes); and storing the structured knowledge data in a database to be retrievable for various applications, such as queries or questions.
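The final storing and retrieval step can be sketched as follows. Note the substitution: the text calls for a graph database but does not name one, so this illustration uses Python's built-in `sqlite3` module with one row per edge and the attributes serialized as JSON. The table and column names are assumptions.

```python
import json
import sqlite3

# In-memory relational store standing in for the graph database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edges (subject TEXT, predicate TEXT, object TEXT, attributes TEXT)"
)

# Store one data graph: the triple plus the attributes attached to its edge.
conn.execute(
    "INSERT INTO edges VALUES (?, ?, ?, ?)",
    (
        "dog",
        "caught",
        "rat",
        json.dumps({"location": "in the field", "time": "last month"}),
    ),
)

# Retrieve the object and edge attributes for a given subject and predicate.
row = conn.execute(
    "SELECT object, attributes FROM edges WHERE subject = ? AND predicate = ?",
    ("dog", "caught"),
).fetchone()
obj, attributes = row[0], json.loads(row[1])
print(obj, attributes["time"])
```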
[0088] In various example embodiments, text analysis by the analysis module 504 may be performed by various conventional natural language processing methods. In various example embodiments, sentences may be divided or segmented using text chunking, such as rule-based text chunking. In various example embodiments, linguistic elements (or sentence components) may be further recognized and tagged using named entity recognition. For example, named entity recognition may help to tag or label one or more modifiers with a corresponding named entity class tag or label (e.g., tagging a modifier “last month” with a “time” entity class tag as shown in FIG. 7B).
[0089] In various example embodiments, there is provided a computer program product, embodied in one or more computer-readable storage mediums (non-transitory computer-readable storage medium), comprising instructions (e.g., the component extractor 206 and/or the data graph generator 208) executable by one or more computer processors to perform a method of generating a structured knowledge data as described hereinbefore according to various embodiments. Accordingly, various computer programs or modules described herein may be stored in a computer program product receivable by a system therein, such as the system 200 as shown in FIG. 2, for execution by at least one processor 204 of the system 200 to perform the required or desired functions. In various example embodiments, the computer program product may further comprise instructions executable by one or more computer processors to generate a graphical user interface (GUI) for receiving various inputs (e.g., text data for which a data knowledge graph is to be generated and stored in a graph database, question(s), and so on) and providing various outputs (e.g., displaying an answer to a question inputted by a user).
[0090] Accordingly, in various example embodiments, there is provided a computer program product configured to generate or produce semantic information (e.g., providing an answer to a query) from free text (unstructured text) sources based on the structured knowledge data (data knowledge graph) generated using a method according to various embodiments of the present invention (which may be referred to hereinafter as “the present method”), such as described hereinbefore with reference to FIG. 1.
[0091] By way of an example only and without limitation, FIG. 8 depicts an example GUI 800 generated by a computer processor for interaction with a user in an example implementation according to an example embodiment of the present invention. In the example embodiment, a publicly available graph database may be used as a backend database for better scalability and handling of a possibly large number of triple relations. For example, the GUI 800 may directly extract knowledge information from a web page, a text file or text manually input by a user. After the input text data has been analysed and processed according to the present method of generating a structured knowledge data, a structured knowledge data (e.g., semantic knowledge graph) for the specific input text data may be automatically generated and stored into a graph database. For example, as illustrated in FIG. 8, a web link may be provided via the GUI 800 to the system for knowledge information extraction, as well as question generating and answering.
[0092] By way of an illustrative example only and without limitation, a web link (https://www.straitstimes.com/asia/se-asia/president-tony-tan-pays-respects-to-late-thai-king) was inputted via the GUI 800 for an example news article entitled “President Tony Tan pays respects to late Thai King Bhumibol Adulyadej”, which was published by The Straits Times on 25 Oct 2016. The content (text data) of the news article is reproduced below for ease of reference:
Singapore President Tony Tan Keng Yam paid his respects to late Thai King Bhumibol Adulyadej on Monday (Oct 24) in Bangkok's Grand Palace.
President Tan, who was accompanied by his wife and officials from Singapore's Foreign Affairs Ministry, laid a wreath by the late monarch's royal urn and signed a condolence book early on Monday afternoon.
His visit comes three days after Singapore Prime Minister Lee Hsien Loong travelled to Bangkok to do the same.
In a Facebook post published that day, Dr Tan said he was "deeply saddened" by the death of King Bhumibol, who was "well-loved by the Thai people because of his compassion and concern for his people".
He added that the late king was a close friend of Singapore, and under the king's reign, ties between the two countries were strengthened.
"King Bhumibol will be dearly missed by all. His legacy will remain an inspiration for generations to come. Our thoughts and prayers are with the Thai Royal family and the Thai people," said Dr Tan.
Since Mr Lee's visit, a stream of regional leaders have travelled to the Thai capital to pay their respects. They include Malaysian Prime Minister Najib Razak, his Cambodian counterpart Hun Sen, and China's vice-president, Mr Li Yuanchao. Indonesian President Joko Widodo is also expected to follow suit this week.
King Bhumibol died on Oct 13 at the age of 88 in Bangkok's Siriraj Hospital after a long illness. The revered monarch, who reigned for 70 years, is a unifying figure in a country of 68 million now afflicted by a deep political divide.
Former Privy Council president Prem Tinsulanonda, 96, is now standing in as regent after Crown Prince Maha Vajiralongkorn asked for time to mourn with the people before his ascension to the throne.
In a Facebook post on Oct 15, Dr Tan said King Bhumibol "was an outstanding King who had dedicated his life to the welfare of the Thai people".
"He was steadfast in launching projects which had impactful benefits to all corners of the Kingdom of Thailand," he wrote.
"History will remember King Bhumibol as a great monarch, and a unifying force deeply loved and respected by the Thai people and the rest of the world."
[0093] For the above example news article, a semantic knowledge graph (corresponding to a “structured knowledge data” as mentioned hereinbefore according to various embodiments) was created using the present method. FIGs. 9A to 9D depict a schematic drawing/illustration of the semantic knowledge graph 900 generated, including a plurality of merged data graphs. It will be appreciated that FIGs. 9C and 9D are partial views of a merged data graph and may be joined at corresponding sides to show the complete merged data graph. As can be seen from FIGs. 9A to 9D, subject nodes of data graphs identified to be the same (e.g., represent the same entity) may be merged into one common subject node so that such data graphs share the common subject node. Similarly, object nodes of data graphs identified to be the same (e.g., represent the same entity) may be merged into one common object node so that such data graphs share the common object node. Accordingly, data graphs sharing one or more common nodes (e.g., object node and/or subject node) may collectively be referred to as a merged data graph. In this regard, after the semantic knowledge graph 900 has been created, for example, questions may then be immediately generated about the content of the web page as shown in the output text box of FIG. 8. Furthermore, based on experiments conducted on online web knowledge extraction and question answering, the present method (and the corresponding system) of generating a structured knowledge data has been found to be efficient and effective in answering various questions on the extracted web content. Accordingly, by forming a data graph for each set of basic components that takes into account or includes the one or more modifiers associated with the set of basic components, the present method (and the corresponding system) advantageously improves or enhances knowledge representation of a text, such that the original meaning of various sentences in the text is better captured (or represented) and not lost.
[0094] While embodiments of the invention have been particularly shown and described with reference to specific embodiments, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is thus indicated by the appended claims and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced.
Claims
1. A method of generating a structured knowledge data for a text comprising at least one sentence, using at least one processor, the method comprising:
extracting a first set of basic components and one or more first modifiers associated with the first set of basic components from a first sentence of the text, the first set of basic components comprising a subject and a predicate associated with the subject; and
forming a first data graph for the first sentence based on the first set of basic components and the one or more first modifiers associated with the first set of basic components, the first data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
the subject node and the edge of the first data graph are configured to represent the subject and the predicate of the first set of basic components, respectively, and
said forming the first data graph comprises assigning, for each of the one or more first modifiers corresponding to one of the subject and the predicate, the first modifier as an attribute to the corresponding one of the subject node and the edge.
2. The method according to claim 1, wherein
the first set of basic components further comprises an object, the predicate indicates a semantic relationship between the subject and the object,
the object node is configured to represent the object, and
said forming the first data graph further comprises assigning, for each of the one or more first modifiers corresponding to the object, the first modifier as an attribute to the object node.
3. The method according to claim 1, wherein said extracting the first set of basic components and the one or more first modifiers comprises:
analyzing the text to identify constituents of the first sentence; and
chunking the identified constituents of the first sentence to produce a plurality of chunk components,
wherein the first set of basic components and the one or more first modifiers are extracted from the plurality of chunk components.
4. The method according to claim 3, further comprising:
identifying one or more of the plurality of chunk components as a named entity; and
labelling each of the one or more chunk components identified with a corresponding named entity class label.
5. The method according to claim 1, wherein the one or more first modifiers are one or more adjuncts of the first sentence.
6. The method according to claim 1, wherein the first data graph is a directed data graph.
7. The method according to claim 1, wherein the text is an unstructured text.
8. The method according to claim 1, further comprising:
extracting one or more first additional sets of basic components from the first sentence;
extracting, for each of the one or more first additional sets of basic components, one or more additional modifiers associated with the first additional set of basic components from the first sentence of the text, the first additional set of basic components comprising a subject and a predicate associated with the subject; and
forming, for each of the one or more first additional sets of basic components, an additional data graph for the first sentence based on the first additional set of basic components and the one or more additional modifiers associated with the first additional set of basic components, the additional data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
the subject node and the edge of the additional data graph are configured to represent the subject and the predicate of the first additional set of basic components, respectively, and
said forming the additional data graph comprises assigning, for each of the one or more additional modifiers corresponding to one of the subject and the predicate, the additional modifier as an attribute to the corresponding one of the subject node and the edge.
9. The method according to claim 8, further comprising merging the subject node of the first data graph and the subject node of the additional data graph of at least one of the one or more first additional sets of basic components from the first sentence as a common subject node if the subject represented by the subject node of the first data graph and the subject represented by the subject node of the additional data graph of the at least one of the one or more first additional sets of basic components correspond to each other.
10. The method according to claim 1, wherein the text comprises a plurality of sentences and the method further comprises, for each additional sentence of the plurality of sentences:
extracting a second set of basic components and one or more second modifiers associated with the second set of basic components from the additional sentence, the second set of basic components comprising a subject and a predicate associated with the subject; and
forming a second data graph for the additional sentence based on the second set of basic components and the one or more second modifiers associated with the second set of basic components, the second data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
the subject node and the edge of the second data graph are configured to represent the subject and the predicate of the second set of basic components, respectively, and
said forming the second data graph comprises assigning, for each of the one or more second modifiers corresponding to one of the subject and the predicate, the second modifier as an attribute to the corresponding one of the subject node and the edge.
11. A system for generating a structured knowledge data for a text comprising at least one sentence, the system comprising:
a memory; and
at least one processor communicatively coupled to the memory and configured to:
extract a first set of basic components and one or more first modifiers associated with the first set of basic components from a first sentence of the text, the first set of basic components comprising a subject and a predicate associated with the subject; and
form a first data graph for the first sentence based on the first set of basic components and the one or more first modifiers associated with the first set of basic components, the first data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
the subject node and the edge of the first data graph are configured to represent the subject and the predicate of the first set of basic components, respectively, and
said form the first data graph comprises assigning, for each of the one or more first modifiers corresponding to one of the subject and the predicate, the first modifier as an attribute to the corresponding one of the subject node and the edge.
12. The system according to claim 11, wherein
the first set of basic components further comprises an object, the predicate indicates a semantic relationship between the subject and the object,
the object node is configured to represent the object, and
said form the first data graph further comprises assigning, for each of the one or more first modifiers corresponding to the object, the first modifier as an attribute to the object node.
13. The system according to claim 11, wherein said extract the first set of basic components and the one or more first modifiers comprises:
analyzing the text to identify constituents of the first sentence; and
chunking the identified constituents of the first sentence to produce a plurality of chunk components,
wherein the first set of basic components and the one or more first modifiers are extracted from the plurality of chunk components.
14. The system according to claim 13, wherein the at least one processor is further configured to:
identify one or more of the plurality of chunk components as a named entity; and
label each of the one or more chunk components identified with a corresponding named entity class label.
15. The system according to claim 11, wherein the one or more first modifiers are one or more adjuncts of the first sentence, and the text is an unstructured text.
16. The system according to claim 11, wherein the first data graph is a directed data graph.
17. The system according to claim 11, wherein the at least one processor is further configured to:
extract one or more first additional sets of basic components from the first sentence;
extract, for each of the one or more first additional sets of basic components, one or more additional modifiers associated with the first additional set of basic components from the first sentence of the text, the first additional set of basic components comprising a subject and a predicate associated with the subject; and
form, for each of the one or more first additional sets of basic components, an additional data graph for the first sentence based on the first additional set of basic components and the one or more additional modifiers associated with the first additional set of basic components, the additional data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
the subject node and the edge of the additional data graph are configured to represent the subject and the predicate of the first additional set of basic components, respectively, and
said form the additional data graph comprises assigning, for each of the one or more additional modifiers corresponding to one of the subject and the predicate, the additional modifier as an attribute to the corresponding one of the subject node and the edge.
18. The system according to claim 17, wherein the at least one processor is further configured to merge the subject node of the first data graph and the subject node of the additional data graph of at least one of the one or more first additional sets of basic components from the first sentence as a common subject node if the subject represented by the subject node of the first data graph and the subject represented by the subject node of the additional data graph of the at least one of the one or more first additional sets of basic components correspond to each other.
19. The system according to claim 11, wherein the text comprises a plurality of sentences and the at least one processor is further configured to, for each additional sentence of the plurality of sentences:
extract a second set of basic components and one or more second modifiers associated with the second set of basic components from the additional sentence, the second set of basic components comprising a subject and a predicate associated with the subject; and
form a second data graph for the additional sentence based on the second set of basic components and the one or more second modifiers associated with the second set of basic components, the second data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
the subject node and the edge of the second data graph are configured to represent the subject and the predicate of the second set of basic components, respectively, and
said form the second data graph comprises assigning, for each of the one or more second modifiers corresponding to one of the subject and the predicate, the second modifier as an attribute to the corresponding one of the subject node and the edge.
20. A computer program product, embodied in one or more non-transitory computer-readable storage mediums, comprising instructions executable by at least one processor to perform a method of generating a structured knowledge data for a text comprising at least one sentence, the method comprising:
extracting a first set of basic components and one or more first modifiers associated with the first set of basic components from a first sentence of the text, the first set of basic components comprising a subject and a predicate associated with the subject; and
forming a first data graph for the first sentence based on the first set of basic components and the one or more first modifiers associated with the first set of basic components, the first data graph comprising a subject node, an object node and an edge connecting the subject node and the object node, wherein
the subject node and the edge of the first data graph are configured to represent the subject and the predicate of the first set of basic components, respectively, and
said forming the first data graph comprises assigning, for each of the one or more first modifiers corresponding to one of the subject node and the edge, the first modifier as an attribute to the corresponding one of the subject node and the edge.
Priority Applications (1)
| Application Number | Publication Number | Priority Date | Filing Date | Title |
|---|---|---|---|---|
| SG11202008351PA | SG11202008351PA (en) | 2018-03-06 | 2019-03-06 | Method and system for generating a structured knowledge data for a text |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| SG10201801825V | 2018-03-06 | ||
| SG10201801825V | 2018-03-06 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019172849A1 (en) | 2019-09-12 |
Family
ID=67847556
Family Applications (1)
| Application Number | Publication Number | Priority Date | Filing Date | Title |
|---|---|---|---|---|
| PCT/SG2019/050126 Ceased | WO2019172849A1 (en) | 2018-03-06 | 2019-03-06 | Method and system for generating a structured knowledge data for a text |
Country Status (2)
| Country | Link |
|---|---|
| SG (1) | SG11202008351PA (en) |
| WO (1) | WO2019172849A1 (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5966686A (en) * | 1996-06-28 | 1999-10-12 | Microsoft Corporation | Method and system for computing semantic logical forms from syntax trees |
| US20070016863A1 (en) * | 2005-07-08 | 2007-01-18 | Yan Qu | Method and apparatus for extracting and structuring domain terms |
| CN102693310A (en) * | 2012-05-28 | 2012-09-26 | 无锡成电科大科技发展有限公司 | Resource description framework querying method and system based on relational database |
| US20160292304A1 (en) * | 2015-04-01 | 2016-10-06 | Tata Consultancy Services Limited | Knowledge representation on action graph database |
| US20170255709A1 (en) * | 2016-03-01 | 2017-09-07 | Linkedin Corporation | Atomic updating of graph database index structures |
-
2019
- 2019-03-06 WO PCT/SG2019/050126 patent/WO2019172849A1/en not_active Ceased
- 2019-03-06 SG SG11202008351PA patent/SG11202008351PA/en unknown
Cited By (15)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11580127B1 (en) | 2018-12-21 | 2023-02-14 | Wells Fargo Bank, N.A. | User interfaces for database visualizations |
| US12306844B2 (en) | 2018-12-21 | 2025-05-20 | Wells Fargo Bank, N.A. | User interfaces for database visualizations |
| US11989198B1 (en) | 2018-12-21 | 2024-05-21 | Wells Fargo Bank, N.A. | User interfaces for database visualizations |
| CN113128226A (en) * | 2019-12-31 | 2021-07-16 | 阿里巴巴集团控股有限公司 | Named entity recognition method and device, electronic equipment and computer storage medium |
| CN113158671A (en) * | 2021-03-25 | 2021-07-23 | 胡明昊 | Open domain information extraction method combining named entity recognition |
| CN113158671B (en) * | 2021-03-25 | 2023-08-11 | 胡明昊 | Open domain information extraction method combined with named entity identification |
| US12147433B2 (en) | 2021-12-28 | 2024-11-19 | Wells Fargo Bank, N.A. | Semantic entity search using vector space |
| US11768837B1 (en) | 2021-12-28 | 2023-09-26 | Wells Fargo Bank, N.A. | Semantic entity search using vector space |
| US12072918B1 (en) | 2021-12-28 | 2024-08-27 | Wells Fargo Bank, N.A. | Machine learning using knowledge graphs |
| US12339861B2 (en) | 2022-04-28 | 2025-06-24 | Wells Fargo Bank, N.A. | Identity resolution in knowledge graph databases |
| US11880379B1 (en) | 2022-04-28 | 2024-01-23 | Wells Fargo Bank, N.A. | Identity resolution in knowledge graph databases |
| CN116090560A (en) * | 2023-04-06 | 2023-05-09 | 北京大学深圳研究生院 | Knowledge graph establishment method, device and system based on teaching materials |
| CN117174234B (en) * | 2023-11-03 | 2024-01-05 | 南京都昌信息科技有限公司 | Medical text data analysis method and system |
| CN117174234A (en) * | 2023-11-03 | 2023-12-05 | 南京都昌信息科技有限公司 | Medical text data analysis method and system |
| CN120471159A (en) * | 2025-07-14 | 2025-08-12 | 龙兴(杭州)航空电子有限公司 | A method and device for constructing a multimodal knowledge graph based on multi-level knowledge association |
Also Published As
| Publication number | Publication date |
|---|---|
| SG11202008351PA (en) | 2020-09-29 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| WO2019172849A1 (en) | Method and system for generating a structured knowledge data for a text | |
| US9292490B2 (en) | Unsupervised learning of deep patterns for semantic parsing | |
| Cooper | Type theory and semantics in flux | |
| US9361587B2 (en) | Authoring system for bayesian networks automatically extracted from text | |
| Zeni et al. | GaiusT: supporting the extraction of rights and obligations for regulatory compliance | |
| Hussain et al. | Approximation of COSMIC functional size to support early effort estimation in Agile | |
| Diamantopoulos et al. | Software requirements as an application domain for natural language processing | |
| Dragoni et al. | Combining natural language processing approaches for rule extraction from legal documents | |
| Arunthavanathan et al. | Support for traceability management of software artefacts using natural language processing | |
| Tangkawarow et al. | ID2SBVR: A method for extracting business vocabulary and rules from an informal document | |
| Quaresma et al. | Event extraction and representation: a case study for the Portuguese language | |
| Sateli et al. | Automatic construction of a semantic knowledge base from CEUR workshop proceedings | |
| Ramsay et al. | Machine Learning for Emotion Analysis in Python: Build AI-powered tools for analyzing emotion using natural language processing and machine learning | |
| Buey et al. | Automatic legal document analysis: Improving the results of information extraction processes using an ontology | |
| Agt-Rickauer | Supporting domain modeling with automated knowledge acquisition and modeling recommendations | |
| Baumann et al. | The road map to FAME: A framework for mining and formal evaluation of arguments | |
| Nazaruka et al. | Extracting Core Elements of TFM Functional Characteristics from Stanford CoreNLP Application Outcomes. | |
| Ciaghi et al. | Law modeling with ontological support and BPMN: A case study | |
| Cameron et al. | A hybrid approach to finding relevant social media content for complex domain specific information needs | |
| Boschetti et al. | Collaborative and Multidisciplinary Annotations of Ancient Texts: The Euporia System | |
| Mishra | Multimodal Extraction of Proofs and Theorems from the Scientific Literature | |
| Waltl | Semantic analysis and computational modeling of legal documents | |
| Nazaruka et al. | Using Stanford CoreNLP Capabilities for Semantic Information Extraction from Textual Descriptions | |
| Marano | Exploring Formal Models of Linguistic Data Structuring. Enhanced Solutions for Knowledge Management Systems Based on NLP Applications | |
| Boroghina et al. | Multi-Microworld Conversational Agent with RDF Knowledge Graph Integration | |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19763939; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 19763939; Country of ref document: EP; Kind code of ref document: A1 |