WO2025237327A1 - Gut microbe knowledge graph system - Google Patents
Gut microbe knowledge graph system
- Publication number
- WO2025237327A1 (application PCT/CN2025/094816)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- knowledge graph
- knowledge
- entity
- gut
- sub
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
Definitions
- This invention relates to the fields of medicine and treatment, and specifically to a knowledge graph system and reasoning system in the field of gut microbiota.
- the gut microbiota is the largest microbial community in the human body, with an adult carrying approximately 200 grams of gut microbes. These microbes contain about 30 times the genes of the human genome and could even be considered a vital organ in themselves. Gut microbes directly or indirectly influence other vital organs through pathways such as the gut-brain axis, gut-liver axis, and gut-lung axis. Therefore, the stability and dysregulation of the gut microbiota are closely related to human health and disease. They can help the human host enhance immunity, promote beneficial metabolism, and control weight gain, but they can also contribute to the development and progression of diseases such as autism, inflammatory bowel disease, and diabetes.
- GMKGs: gut microbe knowledge graphs
- CN117292846A discloses a method and apparatus for constructing a gut microbiome knowledge graph.
- the method involves acquiring initial literature information related to gut microbiome research; extracting relevant entities, relationships between entities, and clinical annotation model information from it; and standardizing, organizing, and integrating the extracted entities to construct a data framework for them. However, the gut microbiome knowledge graph constructed from this data framework, the relationships between entities, and the clinical annotation model information involves no data cleaning or redundancy removal, and no "multimodal uncertain reasoning system for knowledge graphs" is constructed. It extracts only unidirectional relationships using a relatively simple extraction method, and the final result is not further validated by a reasoning model.
- CN117196028A discloses a method and system for producing medical knowledge graphs based on knowledge graphs.
- the main steps include constructing a graph production model, defining an entity-relationship-attribute framework for the knowledge graph, thereby connecting entities in the text with entities in the knowledge base to generate a triplet knowledge set for visualization.
- This patent establishes a triple knowledge set, but it does not train on or evaluate the generated data, nor does it compute confidence scores through a function. It also does not involve constructing multimodal uncertain reasoning on top of the triples.
- CN114691896A discloses a method and apparatus for cleaning knowledge graph data.
- the method mainly includes training a knowledge graph embedding model and a triple classification model on the triples of the knowledge graph to be cleaned, and repairing erroneous triples using the global confidence score to obtain the cleaned knowledge graph.
- This patent only provides a commonly used data cleaning method for knowledge graphs and does not cover any specific application areas.
- CN115080764A discloses a medical similar entity classification method and system based on knowledge graphs and clustering algorithms, solving the problem of tedious manual annotation for similar entity classification. This patent also primarily provides a training method for a medical database triplet dataset, classifying positive and negative samples to obtain an entity similarity classification model.
- This invention proposes a gut microbiome knowledge graph system that, through the construction of a high-quality knowledge base, dynamic confidence assessment, and multimodal uncertain reasoning, achieves accurate prediction and verification of the associations between gut microbiota and drugs, diseases, and other factors.
- the first aspect of this invention relates to a method for constructing a knowledge graph.
- the system integrates a gut microbiome knowledge base, obtains information from public databases, performs data cleaning and redundancy removal, and retains gut microbiome-related knowledge; it extracts entity tables and relation tables, and further includes the extraction of attribute tables; and it uses entity tables to uniformly match entities in relation tables and attribute tables to construct a complete "gut microbiome knowledge base".
- entity tables and relational tables can be extracted from the entire database or a portion of the database, and attribute tables can be extracted from the entire database or a portion of the database after merging and deduplication.
- the entity table refers to professional terms in the fields of gut microbiota, genetic engineering, protein engineering, enzyme engineering, and biochemistry, including but not limited to genes, proteins, RNA, small molecules, pathways, reactions, nucleotides, and variations. These terms can be expressed in various languages, such as Chinese, English, German, French, and Japanese.
- each entity corresponds to at least one specialized term in the biomedical field, such as a proper noun. It is not necessarily a noun like "gene" or "protein"; it can also be a verb, a verb-object phrase such as "protein expression" or "gene modification", or a short phrase containing a few prepositions, such as "purify the protein". In every case the entity expresses a relatively complete meaning, though not the meaning of an entire sentence.
- nouns are preferred entities; preferably, Chinese and/or English are preferred languages.
- the relation table refers to the horizontal associations between the contents of each entity in the entity table, including but not limited to associations of therapeutic pathways, gene expression modes, chemical modifications, protein activation pathways, protein inhibition pathways, and protein auxiliary associations.
- the attribute table refers to the inherent attributes, superordinate attributes, and subordinate attributes of each entity in the entity table, including but not limited to biological classification, text description and definition, genomic sequence, proteomics sequence, microbiological classification, 16S rRNA sequence, 168 rRNA sequence, etc.
- the entities referred to here are the actual contents of entity tables, relation tables, and attribute tables, and not content that is a complete match of text. Based on the general understanding of text, entities are expressed in various ways, including translations of actual content in different languages, synonyms, near-synonyms, common aliases, general expressions, and changes in order that do not affect the understanding of meaning.
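The matching of entities across the entity, relation, and attribute tables can be illustrated with a minimal sketch. It assumes, purely for illustration, that each entity has a canonical ID and a list of known names (synonyms, aliases, translations) and that matching is a case-insensitive lookup; the IDs, names, and the `related_to` relation are hypothetical and do not come from the patent, which does not specify the matching algorithm.

```python
from typing import Dict, List, Tuple

def build_alias_index(entity_table: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every lower-cased name or alias to its canonical entity ID."""
    index: Dict[str, str] = {}
    for entity_id, names in entity_table.items():
        for name in names:
            index[name.strip().lower()] = entity_id
    return index

def match_relations(relations: List[Tuple[str, str, str]],
                    index: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Rewrite (head, relation, tail) rows to entity IDs; drop rows with unknown names."""
    matched = []
    for head, rel, tail in relations:
        h = index.get(head.strip().lower())
        t = index.get(tail.strip().lower())
        if h is not None and t is not None:
            matched.append((h, rel, t))
    return matched

# Hypothetical toy tables: one entity with an alias and a translated name, one with a synonym.
entity_table = {
    "E1": ["Escherichia coli", "E. coli", "大肠杆菌"],
    "E2": ["butyrate", "butanoate"],
}
relations = [
    ("E. coli", "related_to", "Butanoate"),      # both names resolve through aliases
    ("unknown microbe", "related_to", "butyrate"),  # head is not in the entity table
]

index = build_alias_index(entity_table)
matched = match_relations(relations, index)
print(matched)
```

A real system would need fuzzier matching (near-synonyms, word-order changes) than this exact lookup provides.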
- the original databases can be BioCyc, EcoCyc, MetaCyc, BacDive, BV-BRC, EMBL-EBI, NCBI-Taxonomy, and MicrobeWiki, etc.
- the number of gut microorganisms for which knowledge is retained is not limited; if a limit is needed, it can be 100-500, 200-400, 200-300, 200, or 300 gut microorganisms.
- the eight databases can first each be cleaned and deduplicated, retaining knowledge related to 200 gut microorganisms. After this separate processing, the BioCyc, EcoCyc, MetaCyc, BacDive, and BV-BRC databases can be merged and deduplicated, and entity and relation tables extracted.
- This process includes retrieving relevant databases, indexing documents, obtaining a small database after indexing, dividing it into training and validation sets, using a non-generative pre-trained language model for training and evaluation, obtaining a three-class classification model, obtaining the probabilities of the three relationships through a function, and calculating confidence scores.
- the databases linking gut microbiota to small molecule drug treatment for human diseases are retrieved, relevant literature is downloaded, and the literature is indexed using annotation tools, AI technology, and manual indexing (preferably using annotation tools, AI technology for automatic annotation, and manual review of annotation results).
- a small database is obtained, which is divided into training and validation sets.
- a non-generative pre-trained language model is used for training and evaluation to obtain a three-class classification model.
- the probabilities of the three relationships are obtained through functions, and confidence scores are calculated to finally obtain the "Gut Microbiota Small Molecule Drug Treatment Association Knowledge Base".
- the annotation refers to extracting associations between gut microbiome entities, small molecule entities, and disease entities based on entities already annotated on the annotation tool platform. Specifically, it can be determined whether an entity is a potential entity for relationship extraction based on its co-occurrence location, interval, and frequency of occurrence in the article.
- the co-occurrence location can be within the same paragraph (with a line break as the pause), sentence (with a period as the pause), paragraph type, or field (such as abstract, title, keywords, background technology, methods, discussion, etc.).
- the interval can be a specific number of characters, such as 1-10, 2-6, 2-8 characters, or more, ensuring a specific order for multiple entities or allowing for unrestricted order.
- the frequency of occurrence refers to the number of times an entity appears in the article, such as once, twice, three times, etc. For example, if an entity appears only once, it can be considered to have low relevance. It can appear in the entire text, in a specific paragraph type, or in combination, such as once in the abstract and more than five times in the entire text. Users can choose one of the above criteria for annotation or use a combination of criteria for comprehensive judgment.
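The three criteria above (co-occurrence location, interval, and frequency) can be combined as in the following minimal sketch. The function names, the thresholds, and the placeholder entity names are assumptions made for illustration; the sketch checks only sentence-level co-occurrence and looks only at the first mention of each entity in a sentence.

```python
from typing import List, Optional

def split_sentences(paragraph: str) -> List[str]:
    """Split a paragraph into sentences, using the period as the pause."""
    return [s.strip() for s in paragraph.split(".") if s.strip()]

def co_occur_in_sentence(paragraph: str, a: str, b: str,
                         max_gap: Optional[int] = None) -> bool:
    """True if a and b share a sentence, optionally at most max_gap characters apart."""
    a_low, b_low = a.lower(), b.lower()
    for sentence in split_sentences(paragraph):
        low = sentence.lower()
        i, j = low.find(a_low), low.find(b_low)  # first mention of each entity
        if i < 0 or j < 0:
            continue
        if max_gap is None:
            return True
        # Characters between the end of the earlier mention and the start of the later one.
        first_end = i + len(a) if i < j else j + len(b)
        if max(i, j) - first_end <= max_gap:
            return True
    return False

def frequency(text: str, entity: str) -> int:
    """Number of case-insensitive occurrences of the entity in the text."""
    return text.lower().count(entity.lower())

def is_candidate(abstract: str, full_text: str, a: str, b: str,
                 min_hits: int = 2) -> bool:
    """Combined rule: same sentence in the abstract AND each entity frequent enough overall."""
    return (co_occur_in_sentence(abstract, a, b)
            and frequency(full_text, a) >= min_hits
            and frequency(full_text, b) >= min_hits)

# Hypothetical toy article with placeholder entity names.
abstract = "BacteriumA reduced DiseaseX in mice. DrugY was not tested."
full_text = abstract + " BacteriumA was cultured. DiseaseX scores were measured."

print(is_candidate(abstract, full_text, "BacteriumA", "DiseaseX"))
print(is_candidate(abstract, full_text, "BacteriumA", "DrugY"))
```

Splitting on periods is deliberately naive; abbreviations such as "E. coli" would need a proper sentence splitter.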
- the comprehensive confidence score is ;
- the final layer of the relation extraction model is a three-class classification model.
- a SoftMax function is used to obtain the probability that the paragraph belongs to each of the three relations. This probability can be used directly as the confidence score for non-unrelated triples. Assume a triple has n sources in total, where $y_i$, $j_i$, and $s_i$ denote, for the i-th source article, its publication year, the impact factor of its journal, and the confidence score given by the relation extraction model, respectively.
- the overall confidence score of this triple is defined as follows: […], where $k_n$, $k_{y_i}$, and $k_{j_i}$ are piecewise-linear weighting coefficients based on $n$, $y_i$, and $j_i$, respectively. These weighting coefficients are monotonically increasing and all less than 1, and the inflection points of the linear segments need to be defined manually based on experience. Finally, triples predicted as unrelated are deleted, and the remaining (non-unrelated) triples are treated as related. Triples with confidence scores greater than or equal to a certain value (e.g., 0.5) are retained, resulting in quadruples containing "head entity, relation, tail entity, and confidence score".
- the association between gut microbiota and disease can be defined as activate or inhibit.
- non-unrelated triples include (a bacterium, activate, a disease), while quadruples obtained after further screening include (a bacterium, activate, a disease, confidence score 0.8).
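The step from three-class model output to a retained quadruple can be sketched as follows. This is a minimal illustration that assumes the three classes are "unrelated", "activate", and "inhibit", and that uses the SoftMax probability of a single paragraph as the confidence score; it does not implement the comprehensive score over n sources, whose formula is not available here. The logits are made-up numbers.

```python
import math
from typing import List, Optional, Tuple

# Assumed label set: the text names "activate" and "inhibit"; "unrelated" is the third class.
LABELS = ("unrelated", "activate", "inhibit")

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable SoftMax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_quadruple(head: str, tail: str, logits: List[float],
                 threshold: float = 0.5) -> Optional[Tuple[str, str, str, float]]:
    """Return (head, relation, tail, confidence), or None if the triple is dropped."""
    probs = softmax(logits)
    best = max(range(len(LABELS)), key=probs.__getitem__)
    # Drop triples predicted as unrelated, and triples below the confidence threshold.
    if LABELS[best] == "unrelated" or probs[best] < threshold:
        return None
    return (head, LABELS[best], tail, probs[best])

kept = to_quadruple("a bacterium", "a disease", [0.0, 3.0, 0.0])
dropped_unrelated = to_quadruple("a bacterium", "a disease", [3.0, 0.0, 0.0])
dropped_low = to_quadruple("a bacterium", "a disease", [0.1, 0.2, 0.0])
print(kept, dropped_unrelated, dropped_low)
```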
- the retrieved "databases linking gut microbiota with small molecule drug treatment for human diseases" can be PubTator3 raw XML files containing PubMed abstracts and PMC full texts;
- the annotation tool platform can be PubTator3, officially developed by NCBI, used to annotate entities and some relationships in the article.
- the annotated entities include small chemical molecules, diseases, genes, species, and variants, as well as some relationships. However, the annotated relationships do not include knowledge from the "Gut Microbiome Knowledge Base".
- PubTator3 is just an example. In fact, there are many existing tool platforms that can achieve this, such as GNormPlus, MetaMap, BERN2, etc., and one or more can be selected.
- the training set and the validation set are divided according to a specific ratio, such as 6:2, 7:1, 8:1, 9:2, 9:1, or 8:2.
- training set refers to the labeled subset of data used during the training of a machine learning model to optimize model parameters
- validation set refers to the independent subset of data used to evaluate model performance, which is used to adjust hyperparameters or select the best model to prevent overfitting, and its division ratio corresponds to that of the training set.
- the aforementioned clinical medical database utilizes molecular biology data provided by authoritative clinical medical databases, such as the PMapp database, to offer knowledge related to disease diagnosis, medication guidelines, and disease prevention, thus providing knowledge support for the gut microbiota knowledge graph (see Figure 1, where GMKG-200 represents the "gut microbiota knowledge graph").
- the database integrates multiple authoritative molecular biology and clinical medicine databases. Given the database's authority and the fact that it has been reviewed by field experts, a confidence score of 1.0 is defined. There is no limit to the number of databases; for example, it can be 50-100, or 55-90, such as 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, etc.
- the three knowledge bases, namely the "Gut Microbiome Knowledge Base", the "Gut Microbiome Small Molecule Drug Therapy Association Knowledge Base", and the "Clinical Medical Database", are combined, constructed, and connected to align small molecules, drugs, and disease entities, thereby obtaining the final gut microbiome knowledge graph.
- the second aspect of this invention relates to a knowledge graph multimodal uncertainty reasoning system.
- each of the head entity, tail entity, and relation has a randomly initialized embedding vector, and each also has an embedding vector based on multimodal annotation, which is then normalized based on the representation matrix.
- a dynamic gating network of a hybrid expert system is used to fuse the two embedding vectors of each entity to obtain its final vector representation
- the final vector representation of the triples is substituted into the basic knowledge graph embedding model and transformation function to generate the predicted confidence score, which is then compared with the true confidence score to define the loss function.
- hybrid expert system refers to a machine learning architecture that integrates multi-source data through a dynamic gating network.
- This system deploys multiple expert networks (sub-models) in parallel, each processing specific types of data features (such as randomly initialized vectors and multimodal annotation vectors). It automatically learns the weight allocation of different expert networks through a trainable gating mechanism, and finally fuses the outputs of all expert networks to generate a unified vector representation of entities.
- this module is specifically used to coordinate the fusion of initial embedding vectors and multimodal feature vectors to optimize the vectorized representation of entity relationships in knowledge graphs.
- the initialization vector is , and
- the embedding vector is and
- the extraction methods for multimodal feature vectors vary depending on the entity type.
- the vector normalization process can be performed on the intestinal microbial entity based on the gene expression matrix, or it can be performed on the entity based on the ontology category through multi-class vector average pooling, or it can be performed using encoding software.
- gene entities are processed by multi-class vector average pooling based on gene ontology categories; disease entities are encoded using PubMedBERT or BioLinkBERT based on text descriptions; protein entities are encoded using ProteinBERT based on amino acid sequences; and small molecule and drug entities are encoded using ChemBERT based on SMILES formulas.
- average pooling represents a feature dimensionality reduction method, specifically referring to the arithmetic average of multi-class feature vectors in structured classification systems such as Gene Ontology (GO): feature vectors (e.g., GO term vectors) corresponding to multiple ontology categories belonging to the same gene entity are summed element-wise and then divided by the number of categories to generate a single feature vector with overall representativeness.
- this method is used to fuse multi-dimensional ontology annotation information of gene entities, constructing a normalized gene representation vector by retaining the statistical features of each classification dimension, thus eliminating redundant information while maintaining the integrity of the ontology semantics.
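The average pooling described above amounts to an element-wise arithmetic mean. The sketch below shows it with plain Python lists; the three 2-dimensional "GO term vectors" are made-up numbers used only to make the arithmetic visible.

```python
from typing import List

def average_pool(term_vectors: List[List[float]]) -> List[float]:
    """Sum the vectors element-wise and divide by the number of categories."""
    if not term_vectors:
        raise ValueError("at least one term vector is required")
    dim = len(term_vectors[0])
    if any(len(v) != dim for v in term_vectors):
        raise ValueError("all term vectors must have the same dimension")
    n = len(term_vectors)
    return [sum(v[k] for v in term_vectors) / n for k in range(dim)]

# Hypothetical vectors for three ontology categories annotated on one gene.
go_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
gene_vector = average_pool(go_vectors)
print(gene_vector)
```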
- Taking the head entity vector as an example: […], where […] is a relation-based weight vector that is also randomly initialized. The tail entity vector is calculated similarly. For an entity that does not exist in the "Gut Microbiome Knowledge Graph" (GMKG-200), the corresponding randomly initialized vectors are simply set to the zero vector, so their gated contribution equals zero and the final head or tail vector is based entirely on its multimodal embedding vector. Therefore, as long as multimodal annotations are extracted for an entity, this model can achieve cold-start inference.
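The fusion formula itself is not recoverable from this text, so the following is only a sketch of one gating scheme that reproduces the cold-start behavior described: an element-wise gate blends the randomly initialized (structural) vector with the multimodal vector, and setting both the structural vector and the gate to zero leaves the multimodal vector unchanged. The gate form, the function name `fuse`, and all numbers are assumptions, not the patent's definition.

```python
from typing import List

def fuse(structural: List[float], modal: List[float], gate: List[float]) -> List[float]:
    """Element-wise blend: gate * structural + (1 - gate) * modal (assumed form)."""
    if not (len(structural) == len(modal) == len(gate)):
        raise ValueError("all vectors must have the same dimension")
    return [g * s + (1.0 - g) * m for s, m, g in zip(structural, modal, gate)]

# Entity present in the graph: both vectors contribute according to the gate.
known = fuse(structural=[1.0, 1.0], modal=[3.0, 5.0], gate=[0.5, 0.0])

# Cold-start entity: no structural embedding exists, so it and the gate are zero vectors
# and the fused vector equals the multimodal vector.
cold = fuse(structural=[0.0, 0.0], modal=[3.0, 5.0], gate=[0.0, 0.0])
print(known, cold)
```

In a trained model the gate would be produced by the dynamic gating network rather than supplied by hand.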
- the final vector representation of the triple is substituted into the basic knowledge graph embedding model and the transformation function to generate the predicted confidence score.
- the loss function is defined by comparing the predicted confidence score with the true confidence score.
- TransE represents the distance after translation in planar space:
- RotatE represents the distance after rotation in complex space:
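For reference, the distance functions of TransE and RotatE as commonly given in the knowledge graph embedding literature can be sketched as below. These are the standard forms, not necessarily the exact expressions used in this patent, whose formulas are missing here; the choice of the L2 norm for TransE and the sum of element-wise moduli for RotatE is an assumption.

```python
import cmath
import math
from typing import List

def transe_distance(h: List[float], r: List[float], t: List[float]) -> float:
    """TransE: the relation is a translation, distance = || h + r - t || (L2 norm here)."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def rotate_distance(h: List[complex], phases: List[float], t: List[complex]) -> float:
    """RotatE: the relation rotates each complex coordinate of h by a phase angle.

    Distance = sum over coordinates of | h_k * exp(i * phase_k) - t_k |.
    """
    return sum(abs(hk * cmath.exp(1j * p) - tk) for hk, p, tk in zip(h, phases, t))

# A translation that maps h exactly onto t gives distance 0.
d_transe = transe_distance([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])

# Rotating 1+0j by 90 degrees gives 0+1j, so the distance is (numerically) 0.
d_rotate = rotate_distance([1 + 0j], [math.pi / 2], [0 + 1j])
print(d_transe, d_rotate)
```

A smaller distance indicates a more plausible triple; the transformation function that maps this distance to a confidence score is not specified in this text.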
- gut microbiota are directly linked to diseases and drugs, and predicting new potential associations is equivalent to performing a simple knowledge graph completion task.
- gut microbiota are not directly linked to human genes; they can only be indirectly linked through multiple steps via diseases, drugs, and small molecules, which traditional knowledge graph link prediction methods cannot achieve.
- by using confidence scores to treat existing and newly inferred relationships uniformly, only the expected probability needs to be calculated comprehensively, so that confidence score estimation and association ranking of long-chain relationships can be achieved effectively (see Figure 5).
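One simple way to score a long-chain relationship is sketched below. It assumes that the confidence scores of the edges along a path can be treated as independent probabilities and multiplied; this product rule is an illustrative assumption, since the text does not give the formula for the expected probability. The paths and scores are made up.

```python
from math import prod  # Python 3.8+
from typing import List, Tuple

Path = Tuple[List[str], List[float]]  # (node names, confidence score of each edge)

def path_confidence(edge_scores: List[float]) -> float:
    """Product of edge confidences along one path (assumed independence)."""
    return prod(edge_scores)

def rank_paths(paths: List[Path]) -> List[Tuple[List[str], float]]:
    """Score every path and sort from most to least confident."""
    scored = [(nodes, path_confidence(scores)) for nodes, scores in paths]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical paths linking a microbe to a gene through intermediate entities.
paths = [
    (["microbe", "small molecule", "drug", "gene"], [0.9, 0.6, 1.0]),
    (["microbe", "disease", "gene"], [0.8, 1.0]),
]
ranked = rank_paths(paths)
for nodes, score in ranked:
    print(" -> ".join(nodes), round(score, 3))
```

Edges taken from the curated knowledge bases carry a score of 1.0, so under this rule they do not reduce the confidence of a path.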
- the third aspect of this invention relates to a gut microbiome knowledge graph system.
- the gut microbiome knowledge graph system includes a "gut microbiome knowledge base", a "gut microbiome small molecule drug therapy association knowledge base", and a "clinical medical database".
- the "gut microbiome knowledge base" contains information on the knowledge relationships between bacteria, obtained after extraction and processing from the original databases.
- the "gut microbiome small molecule drug therapy association knowledge base" contains information on gut microbiome and small molecule drug therapy for human diseases, obtained after extraction and processing from the original databases.
- the "gut microbiome knowledge base" is constructed using the method for obtaining raw data of the "gut microbiome knowledge base" described in the first aspect of the present invention.
- the "gut microbiome small molecule drug therapy association knowledge base" is constructed using the construction method of the "gut microbiome small molecule drug therapy association knowledge base" described in the first aspect of the present invention.
- the "clinical medical database" is constructed using the "construction of clinical medical database" method described in the first aspect of the present invention.
- the gut microbiome knowledge graph system also includes a knowledge graph multimodal uncertain reasoning system, used to predict potential associated diseases, drugs, genes, etc. for gut microbiota.
- the gut microbiome knowledge graph system also includes a graph database that can be used for querying, providing visualized query results, and enabling open-source data sharing.
- The graph database can be an open-source graph database, a commercial/cloud graph database, or an RDF/semantic web graph database. All of these databases enable visual queries and open-source data sharing. The specific choice depends on factors such as data scale, query complexity, and deployment environment.
- Open-source graph databases include ArangoDB, Neo4j, and JanusGraph
- commercial/cloud graph databases include Amazon Neptune and Microsoft Azure Cosmos DB
- RDF/semantic web graph databases include Virtuoso and Stardog.
- the fourth aspect of this invention relates to a knowledge graph system construction apparatus.
- the knowledge graph system construction apparatus includes:
- the data acquisition module is used to acquire raw data related to gut microbiota, specifically, to acquire raw database information for the "Gut Microbiota Knowledge Base", "Gut Microbiota Small Molecule Drug Therapy Related Knowledge Base", and "Clinical Medical Database".
- the data processing module includes cleaning and deduplicating data, retaining target-related knowledge, and extracting entity tables, relation tables, and attribute tables; or using annotation tool platforms, AI technology, and manual indexing to index the acquired data.
- the data training module is used to divide the labeled data into training and validation sets, train and evaluate it using a non-generative pre-trained language model, obtain a three-class classification model, obtain the probabilities of the three relationships through a function, and calculate the confidence score.
- the data matching module is used to uniformly match entities in the entity table with entities in the relationship table and attribute table; and to obtain matched data after data processing based on confidence scores.
- a data reasoning module which is used to import the acquired data into the knowledge graph multimodal uncertain reasoning system to predict potential associated diseases, drugs, genes, etc. for gut microbiota.
- each of the head entity, tail entity, and relation has a randomly initialized embedding vector, as well as an embedding vector based on multimodal annotation.
- This module performs vector normalization processing based on the expression matrix.
- hybrid expert system module which uses a dynamic gating network of a hybrid expert system to fuse the two embedding vectors of each entity to obtain its final vector representation
- comparison module which is used to input the final vector representation of the triples into the basic knowledge graph embedding model and transformation function to generate the predicted confidence score, and compare it with the true confidence score to define the loss function.
- the knowledge graph system construction device may also include a display module, which can be used for querying, providing visualized query results, and enabling open-source data sharing.
- the fifth aspect of this invention relates to computer programs and storage media.
- a computer storage medium can store the gut microbiome knowledge graph construction program designed according to the first to fourth aspects of the present invention, as well as the gut microbiome knowledge graph established thereon.
- this invention constructs a gut microbiome knowledge graph (GMKG) with broad knowledge coverage (multi-topic), high knowledge completeness (multi-source), rich knowledge modalities (multi-modal), and confidence scores.
- a multi-modal uncertainty reasoning system is established, utilizing multi-modal annotations and confidence scores to achieve cold-start and long-chain reasoning, improving the prediction range and interpretability of traditional reasoning models.
- This not only helps gut microbiome researchers summarize existing authoritative and cutting-edge knowledge (e.g., constructing a Neo4j graph database for convenient querying and searching), but also serves as a large-scale screening tool to discover new research directions from a forward-looking perspective for subsequent in-depth experimental verification (publishing confidence scores and ranking lists of new reasoning knowledge).
- it will enable more accurate and rapid prevention and treatment of various diseases in clinical decision-making (publishing trained entity and relation embedding vectors as pre-training vectors for downstream clinical tasks).
- This invention addresses two major challenges in current biomedical reasoning: cold-start and long-chain reasoning. It achieves cold-start reasoning by extracting multimodal annotations of biomedical entities and utilizing a hybrid expert system to supplement the knowledge gaps in triples, thus enhancing predictive ability for small and zero-sample scenarios. Long-chain reasoning is achieved by calculating the comprehensive confidence score of relational paths using confidence scores. The resulting relational paths improve the interpretability of biomedical knowledge, acting similarly to pathways. This invention overcomes four main shortcomings: narrow knowledge coverage, low knowledge completeness, fixed knowledge structure, and simplistic knowledge reasoning.
- Figure 1 is an overview of the gut microbiota knowledge graph system.
- Figure 2 shows the data processing of the "Gut Microbiome Knowledge Base".
- Figure 3 shows the establishment of the "Gut Microbiota Small Molecule Drug Therapy Association Knowledge Base".
- Figure 4 shows the construction of a unified knowledge base for small molecules, drugs, and disease entities.
- Figure 5 shows a knowledge graph multimodal uncertainty reasoning system.
- a gut microbiota knowledge graph was constructed and a multimodal uncertainty reasoning system was built and operated.
- A large-scale comprehensive gut microbiome knowledge base, named "Zhuowei Knowledge Base", was constructed by integrating eight public databases covering biological classification, molecular pathways, and multi-omics.
- the Zhuowei Knowledge Base fully downloads and integrates eight original databases: BioCyc, EcoCyc, MetaCyc, BacDive, BV-BRC, EMBL-EBI, NCBI-Taxonomy, and MicrobeWiki.
- each of the eight databases underwent data cleaning and redundancy removal, retaining only knowledge related to 200 gut microorganisms.
- the BioCyc, EcoCyc, MetaCyc, BacDive, and BV-BRC databases were merged and redundancy removed, extracting entity and relational tables.
- PubTator3 is an annotation tool platform officially developed by NCBI, which annotates Chemical, Disease, Gene, Species, and Variant entities and some relationships in articles. However, the annotated relationships do not include Species entities, meaning there is a lack of knowledge related to gut microbiota.
- Two rounds of automatic annotation were performed using GPT-4. Paragraphs with inconsistent results were then manually reviewed to obtain the final annotation results.
- prompt words were added before paragraphs to assist in text generation, and entity annotation tokens were added before and after the entity context for marking (gut microbial entities are @gm-1$, @/gm-1$, @gm-2$, and @/gm-2$, small molecules are @sm-1$ and @/sm-1$, and diseases are @di-1$ and @/di-1$).
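The entity marking described above can be sketched as follows. The marker strings (`@gm-1$`, `@/gm-1$`, `@di-1$`, `@/di-1$`) come from the text; the helper names, the single space placed between a marker and the entity, and the placeholder sentence are assumptions for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, tag such as "gm-1")

def span_of(text: str, mention: str, tag: str) -> Span:
    """Locate the first occurrence of a mention and attach a tag to it."""
    start = text.index(mention)
    return (start, start + len(mention), tag)

def mark_entities(text: str, spans: List[Span]) -> str:
    """Wrap each non-overlapping span in its opening and closing marker tokens."""
    out = text
    # Work from the last span to the first so earlier offsets stay valid.
    for start, end, tag in sorted(spans, key=lambda s: s[0], reverse=True):
        out = out[:start] + f"@{tag}$ " + out[start:end] + f" @/{tag}$" + out[end:]
    return out

# Hypothetical paragraph with placeholder entity names.
paragraph = "BacteriumA reduces DiseaseX."
spans = [span_of(paragraph, "BacteriumA", "gm-1"), span_of(paragraph, "DiseaseX", "di-1")]
marked = mark_entities(paragraph, spans)
print(marked)
```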
- the labeled small dataset is divided into training and validation sets in an 8:2 ratio, and fine-tuned and evaluated using non-generative pre-trained language models such as PubMedBERT or BioLinkBERT.
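An 8:2 split of the labeled paragraphs can be done as in the sketch below. The shuffle seed and the integer stand-ins for labeled samples are assumptions; fine-tuning the language model itself is outside the scope of this example.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

def split_dataset(samples: Sequence[T], train_ratio: float = 0.8,
                  seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the samples and cut it into training and validation sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]

# Ten stand-in samples; a real dataset would hold (paragraph, label) pairs.
train_set, validation_set = split_dataset(range(10))
print(len(train_set), len(validation_set))
```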
- the last layer of this relation extraction model is a three-class classification model, which uses a SoftMax function to obtain the probability that the paragraph belongs to each of the three relations. This probability can be used directly as the confidence score for non-unrelated triples.
- Assume a triple has n sources in total, where $y_i$, $j_i$, and $s_i$ denote, for the i-th source article, its publication year, the impact factor of its journal, and the confidence score given by the relation extraction model, respectively.
- the overall confidence score of this triple is defined as follows: […], where $k_n$, $k_{y_i}$, and $k_{j_i}$ are piecewise-linear weighting coefficients based on $n$, $y_i$, and $j_i$, respectively. These weighting coefficients are monotonically increasing and all less than 1, and the inflection points of the linear pieces need to be defined manually based on experience.
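A piecewise-linear, monotonically increasing weighting coefficient with manually chosen inflection points can be sketched as follows. The inflection points shown for a journal impact factor are invented for illustration, and the sketch covers only a single coefficient; it does not combine the coefficients into the overall confidence score, whose formula is not available here.

```python
from typing import List, Tuple

def piecewise_linear(x: float, points: List[Tuple[float, float]]) -> float:
    """Interpolate linearly between sorted (x, weight) inflection points.

    Values of x outside the covered range are clamped to the first or last weight.
    """
    if x <= points[0][0]:
        return points[0][1]
    for (x0, w0), (x1, w1) in zip(points, points[1:]):
        if x <= x1:
            return w0 + (w1 - w0) * (x - x0) / (x1 - x0)
    return points[-1][1]

# Hypothetical inflection points for an impact-factor weight: increasing, always below 1.
IMPACT_FACTOR_POINTS = [(0.0, 0.5), (5.0, 0.8), (20.0, 0.95)]

for impact_factor in (0.0, 2.5, 5.0, 100.0):
    print(impact_factor, round(piecewise_linear(impact_factor, IMPACT_FACTOR_POINTS), 3))
```

The same helper could serve the coefficients based on the number of sources and on the publication year, each with its own manually defined inflection points.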
- a multimodal, uncertain gut microbiome knowledge graph was constructed based on three parts: microbe-microbe, microbe-human, and human-human, and named "GMKG-200".
- the knowledge about bacteria is provided by the Zhuowei Knowledge Base, which is structured and extracted from eight authoritative databases. It has already passed the manual review of researchers in the early stage, so the confidence score is directly defined as 1.0.
- the human-human knowledge base, namely the "clinical medical database", is provided by the PMapp database, which integrates 61 authoritative molecular biology and clinical medical databases previously built by the inventors.
- the confidence score is also defined as 1.0, which can provide complete knowledge support for the gut microbiome knowledge graph (GMKG-200).
- PubTator3 connects the Zhuowei Knowledge Base and the PMapp Knowledge Base; simply aligning small molecule, drug, and disease entities allows for the final construction of the gut microbiota knowledge graph (GMKG-200).
- a knowledge graph multimodal uncertainty reasoning system is constructed based on triples of the gut microbiome knowledge graph (GMKG-200) to predict potential associated diseases, drugs, and genes for gut microbiota.
- the head entity, tail entity, and relation each have a randomly initialized embedding vector.
- the head entity and the tail entity each have an embedding vector based on multimodal annotations.
- the methods for extracting multimodal feature vectors differ for different entity types.
- gut microbial entities are processed using vector normalization based on gene expression matrices; gene entities are processed using multi-class vector average pooling based on Gene Ontology categories; disease entities are encoded using PubMedBERT or BioLinkBERT based on text descriptions; protein entities are encoded using ProteinBERT based on amino acid sequences; and small molecule and drug entities are encoded using ChemBERTa based on SMILES strings.
- a dynamic gating network of a mixture-of-experts system is used to fuse the two embedding vectors of each entity to obtain its final vector representation.
- taking the head entity vector as an example, the fusion formula (not reproduced in this text) uses a relation-based weight vector, which is also randomly initialized. The tail entity vector is calculated similarly.
- for an entity that does not exist in GMKG-200, its randomly initialized head or tail embedding is simply set to the zero vector; the corresponding term then equals zero, and the final head or tail vector is based entirely on its multimodal embedding. Therefore, as long as multimodal annotations can be extracted for an entity, this model can achieve cold-start inference.
- a graph database, which can be used for querying, presenting query results visually, and sharing the data as open source.
- Neo4j graph database can be used to achieve the above functions.
- gut microbes are directly linked to diseases and drugs, making the prediction of new potential associations akin to a simple knowledge graph completion task.
- gut microbes are not directly linked to human genes; associations are only established indirectly through multiple steps via diseases, drugs, and small molecules, which traditional knowledge graph link prediction methods cannot achieve.
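The entity annotation tokens described in the list above (@gm-1$ ... @/gm-1$, @sm-1$ ... @/sm-1$, @di-1$ ... @/di-1$) can be illustrated with a minimal sketch. The function name `mark_entities` and the (start, end, tag) span representation are assumptions made for illustration; only the token shapes come from the text, and the prompt words added before each paragraph are not shown.

```python
def mark_entities(text, spans):
    """Wrap each (start, end, tag) character span in @tag$ ... @/tag$ markers.

    Spans are assumed to be non-overlapping. Tags such as "gm-1", "sm-1",
    and "di-1" follow the token shapes given in the text.
    """
    out = text
    # Insert from the rightmost span first so earlier offsets stay valid.
    for start, end, tag in sorted(spans, reverse=True):
        out = f"{out[:start]}@{tag}$ {out[start:end]} @/{tag}$" + out[end:]
    return out
```

Calling it on a sentence with one microbe span and one disease span yields a paragraph in which both mentions are delimited for the relation extraction model.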
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computing Systems (AREA)
- Quality & Reliability (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- Bioethics (AREA)
- Biophysics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Medicines Containing Material From Animals Or Micro-Organisms (AREA)
Abstract
Description
This invention relates to the fields of medicine and therapy, and specifically to a knowledge graph system and a reasoning system in the field of gut microbiota.
The gut microbiota is the largest microbial community in the human body. An adult carries approximately 200 grams of gut microbes, which contain about 30 times as many genes as the human genome and can even be regarded as an important organ in their own right. Gut microbes act directly or indirectly on other vital organs through pathways such as the gut-brain, gut-liver, and gut-lung axes, so the stability and dysregulation of the gut microbiota are closely tied to human health and disease. Gut microbes can help the human host strengthen immunity, promote beneficial metabolism, and control weight gain, but they can also contribute to the onset and progression of diseases such as autism, inflammatory bowel disease, and diabetes. In recent years, the emergence and spread of high-throughput sequencing has significantly lowered the barrier to human microbiome research. Multi-omics and clinical research on gut microbes has become a very active field, and many studies have experimentally validated and analyzed various associations in depth. However, because gut microbes are highly variable, current findings may be only the tip of the iceberg; the specific functional characteristics of many gut microbes in the human host, and their mechanisms of action, remain to be discovered.
To fully understand the influence of gut microbes on human health and disease, researchers should look beyond experimental methods, which are reliable but slow and laborious, and microbiome-omics methods, which are efficient but severely limited, and pay more attention to biomedical big data. With the rapid publication of new literature, the proliferation of specialized databases, and the massive accumulation of clinical data, biomedicine has gradually shifted from "evidence-based medicine" driven by subjective experience to "precision medicine" driven by objective data. A very large body of biomedical knowledge has now formed, and storing it effectively while supporting deep knowledge mining and fusion reasoning is a major challenge. In recent years, knowledge graphs, which are heterogeneous networks, have gradually replaced non-relational networks that cannot distinguish complex biological effects, and are increasingly used to model knowledge systems in biomedicine. As a form of knowledge storage, a knowledge graph links the massive knowledge scattered across different carriers in the form of triples (head entity, relation, tail entity), and has attracted wide attention from academia and industry.
However, the real world changes constantly, and almost every existing knowledge graph is incomplete. Using existing knowledge to infer new, reliable knowledge is therefore important for improving both the completeness and the extensibility of knowledge. Inductive reasoning based on knowledge graph embedding has become mainstream in biomedicine and other fields: it maps entities and relations into a low-dimensional space, which significantly improves computational efficiency, mitigates data sparsity, and enables the fusion of heterogeneous information. Knowledge graph embedding can be subdivided into translation models represented by RotatE, semantic models represented by HolE, and neural network models represented by ConvKB. Multimodal knowledge graph embedding is another major direction; examples include RSN, which incorporates relation paths, KEPLER, which incorporates entity descriptions, TKRL, which incorporates entity types, and IKRL, which incorporates entity images. In addition, UKGE and UKGSE incorporate confidence scores to embed uncertain knowledge graphs. These methods have played an important role in downstream tasks such as drug repurposing and disease-gene association prediction.
In recent years, a number of teams have attempted to construct gut microbe knowledge graphs (GMKGs), such as Food4healthKG, KG4NH, MiKG4MD, and MGMLink, but no mature project has yet been deployed in practice. Although these works offer useful inspiration for the field, they suffer to varying degrees from the following shortcomings:
(1) Most analyze only one or two diseases, with mental health disorders accounting for more than half, so their scope of application is very limited. Knowledge coverage should be broadened from a single disease to as many diseases as possible, because the knowledge systems of different diseases are sometimes interconnected;
(2) Most only extract structured knowledge from specialized databases, or extract knowledge by manually reading a small number of relevant papers, which yields a small amount of data, incomplete knowledge, and conclusions that are not convincing;
(3) After construction, most only use graph databases such as Neo4j to visualize and query the knowledge graph, and cannot infer reliable new knowledge from existing knowledge. Still less can they perform more complete and diverse multimodal reasoning based on annotation information beyond the triples;
(4) All of the constructed knowledge graphs are deterministic: a piece of knowledge either exists or does not exist, and either no confidence score is assigned according to the actual situation or the score is assigned unreasonably. Still less can they perform more reasonable uncertainty reasoning based on confidence scores;
(5) Traditional knowledge graph reasoning models can only predict relations that already exist in the graph, that is, perform knowledge graph completion, so their prediction pattern is very fixed;
(6) Traditional knowledge graph reasoning models can only make predictions for entities that already exist in the graph, that is, perform warm-start inference, so their prediction range is rather limited.
The prior art contains some reports on knowledge graph construction. For example, CN117292846A relates to a method and apparatus for constructing a gut microbiome knowledge graph: initial literature information on gut microbiome research is acquired; information extraction is performed on it to extract relevant entities, the associations between those entities, and clinical annotation model information; the extracted entities are standardized, organized, and integrated to construct a data framework for the entities; and the gut microbiome knowledge graph is constructed from the data framework, the associations between entities, and the clinical annotation model information. That method does not clean or deduplicate the acquired data, nor does it construct a "knowledge graph multimodal uncertainty reasoning system." It only extracts one-directional associations in a relatively simple way, and the final result is not validated by any further reasoning model.
CN117196028A discloses a method and system for producing medical knowledge graphs. Its main steps are to construct a graph production model and define an entity-relation-attribute framework for the graph knowledge, so that entities in text are linked to entities in the knowledge base and a set of knowledge triples is generated for visualization. That patent builds a set of triples, but it does not train on or evaluate the generated data, does not compute appropriate confidence scores with any function, and does not address building multimodal uncertainty reasoning on top of the triples.
CN114691896A discloses a method and apparatus for cleaning knowledge graph data, which mainly trains a knowledge graph embedding model and a triple classification model on the triples of the knowledge graph to be cleaned, and repairs erroneous triples using the global confidence to obtain a cleaned knowledge graph. That patent only provides a generally applicable data cleaning method for knowledge graphs and does not address any specific application domain.
CN115080764A discloses a method and system for classifying similar medical entities based on knowledge graphs and clustering algorithms, which addresses the tedium of manually annotating similar-entity classifications. That patent likewise mainly provides a training method for triple datasets from medical databases, classifying positive and negative samples to obtain a similar-entity classification model.
In the prior art, methods for constructing gut microbiome knowledge graphs mostly suffer from insufficient data cleaning, a lack of multimodal feature fusion, and an inability to perform reliable reasoning and validation, which limits the accuracy and practicality of the resulting graphs. This invention proposes a gut microbiome knowledge graph system that, by constructing high-quality knowledge bases, dynamically assessing confidence, and performing multimodal uncertainty reasoning, achieves accurate prediction and validation of associations between gut microbes and drugs, diseases, and the like.
The first aspect of this invention relates to a method for constructing a knowledge graph.
1. Obtaining raw data for the "Gut Microbiome Knowledge Base" (microbe-microbe knowledge base)
Gut microbiome knowledge is integrated: information is obtained from public databases, cleaned, and deduplicated, and the knowledge related to gut microbes is retained. An entity table and a relation table are extracted, and optionally an attribute table as well. The entity table is then used to uniformly match the entities in the relation table and the attribute table, forming a complete "Gut Microbiome Knowledge Base".
Furthermore, the entity table and relation table can be extracted from all of the databases or from a selected subset, and the attribute table can be extracted after merging and deduplicating all of the databases or a selected subset.
The entity table contains specialized terms from fields such as gut microbes, genetic engineering, protein engineering, enzyme engineering, and biochemistry, including but not limited to genes, proteins, RNA, small molecules, pathways, reactions, nucleotides, and variants. The terms can be expressed in various languages, such as Chinese, English, German, French, and Japanese. Specifically, each entity corresponds to at least one specialized term in the biomedical field. An entity is not necessarily a noun such as "gene" or "protein"; it can also be a verb or verb-object phrase such as "protein expression" or "gene modification", or a short phrase containing a few prepositions such as "purify the protein", provided that it expresses a relatively complete meaning without amounting to a full sentence. Nouns are the preferred entities, and Chinese and/or English are the preferred languages.
The relation table contains the lateral associations between entities in the entity table, including but not limited to therapeutic pathway associations, gene expression mode associations, chemical modification associations, protein activation pathways, protein inhibition pathways, and protein auxiliary associations.
The attribute table contains the inherent, superordinate, and subordinate attributes of each entity in the entity table, including but not limited to biological classification, text description and definition, genomic sequence, proteomic sequence, microbiological classification, 16S rRNA sequence, and 168 rRNA sequence.
An entity here means the actual content in the entity, relation, and attribute tables rather than an exact string match. On the basis of ordinary reading, an entity covers translations of that content into other languages, synonyms and near-synonyms, common aliases, generic expressions, and reorderings that do not change the meaning.
Specifically, the source databases can be BioCyc, EcoCyc, MetaCyc, BacDive, BV-BRC, EMBL-EBI, NCBI-Taxonomy, MicrobeWiki, and the like. The number of gut microbes whose knowledge is retained is not limited; if a limit is needed, it can be knowledge on 100-500, 200-400, 200-300, 200, or 300 gut microbes. As an example only, the eight databases can first each be cleaned and deduplicated, retaining knowledge on 200 gut microbes. After this separate processing, the five databases BioCyc, EcoCyc, MetaCyc, BacDive, and BV-BRC are merged and deduplicated, and the entity table and relation table are extracted from them; likewise, the three databases EMBL-EBI, NCBI-Taxonomy, and MicrobeWiki are merged and deduplicated, and the attribute table is extracted from them. Finally, the entity table is used to uniformly match the entities in the relation table and attribute table, forming the complete "Gut Microbiome Knowledge Base" (see Figure 2, where "Zhuowei Knowledge Base" is the specific name of this knowledge base).
Because the "Gut Microbiome Knowledge Base" has been reviewed by domain experts, its confidence score is set to 1.0.
2. Construction of the "Gut Microbiota Small Molecule Drug Therapy Association Knowledge Base" (microbe-human knowledge base)
This includes searching the relevant databases and annotating the literature to obtain a small dataset, which is divided into a training set and a validation set. A non-generative pre-trained language model is trained and evaluated on them to obtain a three-class classifier, a function gives the probabilities of the three relations, and confidence scores are obtained by calculation.
Specifically, databases linking gut microbes with small-molecule drug treatment of human diseases are searched and the associated literature is downloaded. The literature is annotated and reviewed using an annotation tool platform, AI techniques, and manual annotation (preferably, the annotation tool platform is used for annotation, AI techniques annotate automatically, and the annotation results are reviewed manually). After annotation, a small dataset is obtained and divided into a training set and a validation set; a non-generative pre-trained language model is trained and evaluated to obtain a three-class classifier; a function gives the probabilities of the three relations; confidence scores are obtained by calculation; and the "Gut Microbiota Small Molecule Drug Therapy Association Knowledge Base" is finally obtained.
Annotation here means extracting associations between gut microbe entities and small-molecule or disease entities, based on the entities already annotated by the annotation tool platform. Specifically, whether a set of entities is a candidate for relation extraction can be judged from where they co-occur in the article, the gap between them, and how often they appear. Co-occurrence can mean appearing in the same paragraph (delimited by line breaks), the same sentence (delimited by periods), or the same section type or field (such as abstract, title, keywords, background, methods, or discussion). The gap can be a specific number of characters, for example 1-10, 2-6, or 2-8 characters, or more, and the entities may be required to appear in a specific order or in any order. Frequency is the number of times an entity appears in the article, for example once, twice, or three times; an entity that appears only once can, for example, be judged to have low relevance. Frequency can be counted over the full text, within a specific section type, or in combination, for example once in the abstract and more than five times in the full text. One of these criteria can be used for annotation, or several can be combined.
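The co-occurrence and frequency criteria can be sketched as follows. This is a minimal illustration, not the patent's procedure: the mention format `(name, type, paragraph_index)`, the type labels, and the `min_count` threshold are all assumptions, and the character-gap criterion is omitted.

```python
from collections import Counter
from itertools import product


def candidate_pairs(mentions, min_count=2):
    """Return (microbe, other) pairs that co-occur in at least one paragraph.

    mentions: list of (name, type, paragraph_index) tuples, where type is
    "microbe", "disease", or "small_molecule" (hypothetical labels).
    Both entities must appear at least min_count times in the article.
    """
    counts = Counter(name for name, _, _ in mentions)
    by_paragraph = {}
    for name, etype, par in mentions:
        by_paragraph.setdefault(par, []).append((name, etype))

    pairs = set()
    for items in by_paragraph.values():
        microbes = {n for n, t in items if t == "microbe"}
        others = {n for n, t in items if t in ("disease", "small_molecule")}
        for m, o in product(microbes, others):
            if counts[m] >= min_count and counts[o] >= min_count:
                pairs.add((m, o))
    return pairs
```

Paragraph-level co-occurrence is used here only because it is the simplest of the listed options; sentence-level or field-level grouping would change just the index stored in each mention.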
The overall confidence score is defined as follows (the formula itself is not reproduced in this text).
The final layer of the relation extraction model is a three-class classifier. A SoftMax function gives the probability that the paragraph belongs to each of the three relations, and this probability can be used directly as the confidence score of a non-Unrelated triple. Assume a triple has n source articles in total, and let y_i, j_i, and s_i denote the publication year, the journal impact factor, and the confidence score given by the relation extraction model for the i-th source article. The overall confidence score of the triple is computed from these quantities using k_n, k_yi, and k_ji, which are piecewise-linear weighting coefficients based on n, y_i, and j_i, respectively. The weighting coefficients are monotonically increasing and all less than 1, and the breakpoints of the linear pieces must be set manually from experience. Finally, triples predicted as Unrelated are deleted; the remaining non-Unrelated triples are those predicted to be associated, and of these the triples whose confidence score is at least a given value (for example, 0.5) are retained, yielding quadruples of the form (head entity, relation, tail entity, confidence score).
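A piecewise-linear weighting coefficient of the kind described for k_n, k_yi, and k_ji can be sketched as below. The breakpoints in `K_N_POINTS` are hypothetical values chosen only to satisfy the stated properties (monotonically increasing, always below 1); the text leaves them to manual tuning. The formula that combines the coefficients with s_i is not available here, so it is deliberately not implemented.

```python
def piecewise_linear(x, points):
    """Interpolate linearly through sorted (x, y) breakpoints.

    Values outside the breakpoint range are clamped to the first or
    last y value, which keeps the coefficient bounded.
    """
    if x <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return points[-1][1]


# Hypothetical breakpoints for a coefficient based on the number of
# source articles n: more independent sources give a larger weight.
K_N_POINTS = [(1, 0.5), (5, 0.8), (20, 0.95)]
```

The same helper could serve for the year-based and impact-factor-based coefficients by supplying a different breakpoint list for each.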
For example, an association between a gut microbe and a disease can be defined as activate or inhibit. A non-Unrelated triple might be (some microbe, activate, some disease), and the quadruple obtained after further filtering might be (some microbe, activate, some disease, confidence score 0.8).
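The step from classifier output to a filtered quadruple can be sketched as follows. As a simplification, the threshold is applied to the SoftMax probability of a single paragraph rather than to the aggregated overall confidence score; the function names and the logit inputs are assumptions for illustration.

```python
import math

LABELS = ("activate", "inhibit", "Unrelated")


def softmax(logits):
    """Numerically stable SoftMax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_quadruple(head, tail, logits, threshold=0.5):
    """Return (head, relation, tail, confidence), or None if filtered out.

    A prediction is dropped when the most probable class is Unrelated
    or when its probability is below the threshold.
    """
    probs = softmax(logits)
    best = max(range(len(LABELS)), key=probs.__getitem__)
    if LABELS[best] == "Unrelated" or probs[best] < threshold:
        return None
    return (head, LABELS[best], tail, probs[best])
```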
The "databases linking gut microbes with small-molecule drug treatment of human diseases" can be the raw PubTator3 XML files containing PubMed abstracts and PMC full texts.
The annotation tool platform can be PubTator3, developed by NCBI, which annotates the entities and some of the relations in an article. The annotated entities include Chemical (small molecules), Disease, Gene, Species, and Variant, together with some relations, but the annotated relations do not include the knowledge in the "Gut Microbiome Knowledge Base". PubTator3 is only one example; other existing tool platforms, such as GNormPlus, MetaMap, and BERN2, can also be used, and one or more can be selected.
Specifically, the training set and validation set are divided in a specific ratio, for example 6:2, 7:1, 8:1, 9:2, 9:1, or 8:2.
The term "training set" refers to the labeled subset of data used to optimize model parameters during training of a machine learning model. The term "validation set" refers to an independent subset of data used to evaluate model performance, so that hyperparameters can be tuned or the best model selected to prevent overfitting; its share of the data is determined by the split ratio with the training set.
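An 8:2 split of the labeled dataset can be sketched as below. The function name, the shuffling step, and the fixed seed are assumptions; the text specifies only the ratio.

```python
import random


def split_dataset(examples, train_fraction=0.8, seed=0):
    """Shuffle the examples and split them into (train, validation) lists."""
    items = list(examples)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(items)
    cut = round(len(items) * train_fraction)
    return items[:cut], items[cut:]
```

Other ratios from the list above correspond to other values of `train_fraction`, for example 0.9 for 9:1.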
The construction of the "Gut Microbiota Small Molecule Drug Therapy Association Knowledge Base" is shown in Figure 3.
3. Construction of the "Clinical Medical Database" (human-human knowledge base)
The clinical medical database draws its data from authoritative molecular biology and clinical medicine databases, such as the PMapp database, and provides knowledge on disease diagnosis, medication guidelines, disease prevention, and the like, supporting the gut microbiome knowledge graph (see Figure 1, where GMKG-200 is the "gut microbiome knowledge graph").
Furthermore, the database integrates multiple authoritative molecular biology and clinical medicine databases. Given their authority and the fact that they have been reviewed by domain experts, the confidence score is defined as 1.0. The number of databases is not limited; it can be, for example, 50-100 or 55-90, such as any number from 60 to 72.
4. Construction of the gut microbiome knowledge graph
The three knowledge bases, namely the "Gut Microbiome Knowledge Base", the "Gut Microbiota Small Molecule Drug Therapy Association Knowledge Base", and the "Clinical Medical Database", are combined and connected by aligning their small-molecule, drug, and disease entities, yielding the final gut microbiome knowledge graph.
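Entity alignment across the three knowledge bases can be sketched as a lookup from names to shared identifiers followed by a union of quadruples. The alias table, the identifier format, and the rule of keeping the highest confidence when the same triple appears twice are all assumptions made for illustration.

```python
def align(quads, alias_to_id):
    """Rewrite head and tail names to canonical IDs; unknown names pass through."""
    return [
        (alias_to_id.get(h, h), r, alias_to_id.get(t, t), c)
        for h, r, t, c in quads
    ]


def build_graph(*knowledge_bases, alias_to_id):
    """Merge aligned quadruples into {(head, relation, tail): confidence}.

    When the same triple occurs in several knowledge bases, the highest
    confidence is kept (an illustrative choice, not taken from the text).
    """
    graph = {}
    for quads in knowledge_bases:
        for h, r, t, c in align(quads, alias_to_id):
            key = (h, r, t)
            graph[key] = max(c, graph.get(key, 0.0))
    return graph
```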
The second aspect of this invention relates to a knowledge graph multimodal uncertainty reasoning system.
The knowledge graph multimodal uncertainty reasoning system uses the "gut microbiome knowledge graph" to predict potentially associated diseases, drugs, genes, and the like for gut microbes.
First, the head entity, tail entity, and relation each have a randomly initialized embedding vector, and the head and tail entities each also have an embedding vector based on multimodal annotations, obtained for example by vector normalization based on an expression matrix;
Second, the dynamic gating network of a mixture-of-experts system fuses the two embedding vectors of each entity into its final vector representation;
Finally, the final vector representations of a triple are fed into a base knowledge graph embedding model and a transformation function to generate a predicted confidence score, and the loss function is defined by comparing it with the true confidence score.
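The last step can be sketched in the style of uncertain knowledge graph embedding. The text does not name the base embedding model, the transformation function, or the loss, so DistMult scoring, a logistic transformation with parameters `w` and `b`, and mean squared error are used here purely as assumed stand-ins.

```python
import math


def distmult_score(h, r, t):
    """DistMult plausibility score: sum of element-wise products (assumed base model)."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def predicted_confidence(h, r, t, w=1.0, b=0.0):
    """Map the raw score into (0, 1) with a logistic transformation (assumed)."""
    return 1.0 / (1.0 + math.exp(-(w * distmult_score(h, r, t) + b)))


def squared_error_loss(quads, entity_vecs, relation_vecs):
    """Mean squared error between predicted and true confidence scores.

    quads: list of (head, relation, tail, true_confidence) tuples.
    """
    errors = [
        (predicted_confidence(entity_vecs[h], relation_vecs[r], entity_vecs[t]) - c) ** 2
        for h, r, t, c in quads
    ]
    return sum(errors) / len(errors)
```

In training, the vectors and the transformation parameters would be updated to reduce this loss; the optimizer is omitted here.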
The term "mixture-of-experts system" refers to a machine learning architecture that integrates multi-source data through a dynamic gating network. Multiple expert networks (sub-models) are deployed in parallel, each handling a specific type of feature (such as the randomly initialized vectors or the multimodal annotation vectors); a trainable gating mechanism learns how to weight the expert networks; and their outputs are fused into a unified vector representation of each entity. In this invention, the module is used specifically to coordinate the fusion of the initial embedding vectors with the multimodal feature vectors, so as to optimize the vector representation of entities and relations in the knowledge graph.
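A gated fusion of the two entity vectors can be sketched as below. The exact gating formula is not available in this text, so a scalar sigmoid gate computed from a relation-based weight vector is assumed; the function names are hypothetical. An entity that is absent from the graph has no structural vector, so its representation falls back entirely to the multimodal vector, which is what allows cold-start inference.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(structural, multimodal, relation_weight):
    """Blend a structural embedding with a multimodal embedding.

    structural: randomly initialized, trained embedding, or None for an
        entity that does not exist in the graph (cold start).
    multimodal: embedding derived from the entity's annotations.
    relation_weight: relation-specific weight vector driving the gate.
    """
    if structural is None:
        # Cold start: rely on the multimodal annotations alone.
        return list(multimodal)
    gate = sigmoid(sum(w * s for w, s in zip(relation_weight, structural)))
    return [gate * s + (1.0 - gate) * m for s, m in zip(structural, multimodal)]
```

Making the gate depend on the relation lets the same entity lean more on its annotations for some relation types than for others.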
Specifically, the randomly initialized vectors are those of the head entity, the relation, and the tail entity, and the multimodal embedding vectors are those of the head entity and the tail entity. Multimodal feature vectors are extracted differently for different entity types: the processing may be vector normalization of gut microbe entities based on a gene expression matrix, average pooling of multiple category vectors for entities based on their ontology categories, or encoding with encoding software.
For example, gene entities are average-pooled over multiple category vectors based on Gene Ontology categories; disease entities are encoded from their text descriptions with PubMedBERT or BioLinkBERT; protein entities are encoded from their amino acid sequences with ProteinBERT; and small molecule and drug entities are encoded from their SMILES strings with ChemBERTa.
The term "average pooling" refers to a feature dimensionality reduction method, namely the arithmetic averaging of the multi-category feature vectors of a structured classification system such as Gene Ontology (GO): the feature vectors (for example, GO term vectors) of the ontology categories to which a gene entity belongs are summed element-wise and divided by the number of categories, yielding a single representative feature vector. In the present invention, this method is used to fuse the multi-dimensional ontology annotations of a gene entity into a normalized gene representation vector that retains the statistical features of each classification dimension, removing redundant information while preserving the integrity of the ontology semantics.
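The averaging described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `average_pool` and the example vectors are hypothetical, and real GO term vectors would come from whatever ontology embedding is used.

```python
def average_pool(vectors):
    """Element-wise arithmetic mean of equally sized feature vectors."""
    if not vectors:
        raise ValueError("at least one category vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all category vectors must share one dimension")
    count = len(vectors)
    # Sum each coordinate over all categories, then divide by the category count.
    return [sum(v[i] for v in vectors) / count for i in range(dim)]


# Hypothetical GO term vectors for one gene (values are illustrative only).
go_term_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]
gene_vector = average_pool(go_term_vectors)
```

The result has the same dimension as each input vector regardless of how many ontology categories the gene belongs to.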
Specifically, taking the head entity as an example, its final vector is a gated combination of its randomly initialized embedding and its multimodal embedding, where the gate is computed with a relation-based weight vector that is also randomly initialized. The tail entity vector is computed in the same way. For an entity that does not exist in the "Gut Microbiome Knowledge Graph" (GMKG-200), it suffices to set the randomly initialized head or tail embedding to the zero vector; the corresponding gate value then equals zero, and the final head or tail vector is generated entirely from the multimodal embedding. Therefore, as long as multimodal annotations have been extracted for an entity, the model can perform cold-start inference on it.
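The gating behaviour described above, including the cold-start case, can be illustrated as follows. This is a sketch under stated assumptions: the exact gate formula is not recoverable from the text, so the scalar gate `abs(tanh(w_r · e_s))` is a hypothetical choice made only because it vanishes when the randomly initialized (structural) embedding is the zero vector. The names `gate` and `fuse` are likewise illustrative.

```python
import math


def gate(structural, relation_weight):
    """Scalar gate in [0, 1); exactly 0 when the structural embedding is all zeros.

    Assumed form: |tanh(w_r . e_s)|. The source does not give the gate formula.
    """
    dot = sum(s * w for s, w in zip(structural, relation_weight))
    return abs(math.tanh(dot))


def fuse(structural, multimodal, relation_weight):
    """Gated combination of the structural and multimodal embeddings."""
    alpha = gate(structural, relation_weight)
    return [alpha * s + (1.0 - alpha) * m for s, m in zip(structural, multimodal)]


# Entity already in the graph: both embeddings contribute.
known = fuse([0.5, -0.2], [0.1, 0.9], [1.0, 1.0])

# Cold start: zero structural embedding, so the result is the multimodal vector.
cold = fuse([0.0, 0.0], [0.1, 0.9], [1.0, 1.0])
```

With a zero structural embedding the gate is zero, so `cold` is determined entirely by the multimodal embedding, which is the property the text relies on for cold-start inference.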
Specifically, the final vector representations of the triple (head entity, relation, tail entity) are fed into the base knowledge graph embedding model and the transformation function to generate the predicted confidence score, which is compared with the true confidence score to define the loss function.
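One way to read this step is sketched below. The logistic transformation and the mean squared error are assumptions chosen for illustration, since the text names neither the transformation function nor the form of the loss. The function names and the `weight` and `bias` parameters are hypothetical.

```python
import math


def logistic(score, weight=1.0, bias=0.0):
    """Assumed transformation function: maps a plausibility score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(weight * score + bias)))


def mse_loss(plausibility_scores, true_confidences):
    """Mean squared error between predicted and true confidence scores (assumed loss)."""
    predicted = [logistic(s) for s in plausibility_scores]
    errors = [(p - c) ** 2 for p, c in zip(predicted, true_confidences)]
    return sum(errors) / len(errors)


# Two triples with plausibility 0.0 (predicted confidence 0.5) against true scores 0.5 and 1.0.
loss = mse_loss([0.0, 0.0], [0.5, 1.0])
```

Because the target is a confidence score rather than a binary label, a regression-style loss of this kind lets curated triples (confidence 1.0) and extracted triples (confidence below 1.0) be trained together.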
For example, TransE measures the distance after translation in real vector space, $\lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert$, and RotatE measures the distance after rotation in complex space, $\lVert \mathbf{h} \circ \mathbf{r} - \mathbf{t} \rVert$, where $\mathbf{h}$, $\mathbf{r}$, and $\mathbf{t}$ are the final head, relation, and tail vectors, $\circ$ denotes the element-wise product of vectors, and each complex vector is represented as the concatenation of its real and imaginary parts.
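The two base models can be sketched in plain Python using their standard published definitions. The function names are illustrative, and the RotatE relation is passed as phase angles so that every relation component has unit modulus, as RotatE requires.

```python
import cmath
import math


def transe_distance(head, relation, tail):
    """TransE: Euclidean distance between (head + relation) and tail."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def rotate_distance(head, phases, tail):
    """RotatE: distance between the rotated head and the tail in complex space.

    head and tail are lists of complex numbers; phases are rotation angles in radians.
    """
    total = 0.0
    for h, phase, t in zip(head, phases, tail):
        diff = h * cmath.exp(1j * phase) - t
        # Norm taken over the concatenated real and imaginary parts.
        total += diff.real ** 2 + diff.imag ** 2
    return math.sqrt(total)


exact_translation = transe_distance([1.0, 2.0], [0.5, -1.0], [1.5, 1.0])
quarter_turn = rotate_distance([1 + 0j], [math.pi / 2], [1j])
```

In both cases a smaller distance means a more plausible triple; the distance (or its negative) is what the transformation function converts into a predicted confidence score.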
In the "Gut Microbiome Knowledge Graph" (GMKG-200), gut microbes are directly connected to diseases and drugs, so predicting new potential associations amounts to a simple knowledge graph completion task. Gut microbes are not directly connected to human genes, however; they can only be linked indirectly, over multiple steps, through diseases, drugs, and small molecules, which conventional knowledge graph link prediction cannot handle. Because confidence scores put existing relations and newly inferred relations on the same footing, it suffices to compute the probability expectation over them to estimate the confidence score of a long-chain relation and rank the associations. (See Figure 5.)
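A possible reading of the long-chain step is sketched below. The combination rules are assumptions, not the method stated in the text: each path's confidence is taken as the product of its edge confidences (treating them as independent probabilities), and several paths to the same gene are merged with a noisy-OR. The gene names and scores are illustrative.

```python
from math import prod


def path_confidence(edge_confidences):
    """Confidence of one relation path: product of its edge confidences (assumption)."""
    return prod(edge_confidences)


def combine_paths(paths):
    """Noisy-OR over alternative paths to the same target (assumption)."""
    miss = 1.0
    for path in paths:
        miss *= 1.0 - path_confidence(path)
    return 1.0 - miss


# Hypothetical microbe-to-gene paths; each inner list holds the edge confidences of one path,
# mixing curated edges (1.0) with extracted or predicted edges (below 1.0).
candidates = {
    "gene_A": [[1.0, 0.5], [0.5, 0.5]],
    "gene_B": [[0.5, 0.5]],
}
ranking = sorted(candidates, key=lambda g: combine_paths(candidates[g]), reverse=True)
```

Existing and predicted edges enter the same computation, which is the homogenization the paragraph describes; the retained paths also serve as an explanation of each ranked association.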
The third aspect of the present invention relates to a gut microbiome knowledge graph system.
The gut microbiome knowledge graph system comprises a "gut microbiome knowledge base", a "gut microbiome small-molecule drug therapy association knowledge base", and a "clinical medical database";
Further, the "gut microbiome knowledge base" contains microbe-microbe knowledge associations extracted and processed from the original databases, and the "gut microbiome small-molecule drug therapy association knowledge base" contains information, extracted and processed from the original databases, on gut microbes and the small-molecule drug treatment of human diseases.
Further, the "gut microbiome knowledge base" is constructed by the method for obtaining the raw data of the "gut microbiome knowledge base" described in the first aspect of the present invention, the "gut microbiome small-molecule drug therapy association knowledge base" is constructed by the construction method for that knowledge base described in the first aspect of the present invention, and the "clinical medical database" is constructed by the "construction of the clinical medical database" method described in the first aspect of the present invention.
Further, the gut microbiome knowledge graph system also comprises a knowledge graph multimodal uncertainty reasoning system, used to predict potentially associated diseases, drugs, genes, and the like for gut microbes.
Further, the gut microbiome knowledge graph system also comprises a graph database, which supports querying, provides visualized query results, and enables open-source sharing of the data.
The graph database may be an open-source graph database, a commercial or cloud graph database, or an RDF/semantic-web graph database. All of these support visual querying and open-source data sharing, and the choice depends on factors such as data scale, query complexity, and deployment environment. Examples of open-source graph databases include ArangoDB, Neo4j, and JanusGraph; examples of commercial or cloud graph databases include Amazon Neptune and Microsoft Azure Cosmos DB; and examples of RDF/semantic-web graph databases include Virtuoso and Stardog.
The fourth aspect of the present invention relates to an apparatus for constructing a knowledge graph system.
The apparatus for constructing a knowledge graph system comprises:
a data acquisition module, configured to acquire raw data related to gut microbes, specifically the raw database information needed to obtain the "gut microbiome knowledge base", the "gut microbiome small-molecule drug therapy association knowledge base", and the "clinical medical database";
a data processing module, configured to clean and de-duplicate the data, retain the knowledge relevant to the target, and extract an entity table, a relation table, and an attribute table; or to annotate the acquired data using an annotation tool platform, AI technology, and manual annotation;
a data training module, configured to split the annotated data into a training set and a validation set, train and evaluate a non-generative pre-trained language model to obtain a three-class classification model, obtain the probabilities of the three relations through a function, and compute the confidence score from them;
a data matching module, configured to use the entity table to uniformly match the entities in the relation table and the attribute table, and to obtain the matched data after processing the data with the confidence scores.
Further, the apparatus also comprises a data reasoning module, configured to import the acquired data into the knowledge graph multimodal uncertainty reasoning system to predict potentially associated diseases, drugs, genes, and the like for gut microbes.
Specifically, it comprises a vector normalization module: the head entity, the tail entity, and the relation each have a randomly initialized embedding vector, and each also has an embedding vector based on multimodal annotations, and this module performs vector normalization based on an expression matrix;
Specifically, it also comprises a mixture-of-experts module, configured to fuse the two embedding vectors of each entity into its final vector representation using the dynamic gating network of a mixture-of-experts system;
Specifically, it also comprises a comparison module, configured to feed the final vector representations of the triple into the base knowledge graph embedding model and the transformation function to generate the predicted confidence score, and to compare it with the true confidence score to define the loss function.
Further, the apparatus for constructing a knowledge graph system may also comprise a display module, which supports querying, provides visualized query results, and enables open-source sharing of the data.
The fifth aspect of the present invention relates to a computer program and a storage medium.
A computer program for implementing the gut microbiome knowledge graph construction procedure designed in the first to fourth aspects of the present invention;
A computer storage medium capable of storing the gut microbiome knowledge graph construction program designed in the first to fourth aspects of the present invention, as well as the constructed gut microbiome knowledge graph.
In summary, the present invention constructs a gut microbiome knowledge graph (GMKG) with broad knowledge coverage (multi-topic), high knowledge completeness (multi-source), rich knowledge modalities (multimodal), and confidence scores. Based on a knowledge graph embedding model, it also establishes a knowledge graph multimodal uncertainty reasoning system that uses multimodal annotations and confidence scores to achieve cold-start and long-chain reasoning, extending the prediction range and improving the interpretability of conventional reasoning models. The system helps gut microbiome researchers summarize existing authoritative and cutting-edge knowledge (for example, through a Neo4j graph database for convenient querying and searching). It can also serve as a large-scale screening tool for identifying new research directions for subsequent in-depth experimental validation (by publishing confidence scores and ranked lists of newly inferred knowledge). In clinical decision-making, it is further intended to support more accurate and faster prevention and treatment of various diseases (by publishing the trained entity and relation embedding vectors as pre-trained vectors for downstream clinical tasks).
Beneficial effects
The present invention addresses two major difficulties in current biomedical reasoning: cold start and long-chain reasoning. By extracting multimodal annotations of biomedical entities and using a mixture-of-experts system to compensate for the knowledge missing from triples, it achieves cold-start reasoning and strengthens prediction in few-shot and zero-shot settings. By using confidence scores to compute a comprehensive confidence score for each relation path, it achieves long-chain reasoning; the resulting relation paths improve the interpretability of biomedical knowledge and play a role similar to that of pathways. The invention thereby overcomes four main shortcomings: narrow knowledge coverage, low knowledge completeness, fixed knowledge structure, and simplistic knowledge reasoning.
Figure 1 is an overview of the gut microbiome knowledge graph system.
Figure 2 shows the data processing for the "gut microbiome knowledge base".
Figure 3 shows the establishment of the "gut microbiome small-molecule drug therapy association knowledge base".
Figure 4 shows the construction of a unified knowledge base of small molecule, drug, and disease entities.
Figure 5 shows the knowledge graph multimodal uncertainty reasoning system.
For 200 gut microbial species that are common in the human body and routinely tested for, a gut microbiome knowledge graph is constructed and a multimodal uncertainty reasoning system is built and operated.
(1) Construction of the gut microbiome knowledge base
Eight public databases covering biological taxonomy, molecular pathways, and multi-omics are integrated to build a large, comprehensive gut microbiome knowledge base, named the "Zhuowei Knowledge Base". The Zhuowei Knowledge Base downloads in full and integrates eight original databases: BioCyc, EcoCyc, MetaCyc, BacDive, BV-BRC, EMBL-EBI, NCBI-Taxonomy, and MicrobeWiki. First, each of the eight databases is cleaned and de-duplicated separately, and only the knowledge related to the 200 gut microbial species is retained. Next, the five databases BioCyc, EcoCyc, MetaCyc, BacDive, and BV-BRC are merged and de-duplicated, and an entity table and a relation table are extracted from them; in parallel, the three databases EMBL-EBI, NCBI-Taxonomy, and MicrobeWiki are merged and de-duplicated, and an attribute table is extracted from them. Finally, the entity table is used to uniformly match the entities in the relation table and the attribute table, forming the complete gut microbiome knowledge base (shown in Figure 1 as the Zhuowei Knowledge Base, as an example).
(2) Construction of the "gut microbiome small-molecule drug therapy association knowledge base"
Existing databases contain little knowledge about interactions among gut microbes or about their associations with small molecules and diseases. To supplement these relations, the raw PubTator3 XML files, which cover PubMed abstracts and PMC full texts, are downloaded in full. PubTator3 is an annotation platform officially developed by NCBI. It annotates the Chemical, Disease, Gene, Species, and Variant entities in articles, together with some relations, but the annotated relations do not involve Species entities, so they contain no knowledge about gut microbes.
First, based on the entities already annotated by PubTator3, if a gut microbe entity appears in the same paragraph as another gut microbe entity, a small molecule entity, or a disease entity, the two are treated as a candidate entity pair for relation extraction. Then, 2000 paragraphs are randomly selected from each of the three qualifying categories for relation annotation, and each category is annotated with three relations: besides Unrelated, the relation between two gut microbes is Symbiotic or Exclusion, the relation between a gut microbe and a small molecule is Secrete or Consume, and the relation between a gut microbe and a disease is Activate or Inhibit.
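The co-occurrence rule for candidate pairs can be sketched as follows. The input format, the type labels, and the function name are assumptions made for illustration; in practice the entities would be read from the PubTator3 annotations and mapped to the 200 target species.

```python
from itertools import combinations


def candidate_pairs(paragraph_entities):
    """Return candidate (microbe, partner, partner_type) pairs for one paragraph.

    paragraph_entities is a list of (name, type) tuples, where type is one of
    "gut_microbe", "small_molecule", or "disease" (hypothetical labels).
    """
    microbes = [name for name, kind in paragraph_entities if kind == "gut_microbe"]
    # Microbe-microbe pairs: every unordered pair of distinct mentions.
    pairs = [(a, b, "gut_microbe") for a, b in combinations(microbes, 2)]
    # Microbe-small molecule and microbe-disease pairs.
    for microbe in microbes:
        for name, kind in paragraph_entities:
            if kind in ("small_molecule", "disease"):
                pairs.append((microbe, name, kind))
    return pairs


# Illustrative paragraph annotations.
paragraph = [
    ("Akkermansia muciniphila", "gut_microbe"),
    ("Bacteroides fragilis", "gut_microbe"),
    ("butyrate", "small_molecule"),
    ("colitis", "disease"),
]
pairs = candidate_pairs(paragraph)
```

Each candidate pair then goes to relation annotation, where it is labelled Unrelated or with one of the two relations defined for its partner type.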
GPT-4 is used for two rounds of automatic annotation, and paragraphs on which the two rounds disagree are reviewed manually to obtain the final annotation. For automatic annotation with GPT-4, a prompt is placed before the paragraph to guide text generation, and entity marker tokens are inserted immediately before and after each entity mention (@gm-1$, @/gm-1$, @gm-2$, and @/gm-2$ for gut microbe entities, @sm-1$ and @/sm-1$ for small molecules, and @di-1$ and @/di-1$ for diseases).
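The marker insertion and the two-round agreement check can be sketched as follows. Only the marker tokens come from the text; the helper names, the spacing around the markers, and the example sentence are assumptions, and no call to GPT-4 is shown.

```python
def span_of(text, mention, tag):
    """Locate the first occurrence of a mention and pair it with its marker tag."""
    start = text.index(mention)
    return (start, start + len(mention), tag)


def mark_entities(text, spans):
    """Wrap each (start, end, tag) span as '@tag$ mention @/tag$'."""
    # Process right to left so earlier offsets stay valid after each insertion.
    for start, end, tag in sorted(spans, reverse=True):
        text = f"{text[:start]}@{tag}$ {text[start:end]} @/{tag}$" + text[end:]
    return text


def needs_manual_review(first_round, second_round):
    """Indices of paragraphs whose two automatic labels disagree."""
    return [i for i, (a, b) in enumerate(zip(first_round, second_round)) if a != b]


# Illustrative sentence with a placeholder microbe name.
sentence = "Strain X consumes lactate."
marked = mark_entities(
    sentence,
    [span_of(sentence, "Strain X", "gm-1"), span_of(sentence, "lactate", "sm-1")],
)
disputed = needs_manual_review(
    ["Secrete", "Consume", "Unrelated"],
    ["Secrete", "Unrelated", "Unrelated"],
)
```

Only the paragraphs listed in `disputed` would be passed to a human reviewer; the rest keep the label on which both rounds agree.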
The annotated small dataset is then split into a training set and a validation set at a ratio of 8:2, and a non-generative pre-trained language model such as PubMedBERT or BioLinkBERT is fine-tuned and evaluated on it. The last layer of the relation extraction model is a three-class classifier: a softmax function gives the probability that the paragraph expresses each of the three relations, and this probability can be used directly as the confidence score of a non-Unrelated triple. Suppose a triple has $n$ literature sources in total, and let $y_i$, $j_i$, and $s_i$ denote the publication year, the journal impact factor, and the confidence score given by the relation extraction model for the $i$-th source article ($i = 1, \dots, n$). The comprehensive confidence score of the triple is defined from the scores $s_i$ and the coefficients $k_n$, $k_{y_i}$, and $k_{j_i}$, which are piecewise-linear weighting coefficients based on $n$, $y_i$, and $j_i$, respectively. These weighting coefficients are monotonically increasing and all less than 1, and the turning points of the linear segments must be set manually based on experience.
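The softmax step and the weighting coefficients can be sketched as follows. The text states only the properties of the coefficients (piecewise linear, monotonically increasing, below 1), so the combination rule in `triple_confidence`, the turning points, and the floor and cap values are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over the three relation logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def piecewise_linear(value, low, high, floor=0.5, cap=0.99):
    """Monotonically increasing weight below 1: floor up to `low`, cap from `high` on."""
    if value <= low:
        return floor
    if value >= high:
        return cap
    return floor + (cap - floor) * (value - low) / (high - low)


def triple_confidence(sources):
    """Assumed rule: k_n times the mean of k_y * k_j * s over all sources.

    sources is a list of (publication_year, impact_factor, model_score) tuples.
    """
    k_n = piecewise_linear(len(sources), 1, 10)
    weighted = [
        piecewise_linear(year, 2000, 2024) * piecewise_linear(impact, 1.0, 10.0) * score
        for year, impact, score in sources
    ]
    return k_n * sum(weighted) / len(weighted)


probabilities = softmax([2.0, 1.0, 0.1])
single_source = triple_confidence([(2024, 10.0, 0.9)])
```

Under this rule a triple supported by more sources, more recent articles, or higher-impact journals receives a higher score, while the result always stays below the raw model score and below 1.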
(3) Construction of the gut microbiome knowledge graph (GMKG-200)
A multimodal uncertainty gut microbiome knowledge graph, named "GMKG-200", is constructed from three bodies of knowledge: microbe-microbe, microbe-human, and human-human.
The microbe-microbe knowledge is provided by the Zhuowei Knowledge Base. It was extracted in structured form from eight authoritative databases and had already been manually reviewed by researchers, so its confidence scores are all defined directly as 1.0.
The human-human knowledge base, that is, the "clinical medical database", is provided by the PMapp database previously built by the inventors, which integrates 61 authoritative molecular biology and clinical medicine databases. Its confidence scores are likewise defined as 1.0, and it provides complete knowledge support for the gut microbiome knowledge graph (GMKG-200).
The microbe-human knowledge is extracted automatically from PubTator3 and therefore carries some extraction error. The comprehensive confidence score computed during relation extraction from all source information provides a good measure of this uncertainty. This part contains the three relation types defined during relation annotation: (gut microbe, Symbiotic or Exclusion, gut microbe), (gut microbe, Secrete or Consume, small molecule), and (gut microbe, Activate or Inhibit, disease). The knowledge extracted from PubTator3 connects the Zhuowei Knowledge Base with the PMapp knowledge base; once the small molecule, drug, and disease entities have been uniformly aligned, the gut microbiome knowledge graph (GMKG-200) is obtained.
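The assembly of the three knowledge sources can be sketched as follows. The alignment by a synonym dictionary, the lower-casing of names, the identifier `DIS:0001`, and all entity names are hypothetical; the text states only that curated triples have confidence 1.0 and that entities must be aligned.

```python
def align(name, synonyms):
    """Map an entity name to a shared identifier when one is known (assumed scheme)."""
    key = name.strip().lower()
    return synonyms.get(key, key)


def assemble_graph(curated, extracted, synonyms):
    """Merge curated triples (confidence 1.0) with extracted triples (scored)."""
    graph = {}
    for head, relation, tail in curated:
        graph[(align(head, synonyms), relation, align(tail, synonyms))] = 1.0
    for head, relation, tail, score in extracted:
        key = (align(head, synonyms), relation, align(tail, synonyms))
        # A curated triple keeps its confidence of 1.0 if it is also extracted.
        graph.setdefault(key, score)
    return graph


# Two spellings of one disease share a hypothetical identifier.
synonyms = {"ulcerative colitis": "DIS:0001", "colitis ulcerosa": "DIS:0001"}
graph = assemble_graph(
    curated=[("Drug Z", "treats", "Colitis ulcerosa")],
    extracted=[("Strain X", "Inhibit", "Ulcerative colitis", 0.8)],
    synonyms=synonyms,
)
```

After alignment the microbe-disease triple and the drug-disease triple share the same disease node, which is what allows multi-step paths between the microbe side and the human side of the graph.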
(4) Knowledge graph multimodal uncertainty reasoning system
A knowledge graph multimodal uncertainty reasoning system is built on the triples of the gut microbiome knowledge graph (GMKG-200) to predict potentially associated diseases, drugs, genes, and the like for gut microbes.
First, the head entity, the tail entity, and the relation each have a randomly initialized embedding vector. In addition, the head entity and the tail entity each have an embedding vector based on multimodal annotations, and multimodal feature vectors are extracted differently for different entity types. Gut microbe entities are vector-normalized based on a gene expression matrix; gene entities are average-pooled over multiple category vectors based on Gene Ontology categories; disease entities are encoded from their text descriptions with PubMedBERT or BioLinkBERT; protein entities are encoded from their amino acid sequences with ProteinBERT; and small molecule and drug entities are encoded from their SMILES strings with ChemBERTa.
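For the gut microbe case, the normalization step might look like the sketch below. The text does not say which normalization is applied, so row-wise L2 normalization is an assumption, as are the function name and the layout of the matrix (one row per microbe, one column per gene).

```python
import math


def l2_normalize_rows(matrix):
    """Scale every row to unit Euclidean length; all-zero rows are left unchanged."""
    normalized = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm for x in row] if norm > 0 else list(row))
    return normalized


# Hypothetical expression matrix: one row per microbe, one column per gene.
expression = [
    [3.0, 4.0],
    [0.0, 0.0],
]
microbe_vectors = l2_normalize_rows(expression)
```

Normalizing the rows puts the microbe vectors on a scale comparable with the embeddings of other entity types before they enter the gating network.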
Second, the dynamic gating network of a mixture-of-experts system fuses the two embedding vectors of each entity into its final vector representation. Taking the head entity as an example, its final vector is a gated combination of its randomly initialized embedding and its multimodal embedding, where the gate is computed with a relation-based weight vector that is also randomly initialized. The tail entity vector is computed in the same way. For an entity that does not exist in GMKG-200, it suffices to set the randomly initialized head or tail embedding to the zero vector; the corresponding gate value then equals zero, and the final head or tail vector is generated entirely from the multimodal embedding. Therefore, as long as multimodal annotations have been extracted for an entity, the model can perform cold-start inference on it.
Finally, the final vector representations of the triple are fed into the base knowledge graph embedding model and the transformation function to generate the predicted confidence score, which is compared with the true confidence score to define the loss function. For example, TransE measures the distance after translation in real vector space, $\lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert$, and RotatE measures the distance after rotation in complex space, $\lVert \mathbf{h} \circ \mathbf{r} - \mathbf{t} \rVert$, where $\mathbf{h}$, $\mathbf{r}$, and $\mathbf{t}$ are the final head, relation, and tail vectors, $\circ$ denotes the element-wise product of vectors, and each complex vector is represented as the concatenation of its real and imaginary parts.
(5) Graph database, which supports querying, provides visualized query results, and enables open-source sharing of the data.
Further, a Neo4j graph database may be used to implement these functions.
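Loading triples into Neo4j could start from a helper like the one below, which builds a parameterized Cypher statement for one triple. The node label `Entity`, the property names `name` and `confidence`, and the function name are assumptions, and no database connection is made. Because Cypher does not accept a relationship type as a query parameter, the type is checked against a whitelist and then interpolated into the statement.

```python
ALLOWED_RELATIONS = {"Symbiotic", "Exclusion", "Secrete", "Consume", "Activate", "Inhibit"}


def to_cypher(head, relation, tail, confidence):
    """Return a (statement, parameters) pair that upserts one weighted triple."""
    if relation not in ALLOWED_RELATIONS:
        raise ValueError(f"unsupported relation type: {relation}")
    statement = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[r:{relation.upper()}]->(t) "
        "SET r.confidence = $confidence"
    )
    return statement, {"head": head, "tail": tail, "confidence": confidence}


# Illustrative triple with placeholder entity names.
statement, parameters = to_cypher("Strain X", "Inhibit", "colitis", 0.8)
```

The statement and parameters can be passed to any Neo4j client; storing the confidence score on the relationship makes it available for filtering and ranking in queries.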
In the gut microbiome knowledge graph (GMKG-200), gut microbes are directly connected to diseases and drugs, so predicting new potential associations amounts to a simple knowledge graph completion task. Gut microbes are not directly connected to human genes, however; they can only be linked indirectly, over multiple steps, through diseases, drugs, and small molecules, which conventional knowledge graph link prediction cannot handle. Because confidence scores put existing relations and newly inferred relations on the same footing, it suffices to compute the probability expectation over them to estimate the confidence score of a long-chain relation and rank the associations.
Claims (12)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410601024.6A CN118506887B (en) | 2024-05-15 | 2024-05-15 | A knowledge graph system for intestinal microorganisms |
| CN202410601024.6 | 2024-05-15 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025237327A1 (en) | 2025-11-20 |
Family
ID=92230299
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2025/094816 Pending WO2025237327A1 (en) | 2024-05-15 | 2025-05-14 | Gut microbe knowledge graph system |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN118506887B (en) |
| WO (1) | WO2025237327A1 (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118506887B (en) * | 2024-05-15 | 2025-05-06 | 上海力山生物医药有限公司 | A knowledge graph system for intestinal microorganisms |
| CN119851767B (en) * | 2024-12-27 | 2025-10-14 | 汕头大学 | Medicine and microorganism association prediction method based on node class sensitivity knowledge graph learning and gating enhanced element path semantic fusion |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20220261668A1 (en) * | 2021-02-12 | 2022-08-18 | Tempus Labs, Inc. | Artificial intelligence engine for directed hypothesis generation and ranking |
| CN117076681A (en) * | 2023-07-13 | 2023-11-17 | 哈尔滨理工大学 | Medical knowledge graph construction method based on unified medical language system |
| CN117292846A (en) * | 2023-11-27 | 2023-12-26 | 神州医疗科技股份有限公司 | Construction method and device of intestinal microorganism knowledge graph |
| CN118506887A (en) * | 2024-05-15 | 2024-08-16 | 上海力山生物医药有限公司 | Intestinal microorganism knowledge graph system |
Family Cites Families (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR102244086B1 (en) * | 2019-12-05 | 2021-04-23 | 경기대학교 산학협력단 | System for visual commonsense reasoning using knowledge graph |
| CN112288091B (en) * | 2020-10-30 | 2023-03-21 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Knowledge inference method based on multi-mode knowledge graph |
| CN113871003B (en) * | 2021-12-01 | 2022-04-08 | 浙江大学 | Disease auxiliary differential diagnosis system based on causal medical knowledge graph |
| CN115588148A (en) * | 2022-08-29 | 2023-01-10 | 河海大学 | A multi-modal fusion video classification method and system based on brain-inspired feedback interaction |
| CN116303670A (en) * | 2023-02-21 | 2023-06-23 | 同济大学 | A human-computer interaction method and system for aero-engine health management |
| CN116226404A (en) * | 2023-03-13 | 2023-06-06 | 福建医科大学 | A knowledge map construction method and knowledge map system for the gut-brain axis |
| CN117131933B (en) * | 2023-08-31 | 2025-08-26 | 华中师范大学 | A method for establishing a multimodal knowledge graph and its application |
-
2024
- 2024-05-15 CN CN202410601024.6A patent/CN118506887B/en active Active
-
2025
- 2025-05-14 WO PCT/CN2025/094816 patent/WO2025237327A1/en active Pending
Also Published As
| Publication number | Publication date |
|---|---|
| CN118506887B (en) | 2025-05-06 |
| CN118506887A (en) | 2024-08-16 |