
WO2024026259A1 - Biomedical knowledge graph - Google Patents

Biomedical knowledge graph

Info

Publication number
WO2024026259A1
Authority
WO
WIPO (PCT)
Prior art keywords
biomedical
entities
knowledge graph
data
relationships
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2023/070822
Other languages
French (fr)
Inventor
Chaohui Guo
Vishakha Sharma
Antoaneta VLADIMIROVA
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
F Hoffmann La Roche AG
Roche Diagnostics GmbH
Roche Molecular Systems Inc
Original Assignee
F Hoffmann La Roche AG
Roche Diagnostics GmbH
Roche Molecular Systems Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by F Hoffmann La Roche AG, Roche Diagnostics GmbH, Roche Molecular Systems Inc
Publication of WO2024026259A1
Priority to US19/037,080 (US20250246318A1)

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/10 Ontologies; Annotations
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2452 Query translation
    • G06F16/24522 Translation of natural language queries to structured queries
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/248 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B45/00 ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/20 Heterogeneous data integration
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • Finding updated information often involves use of generalized search tools with key words or phrases, often yielding inconsistent results and non-relevant material (e.g., non-medical or outdated material). Some compilations of materials may not be updated with the most recent and accurate information. Updating them often relies on manual review or verification by skilled (or unskilled) personnel, making them potentially unreliable and/or outdated.
  • aspects of the disclosure provide systems and methods for generating and operating a biomedical knowledge graph from a plurality of disparate sources in a context-based graphical structure. Queries for biomedical information may be adapted and used to search the knowledge graphs in a highly efficient and targeted manner for obtaining biomedical information.
  • the sources of information may include periodicals, clinical trial results, biomedical compendiums, news articles, and other sources, including online sources.
  • a biomedical knowledge graph is generated by first accessing source material and extracting it (e.g., using a natural language processor (NLP)), based on which medical data entities and relationships between the entities are established.
  • Entities may include certain diseases, therapies, and tests, for example, and a relationship can be defined by an association between entities.
  • A relationship, for example, may include a test entity used to diagnose a particular disease entity or a therapy used to treat the disease. Multiple relationships between multiple entities may be extracted. In some embodiments, these entities and relationships are established and/or verified utilizing machine-learning analysis for parsing and extracting information from multiple sources and validating their relevance and accuracy based on the analysis.
  • predetermined identifiers or patterns in the data are identified such as through the use of an identifier index or named entity resolution (NER) module to determine entities and biomedical entity types.
  • Clusters of biomedical entity types are established (e.g., by machine learning) for particular types of biomedical entities to which they are assigned.
  • Context types for the biomedical entities are identified based on the assigned cluster and/or by analyzing aspects of the data from which the entities were extracted. Based on the identified context type and entity type of the biomedical data, an entry or record is added into a biomedical knowledge graph according to a predefined schema.
  • a context may include an identification of a biomarker associated with a disease, a gene sequence, and/or a hypothesis within a medical publication, for example.
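The step of adding a record according to a predefined schema can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `SCHEMA` contents, the `KnowledgeGraph` class, and the example record are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field

# Hypothetical schema: the allowed entity and context types are illustrative only.
SCHEMA = {
    "entity_types": {"disease", "gene", "therapy", "test"},
    "context_types": {"biomarker", "gene_sequence", "hypothesis"},
}


@dataclass
class KnowledgeGraph:
    records: list = field(default_factory=list)

    def add_record(self, name, entity_type, context_type, source):
        # Reject entries that do not conform to the predefined schema.
        if entity_type not in SCHEMA["entity_types"]:
            raise ValueError(f"unknown entity type: {entity_type}")
        if context_type not in SCHEMA["context_types"]:
            raise ValueError(f"unknown context type: {context_type}")
        record = {
            "name": name,
            "entity_type": entity_type,
            "context_type": context_type,
            "source": source,
        }
        self.records.append(record)
        return record


kg = KnowledgeGraph()
kg.add_record("gene_g", "gene", "biomarker", "example-publication")
```

Validating against the schema at insertion time keeps every record comparable, regardless of which source it was extracted from.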
  • Queries for biomedical information or answers from the biomedical knowledge graph are facilitated through a query engine configured to interpret structured or unstructured queries (e.g., natural language questions, standard query language-based queries).
  • the query is converted into a structured query form based on the predefined schema of the biomedical knowledge graph for rapid access and retrieval of relevant information. Conversion may include use of a natural language processor (e.g., named entity resolution) to correlate portions of a query with entities and entity types of the biomedical knowledge graph, along with the relationships inquired about.
  • Query results or answers can include a traversable graph of entities and their relationships pertaining to the query.
  • Results can also include tabulated results with corresponding information about each particular result and its elements (e.g., treatment for a particular disease and success rate).
  • Links or summaries of the sources of the information may be embedded or accessible with generated results. Reports including statistical analysis of the results may also be generated.
  • FIG. 1 is an illustrative diagram of processes for constructing a biomedical knowledge graph from disparate data sources according to some embodiments.
  • FIG. 2A is an illustrative diagram of a process for constructing a biomedical knowledge graph from a generalized knowledge graph according to some embodiments.
  • FIG. 2B is an exemplary list of biomedical entity types from a generalized knowledge graph.
  • FIG. 2C is an exemplary list of relationship types between entities extracted from a generalized knowledge graph according to some embodiments.
  • FIG. 3 is an illustrative diagram of a process for extracting biomedical data from a general data source according to some embodiments.
  • FIG. 4 is an illustrative diagram of a process for extracting biomedical data from clinical trial sources according to some embodiments.
  • FIG. 5 is a representative schema for a biomedical knowledge graph according to some embodiments.
  • FIG. 6 is a process for executing a query of a biomedical knowledge graph according to some embodiments.
  • FIG. 7A is a graphical user interface for building a query according to some embodiments.
  • FIG. 7B is executable query code generated from the query built from the interface of FIG. 7A.
  • FIG. 8A is query output from a biomedical knowledge graph in a navigable knowledge graph representation according to some embodiments.
  • FIG. 8B is a query output from a biomedical knowledge graph in a tabular representation according to some embodiments.
  • FIG. 9A is a trend output from a biomedical knowledge graph according to some embodiments.
  • FIG. 9B is a graphical representation of predicted responses for a particular disease or condition in response to particular therapeutic compounds according to some embodiments.
  • FIG. 10A is an exemplary process for training and utilizing a machine learning system for predicting alternative paths between therapeutic indicators using a biomedical knowledge graph according to some embodiments.
  • FIG. 10B is an exemplary output of paths between a sequence variant to drug found within a biomedical knowledge graph according to some embodiments.
  • FIG. 10C is an exemplary output of an alternative path based on the machine learning process of FIG. 10A.
  • FIG. 11 is an illustrative diagram of computing devices and processing components of a system for generating and utilizing biomedical knowledge graphs according to some embodiments.
  • aspects of the disclosure include methods and systems for generating and querying biomedical knowledge graphs constructed from numerous disparate data resources.
  • the resources from which the biomedical knowledge graphs are generated include existing generalized knowledge graphs and biomedical-specific resources such as, for example, clinical trial data and other published materials.
  • Some embodiments include systems and methods for extracting data from these resources and building/updating/augmenting biomedical knowledge graphs according to a uniform schema using methods adapted to the type and form of resource from which the data is extracted.
  • Some embodiments include methods for querying the biomedical knowledge graphs in order to obtain results to reflect optimally up-to-date, relevant, and accurate medical information.
  • FIG. 1 is an illustrative diagram of processes for constructing a biomedical knowledge graph from disparate data sources according to some embodiments.
  • a biomedical knowledge graph 170 is constructed by extracting from existing biomedical data that may include a knowledge graph 150 of generalized data (e.g., including but not limited to biomedical data), biomedical publications 130 such as journal articles, and clinical trials data 100A (e.g., summarized from a publication) and clinical trials relational data 100B (e.g., raw data records).
  • a biomedical knowledge graph represents nodes of biomedical entities connected by contextual relationships between the entities.
  • A node may represent a disease, condition, therapy, molecule, etc.
  • a relationship may be a treatment or therapy with a molecule for a disease or condition, for example.
  • These nodes and relationships may be represented according to a pre-defined schema or structure.
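One way to picture typed nodes connected by named relationships is a small adjacency structure. The sketch below is an assumption made for illustration; the `BiomedicalGraph` class, its methods, and the placeholder entities are hypothetical and are not taken from the disclosure.

```python
from collections import defaultdict


class BiomedicalGraph:
    """Toy graph: nodes carry an entity type, edges carry a relationship name."""

    def __init__(self):
        self.nodes = {}                  # node name -> entity type
        self.edges = defaultdict(list)   # node name -> [(relationship, target)]

    def add_node(self, name, entity_type):
        self.nodes[name] = entity_type

    def add_relationship(self, source, relationship, target):
        # Both endpoints must already exist as typed nodes.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[source].append((relationship, target))

    def neighbors(self, name, relationship=None):
        # Return targets reachable from `name`, optionally filtered by relationship.
        return [t for r, t in self.edges[name] if relationship in (None, r)]


graph = BiomedicalGraph()
graph.add_node("compound_x", "molecule")
graph.add_node("disease_y", "disease")
graph.add_relationship("compound_x", "treats", "disease_y")
```

Calling `graph.neighbors("compound_x", "treats")` then returns the diseases linked to the placeholder compound by that relationship.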
  • A generalized (or generic) knowledge graph 180 (e.g., which can include any graph-structured representation of data) is accessed and parsed at 185 to identify portions of the knowledge graph related to biomedical information.
  • the identified portions are represented as biomedical subset graphs 145.
  • A generalized knowledge graph may include Wikidata or DBpedia.
  • In order to extract biomedical data from a generalized knowledge graph 150, some data is identified by particular identifiers or predicates (e.g., alphanumeric IDs) already known or classified as representing biomedical information (e.g., particular diseases, therapeutic compounds, and/or types thereof). In some embodiments, data is identified as biomedical information utilizing a named entity resolution (NER) and/or machine learning component such as further described herein.
  • the biomedical subsets 185 are (re)structured utilizing a predefined schema and/or ontology consistent across the biomedical knowledge graph 150.
  • subsets 145 of a generalized knowledge graph form the basis of our biomedical knowledge graph 170, which may then be augmented using data extracted from other sources such as biomedical publications 130, clinical trial data 100A and 100B, and information/updates of generalized knowledge graphs such as further described herein.
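Selecting a biomedical subset by known predicates can be sketched as a simple filter over triples. The predicate names (`ex:treats` and so on) and the sample triples are hypothetical placeholders; a real generalized knowledge graph would use its own identifiers.

```python
# Hypothetical predicate identifiers assumed to mark biomedical relationships.
BIOMEDICAL_PREDICATES = {"ex:treats", "ex:diagnoses", "ex:has_symptom"}


def extract_biomedical_subset(triples, predicates=BIOMEDICAL_PREDICATES):
    """Keep only (subject, predicate, object) triples with a biomedical predicate."""
    return [t for t in triples if t[1] in predicates]


general_graph = [
    ("compound_x", "ex:treats", "disease_y"),
    ("city_a", "ex:capital_of", "country_b"),
    ("test_z", "ex:diagnoses", "disease_y"),
]
subset = extract_biomedical_subset(general_graph)
```

Triples whose predicates are not in the known set would, per the description, fall to the NER or machine learning component instead.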
  • The text/graphics of the publications or records 130 are parsed, such as by utilizing an NLP/NER, image processing module, and/or machine learning system. Triples (or other structured forms) of the extracted data are generated at 135.
  • entity resolution methods include NCBI Gene and Spark NLP.
  • Other examples of deriving biomedical relationships from text include Percha B, Altman RB. “A global network of biomedical relationships derived from text.” Bioinformatics. 2018 Aug 1;34(15):2614-2624.
  • the extracted/structured biomedical data is incorporated into a (subset) biomedical knowledge graph 140 consistent with the predefined schema/ontology of the biomedical knowledge graph 170.
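Generating triples from sentences can be illustrated with a toy pattern matcher. This is only a stand-in for the NLP/NER or machine learning systems named above: the two regular expressions, the relation labels, and the function name `extract_triples` are assumptions made for the example.

```python
import re

# Toy relation patterns; a real system would use a trained NLP/NER pipeline.
PATTERNS = [
    (re.compile(r"(?P<subj>[\w\- ]+?) is used to treat (?P<obj>[\w\- ]+)", re.I), "treats"),
    (re.compile(r"(?P<subj>[\w\- ]+?) is used to diagnose (?P<obj>[\w\- ]+)", re.I), "diagnoses"),
]


def extract_triples(text):
    """Return (subject, relation, object) triples found in each sentence."""
    triples = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for pattern, relation in PATTERNS:
            match = pattern.search(sentence)
            if match:
                triples.append(
                    (match["subj"].strip(), relation, match["obj"].strip())
                )
    return triples


sample = "Compound X is used to treat disease Y. Test Z is used to diagnose disease Y."
triples = extract_triples(sample)
```

Each resulting triple maps directly onto two nodes and one connecting relationship in the graph.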
  • Nodes and relationships of the biomedical knowledge graph may be stored within a computer database 155A and indexed within source index 155B.
  • Data for the knowledge graph 170 is obtained or enhanced from clinical trials and/or other trials, medical images, diagnostic test data, and/or other historical or analytical data sources 100A and 100B (e.g., FDA submissions).
  • clinical trials data may be in textual form in 100A (e.g., from a report or journal publication) and/or in a data format in 100B (e.g., a relational table of results). Once the data is extracted, it can be collated into normalized datasets 110.
  • the normalized data sets 110 are then translated/converted into normalized knowledge graphs 120.
  • The normalized knowledge graphs 120 are then used to augment biomedical knowledge graph 170. For example, the relationship and correlation between entities in the knowledge graph (e.g., a compound and treatment of a condition) may be established, reinforced, or discounted by the data.
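The flow from raw trial records to a normalized dataset and then to graph edges can be sketched as below. The field names (`therapy`, `condition`, `responded`), the `evaluated_for` relationship label, and the sample rows are hypothetical; the disclosure does not specify a record layout.

```python
def normalize_rows(rows):
    """Collate raw per-patient rows into one record per (therapy, condition)."""
    totals = {}
    for row in rows:
        # Normalize spelling variants so equivalent rows share one key.
        key = (row["therapy"].strip().lower(), row["condition"].strip().lower())
        responders, patients = totals.get(key, (0, 0))
        totals[key] = (responders + int(row["responded"]), patients + 1)
    return [
        {"therapy": t, "condition": c, "patients": n, "response_rate": r / n}
        for (t, c), (r, n) in totals.items()
    ]


def to_edges(dataset):
    """Convert normalized records into attributed graph edges."""
    return [
        (rec["therapy"], "evaluated_for", rec["condition"],
         {"patients": rec["patients"], "response_rate": rec["response_rate"]})
        for rec in dataset
    ]


raw_rows = [
    {"therapy": "Compound X ", "condition": "Disease Y", "responded": "1"},
    {"therapy": "compound x", "condition": "disease y", "responded": "1"},
    {"therapy": "compound x", "condition": "disease y", "responded": "0"},
    {"therapy": "compound x", "condition": "disease y", "responded": "1"},
]
edges = to_edges(normalize_rows(raw_rows))
```

The edge attributes are what would let later data reinforce or discount an existing relationship.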
  • these various data sources are periodically monitored (e.g., “scraped”) to determine if new/updated data is available for augmenting/updating biomedical knowledge graph 170. For example, data that is periodically identified in these data sources can be compared to the entities and relationships in the knowledge graph 170 to determine if an update is needed.
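Determining whether an update is needed amounts to comparing newly scraped data with what the graph already holds. A minimal sketch, assuming data is held as (subject, relationship, object) triples; the function name and sample data are hypothetical.

```python
def find_updates(graph_triples, scraped_triples):
    """Return scraped triples that are not yet present in the knowledge graph."""
    known = set(graph_triples)
    return [t for t in scraped_triples if t not in known]


current = [("compound_x", "treats", "disease_y")]
scraped = [
    ("compound_x", "treats", "disease_y"),
    ("compound_x", "treats", "disease_w"),
]
pending = find_updates(current, scraped)
```

An empty result indicates that the monitored source contributed nothing new in that cycle.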
  • information about the original source (e.g., website URL, citation) from which the data originated is stored with the biomedical knowledge graph and made accessible in results of queries performed on the knowledge graph that contain entities and relationships pertaining to the search results.
  • Once biomedical knowledge graph 170 has been generated, queries can be performed utilizing the knowledge graph to rapidly obtain requested biomedical information and/or to perform analysis (e.g., machine learning model generation as described in reference to FIGs. 10A-10C) for predicting biomedical relationships and/or reporting trends.
  • a query engine 175 may be utilized to perform queries such as received through an external interface (e.g., a GUI as further described in reference to FIGs. 7A and 7B).
  • Query engine 175 is configured to perform queries particularly adapted to biomedical knowledge graphs as described further herein.
  • FIG. 2A is an illustrative diagram of a process for constructing a biomedical knowledge graph from a generalized knowledge graph.
  • A generalized knowledge graph 200 may include information regarding a wide array of data types and sources. For example, Wikidata provides a knowledge graph with a large breadth of data pertaining to history, social topics, politics, and science, including biomedical information. Some portions of the generalized knowledge graph 210 may be identified as pertaining to biomedical information. Identification may be performed utilizing identifiers or predicates associated with the generalized knowledge graph that identify entities and relationships by type or context (e.g., particular disease, treatment, or a type of biomedical information).
  • FIG. 2B is an exemplary list of biomedical entity types from a generalized knowledge graph
  • FIG. 2C is an exemplary list of relationship types between entities extracted from a generalized knowledge graph according to some embodiments.
  • these types and/or relationships may be explicitly identified and represented as biomedical-related.
  • the types or contexts (and/or the data itself) of a generalized knowledge graph are analyzed (e.g., using NLP/NER) and classified as biomedical information (or not) based on the analysis.
  • the relationships include one or more of positively regulating or
  • the relationships include at least one of an action via a receptor, indirect alteration of an effect of an endogenous agonist, the inhibition of transport processes, enzyme inhibition, enzymatic action or activation of enzymatic activity, chelation, osmosis, and/or anesthesia.
  • Once portions of the generalized knowledge graph are identified as biomedical information, these portions are incorporated into a biomedical knowledge graph 220.
  • the incorporated portions may be used as a basis or foundation for a biomedical knowledge graph and/or used to augment an existing biomedical knowledge graph 220.
  • the schema of the biomedical knowledge graph 220 maintains the original schema of the generalized knowledge graph 200 or, in some embodiments, incorporated data is adapted/converted to another predetermined schema.
  • Information from the original generalized knowledge graph 200 may not have utilized a particular predicate or identifier for a biomedical type, relationship, and/or context; in that case, based on analyzing the data, it is classified with a particular biomedical identifier and/or context according to a schema/index of the biomedical knowledge graph 220.
  • FIG. 3 is an illustrative diagram of a process for extracting biomedical data from a general data source according to some embodiments.
  • General data source 310 may include periodicals, website content, medical records or reports, and/or other sources in which the data may not be structured in a predetermined way.
  • This general data may be analyzed at 320 and extracted, such as by utilizing an NLP (e.g., NER), to identify biomedical information.
  • the analyzing and extraction may include identifying and extracting triples (e.g., entities and relationships) from one or more sentences of unstructured text. Entities, relationships, and contexts are based on the analyzed data.
  • the extracted entities, relationships, and/or contexts are then incorporated into the biomedical knowledge graph 330.
  • Incorporation can include adding, updating, or removing portions of the biomedical knowledge graph 330 based on analyzing and/or comparing the extracted data with data in the biomedical knowledge graph 330.
  • FIG. 4 is an illustrative diagram of a process for extracting biomedical data from clinical trial sources according to some embodiments.
  • Clinical trials sources 410 provide information about clinical trials including, for example, government reporting sites, academic centers, journal publications, internal/proprietary testing records, and other sources.
  • The format of the trials data may include summaries, statistics, and/or raw data. Information based on summaries and statistics may be obtained in a similar way as that described in reference to general data sources and translated into entities and relationships for a clinical trials graph 420.
  • Raw data may also be extracted, after which statistics are calculated on the raw data and used, for example, to establish entities and relationships for the clinical trials knowledge graph 420.
  • Clinical trials graph 420 is then incorporated into (e.g., adding, augmenting) biomedical knowledge graph 430. Certain entities and relationships, for example, can be added, removed, or updated based on comparing them with those of biomedical knowledge graph 430.
  • a weight is determined and associated with clinical trials data 410 before it is incorporated. For example, the strength of a relationship may be determined based on a determined reliability or accuracy of the associated data and associated with it in the clinical trials graph 420. The weight or strength of identified relationships may then, for example, be represented in results or analysis (e.g., trends) generated in response to queries (e.g., by query engine 175 of FIG. 1).
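The text does not say how a relationship's strength is computed from reliability. One possible choice, shown purely as an assumption, is a noisy-OR combination in which each independent piece of evidence with reliability r in [0, 1] raises the combined strength; the function name is hypothetical.

```python
import math


def relationship_strength(reliabilities):
    """Combine per-source reliability scores with a noisy-OR rule (an assumption)."""
    for r in reliabilities:
        if not 0.0 <= r <= 1.0:
            raise ValueError("reliability scores must lie in [0, 1]")
    # Strength is the probability that at least one source is correct,
    # treating sources as independent.
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


strength = relationship_strength([0.5, 0.5])
```

Under this rule, additional corroborating sources can only increase the strength, and a relationship with no evidence has strength zero.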
  • FIG. 5 is a representative schema for a biomedical knowledge graph according to some embodiments. Numerous exemplary biomedical entity types 510 and relationship types 520 are represented. When populating a biomedical knowledge graph, elements or entities of extracted data (e.g., a disease, gene) are correlated with one or more biomedical entity types, and recorded as nodes by their type(s). Relationships with other entities are identified by relationship type (e.g., gene expressions) and recorded as connections between nodes. In some embodiments, relationships are identified based on determining a particular context in which the data is presented (e.g., type of clinical trial, diseased or healthy patients, stage of cancer, geographic location) and can be enhanced with machine learning techniques. In some embodiments, entities are classified as a gene, sequence, anatomic structure, chemical substance, disease, and/or phenotypic feature.
  • FIG. 6 is a process for executing a query of a biomedical knowledge graph according to some embodiments.
  • a structured or unstructured query (or question) is received (e.g., “what are the top treatments for squamous-cell carcinoma of the lung?”).
  • the query is translated and structured into a format (if necessary) for performing a lookup utilizing a query index search 620 and index of terms 625.
  • Results from the search are generated and may include results pertaining to many possible contexts 630A1, 630A2, ..., 630AN (e.g., stage of cancer).
  • The result(s) are analyzed and verified for conformity with particular quality criteria (e.g., utilizing a generative adversarial network (GAN)). Based on the analyzing, conforming/verified results 650A,...,650M are presented as answers answer(1) through answer(m).
  • the answers may be transmitted to a recipient (e.g., across a computer network) and displayed to a user (e.g., in a graphical user interface).
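The query flow of FIG. 6 (term lookup, context-specific results, verification, answers) can be sketched as below. The index contents, result records, and the simple score threshold are hypothetical; in particular, the threshold merely stands in for the quality verification step and is not a GAN.

```python
# Hypothetical index of terms mapping query phrases to graph entity identifiers.
INDEX_OF_TERMS = {"squamous-cell carcinoma of the lung": "disease:scc_lung"}

# Hypothetical stored results, each tied to a context (e.g., stage of cancer).
RESULTS_BY_ENTITY = {
    "disease:scc_lung": [
        {"answer": "therapy_a", "context": "stage I", "score": 0.9},
        {"answer": "therapy_b", "context": "stage IV", "score": 0.4},
    ]
}


def run_query(question, min_score=0.5):
    """Look up entities in the question, gather results, and keep verified ones."""
    text = question.lower()
    entity_ids = [eid for term, eid in INDEX_OF_TERMS.items() if term in text]
    candidates = [r for eid in entity_ids for r in RESULTS_BY_ENTITY.get(eid, [])]
    # Stand-in quality check: keep results at or above a score threshold.
    verified = [r for r in candidates if r["score"] >= min_score]
    return sorted(verified, key=lambda r: r["score"], reverse=True)


answers = run_query(
    "What are the top treatments for squamous-cell carcinoma of the lung?"
)
```

Only results that pass the verification step are returned as answers, ordered from highest to lowest score.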
  • FIG. 7A is a graphical user interface for building a query according to some embodiments.
  • the interface provides fields for entering search terms (e.g., at 710, 715, and 730), relationships between the terms (e.g., at 720), and qualifiers (e.g., at 735) pertaining to the terms and relationships for building a query.
  • fields for terms, relationships, and/or qualifiers are configured with pre-populated and selectable entries or options.
  • FIG. 7B is executable query code generated from the query built from the interface of FIG. 7A. Based on the entries and/or selections entered in the interface of FIG. 7A, query code for executing a biomedical graph query is generated such as illustrated.
  • The code may be generated according to a particular query format or language such as those known to one of ordinary skill in the art (e.g., SQL, GraphQL, XQuery, JSONiq, etc.).
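Generating query code from interface fields can be sketched with SQL as the target language. The table name `relationships`, its columns, and the `build_query` function are hypothetical; the example uses Python's standard `sqlite3` module only to show that the generated query executes.

```python
import sqlite3


def build_query(source_entity=None, relation=None, target_entity=None):
    """Build a parameterized SQL query from the fields a user filled in."""
    clauses, params = [], []
    fields = (
        ("source_entity", source_entity),
        ("relation", relation),
        ("target_entity", target_entity),
    )
    for column, value in fields:
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    sql = "SELECT source_entity, relation, target_entity FROM relationships"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


# Hypothetical in-memory table standing in for the knowledge graph store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relationships (source_entity TEXT, relation TEXT, target_entity TEXT)"
)
conn.executemany(
    "INSERT INTO relationships VALUES (?, ?, ?)",
    [("compound_x", "treats", "disease_y"), ("test_z", "diagnoses", "disease_y")],
)
sql, params = build_query(relation="treats", target_entity="disease_y")
rows = conn.execute(sql, params).fetchall()
```

Using placeholders rather than string interpolation for the values keeps user-entered terms from altering the query structure.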
  • a query may be received in textual form.
  • the query is then parsed (e.g., utilizing an NLP/NER) and analyzed to identify biomedical entities and relationships to be searched in connection with the query.
  • the identified entities and relationships can be identified/correlated with respect to those of the biomedical knowledge graph.
  • the biomedical knowledge graph is searched (e.g., by way of an index of terms) for the identified entities relating to the queried relationships (e.g., a cluster of relationships) between the entities.
  • the search may be narrowed to relationships between the entities that best correspond to the query (e.g., a context identified/resolved from the query text).
  • Results may be further evaluated for conformance with expected standards (e.g., based on a trained adversarial network).
  • the results may be translated/converted into a particular format (e.g., sentence form) and/or used to generate statistical and/or trend analysis (e.g., as illustrated in FIGs. 9A and 9B).
  • FIG. 8A is query output from a biomedical knowledge graph in a navigable knowledge graph format according to some embodiments.
  • Numerous results may be generated in response to some queries, particularly broad queries. In order to permit easier navigation of these results, they are generated in a knowledge graph format including the potentially numerous result entities and relationships between them.
  • Because the queried source is originally in a knowledge graph form, generating results in similar form is relatively fast and efficient.
  • links or citations to original sources e.g., publications, clinical trial data, images
  • FIG. 8B is a query output from a biomedical knowledge graph in a tabular representation according to some embodiments.
  • A set of results may also be presented in a table or spreadsheet form such as in rows and columns of various result fields. That way, the results themselves may be more easily organized, analyzed, and/or searched. For example, charts or plots may be generated based on tabulated results such as illustrated in FIGs. 9A and 9B.
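A tabular rendering of query results can be sketched with Python's standard `csv` module. The field names and the sample result are hypothetical.

```python
import csv
import io


def results_to_table(results, fields):
    """Render a list of result dictionaries as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()


table = results_to_table(
    [{"disease": "disease_y", "treatment": "therapy_a", "success_rate": 0.75}],
    fields=["disease", "treatment", "success_rate"],
)
```

The resulting text can be opened in a spreadsheet tool or passed to a plotting step.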
  • FIG. 9A is a trend output from a biomedical knowledge graph according to some embodiments. Based on analyzing results of one or more queries of a biomedical knowledge graph, trends in the result data are generated and presented. For example, a chart of the relative proportions of patients successfully responding to particular therapies can be presented with respect to each other.
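The relative proportions behind such a chart can be computed as in the following sketch, which assumes each result row records a therapy and whether the patient responded; the field names and data are hypothetical.

```python
from collections import Counter


def response_proportions(results):
    """Share of all successful responses attributable to each therapy."""
    counts = Counter(r["therapy"] for r in results if r["responded"])
    total = sum(counts.values())
    if total == 0:
        return {}
    return {therapy: n / total for therapy, n in counts.items()}


trial_results = [
    {"therapy": "therapy_a", "responded": True},
    {"therapy": "therapy_a", "responded": True},
    {"therapy": "therapy_a", "responded": True},
    {"therapy": "therapy_b", "responded": True},
    {"therapy": "therapy_b", "responded": False},
]
proportions = response_proportions(trial_results)
```

The returned mapping can be fed directly to a pie or bar chart of therapies relative to each other.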
  • FIG. 9B is a graphical representation of predicted responses for a particular disease or condition in response to particular therapeutic compounds.
  • Predicted responses may be based on machine learning models trained with data accessible in a biomedical knowledge graph.
  • particular compounds may be predicted to access therapeutic pathways or provide outcomes based on historical use of the compound for the treatment of other diseases, similarities in chemical structure to other compounds with positive outcomes, and/or related treatments of similar conditions.
  • FIG. 10A is an exemplary process for training and utilizing a machine learning system for predicting alternative paths between therapeutic indicators using a biomedical knowledge graph according to some embodiments.
  • FIG. 10B is an exemplary output of paths between a sequence variant to drug found within a biomedical knowledge graph according to some embodiments.
  • FIG. 10C is an exemplary output of an alternative path based on the machine learning process of FIG. 10A.
  • the aspects 1010 of the process include obtaining therapeutic indicator pairs from the biomedical knowledge graph and determining all possible alternative paths between the pairs.
  • paths between the pairs are selected (e.g., based on a type of path) and the biomedical knowledge graph data is further searched to identify all pairs on the selected paths.
  • Criteria are used to narrow the identified pairs to those most likely to have a positive therapeutic impact.
  • A portion (e.g., 70%) of the narrowed set is used to train a machine learning model to generate responses to queries, analytics, and/or clinical predictions (e.g., predicting effective therapies, alternative paths), for example, while another portion (e.g., 30%) is used to test hypotheses generated by the model.
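Two pieces of this process lend themselves to a short sketch: enumerating paths between a pair of entities, and splitting a narrowed set into training and test portions. The graph contents, the hop limit, and both function names are assumptions for illustration; the model training itself is not shown.

```python
import random


def simple_paths(graph, start, goal, max_hops=4):
    """Enumerate loop-free paths from start to goal with at most max_hops edges."""
    stack = [(start, [start])]
    paths = []
    while stack:
        node, path = stack.pop()
        if node == goal:
            paths.append(path)
            continue
        if len(path) > max_hops:
            continue  # Extending further would exceed the hop limit.
        for nxt in graph.get(node, ()):
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return paths


def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle deterministically and split into training and test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical adjacency list linking a sequence variant to a drug.
toy_graph = {
    "variant_v": ["gene_g", "protein_p"],
    "gene_g": ["drug_d"],
    "protein_p": ["gene_g"],
}
paths = simple_paths(toy_graph, "variant_v", "drug_d")
train, test = train_test_split(range(10))
```

Fixing the shuffle seed makes the 70/30 split reproducible, which helps when comparing models trained on the same narrowed set.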
  • FIG. 11 is an illustrative diagram of computing devices and processing components of a system for generating and utilizing biomedical knowledge graphs according to some embodiments. Any of the computer systems mentioned herein may utilize any suitable number of subsystems.
  • a computer system includes a single computer apparatus, where the subsystems can be the components of the computer apparatus.
  • a computer system can include multiple computer apparatuses, each being a subsystem, with internal components.
  • a computer system can include desktop and laptop computers, tablets, mobile phones and other mobile devices.
  • A cloud infrastructure (e.g., Amazon Web Services), a graphical processing unit (GPU), etc. can be used to implement the disclosed techniques.
  • The subsystems shown in FIG. 11 are interconnected via a system bus 75. Additional subsystems such as a printer 74, keyboard 78, storage device(s) 79, monitor 76, which is coupled to display adapter 82, and others are shown. Peripherals and input/output (I/O) devices, which couple to I/O controller 71, can be connected to the computer system by any number of means known in the art such as input/output (I/O) port 77 (e.g., USB, FireWire®). For example, I/O port 77 or external interface 81 (e.g., Ethernet, Wi-Fi, etc.) can be used to connect computer system 10 to a wide area network such as the Internet, a mouse input device, or a scanner.
  • the interconnection via system bus 75 allows the central processor 73 to communicate with each subsystem and to control the execution of a plurality of instructions from system memory 72 or the storage device(s) 79 (e.g., a fixed disk, such as a hard drive, or optical disk), as well as the exchange of information between subsystems.
  • the system memory 72 and/or the storage device(s) 79 may embody a computer readable medium.
  • Another subsystem is a data collection device 85, such as a camera, microphone, accelerometer, and the like. Any of the data mentioned herein can be output from one component to another component and can be output to the user.
  • a computer system can include a plurality of the same components or subsystems, e.g., connected together by external interface 81 or by an internal interface.
  • In some embodiments, computer systems, subsystems, or apparatuses can communicate over a network.
  • one computer can be considered a client and another computer a server, where each can be part of a same computer system.
  • a client and a server can each include multiple systems, subsystems, or components.
  • aspects of embodiments can be implemented in the form of control logic using hardware (e.g. an application specific integrated circuit or field programmable gate array) and/or using computer software with a generally programmable processor in a modular or integrated manner.
  • a processor includes a single-core processor, multi-core processor on a same integrated chip, or multiple processing units on a single circuit board or networked.
  • Any of the software components or functions described in this application may be implemented as software code to be executed by a processor using any suitable computer language such as, for example, Java, C, C++, C#, Objective-C, Swift, or scripting language such as Perl or Python using, for example, conventional or object-oriented techniques.
  • the software code may be stored as a series of instructions or commands on a computer readable medium for storage and/or transmission.
  • a suitable non-transitory computer readable medium can include random access memory (RAM), a read only memory (ROM), a magnetic medium such as a hard drive or a floppy disk, or an optical medium such as a compact disk (CD) or DVD (digital versatile disk), flash memory, and the like.
  • the computer readable medium may be any combination of such storage or transmission devices.
  • Such programs may also be encoded and transmitted using carrier signals adapted for transmission via wired, optical, and/or wireless networks conforming to a variety of protocols, including the Internet.
  • a computer readable medium may be created using a data signal encoded with such programs.
  • Computer readable media encoded with the program code may be packaged with a compatible device or provided separately from other devices (e.g., via Internet download). Any such computer readable medium may reside on or within a single computer product (e.g. a hard drive, a CD, or an entire computer system), and may be present on or within different computer products within a system or network.
  • a computer system may include a monitor, printer, or other suitable display for providing any of the results mentioned herein to a user.
  • any of the methods described herein may be totally or partially performed with a computer system including one or more processors, which can be configured to perform the steps.
  • embodiments can be directed to computer systems configured to perform the steps of any of the methods described herein, potentially with different components performing a respective step or a respective group of steps.
  • steps of methods herein can be performed at a same time or in a different order. Additionally, portions of these steps may be used with portions of other steps from other methods. Also, all or portions of a step may be optional. Additionally, any of the steps of any of the methods can be performed with modules, units, circuits, or other means for performing these steps.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A biomedical knowledge graph system includes a computer database of records, the records comprising nodes of biomedical entities and connections between the entities representing biomedical relationships. One or more processors are programmed and configured to extract data from a plurality of data sources and to determine biomedical entities and relationships between the entities based on analyzing the data, wherein analyzing the data comprises searching for predetermined identifiers or patterns in the data. Based on the determined biomedical entities, each biomedical entity is assigned to a cluster of biomedical entity types. The one or more processors are configured to identify a context for each of the identified biomedical entities based on the assigned cluster and based on elements of the expression of the biomedical data within which the entity is expressed. Based on the identified context and type of the biomedical entity, records of nodes and connections between nodes are incorporated into the knowledge graph, the nodes representing biomedical entities and the connections representing biomedical relationships between the entities, structured according to a predefined schema.

Description

[0001] Rapidly obtaining the most updated and accurate information for performing clinical care and biomedical research is critical. Vast amounts of information are distributed through a wide variety of resources including published guidelines, periodicals, clinical studies, and online medical compendiums. Searching through these materials and extracting relevant updated information can be cumbersome and time consuming.
[0002] Finding updated information often involves use of generalized search tools with key words or phrases, often yielding inconsistent results and non-relevant material (e.g., non-medical or outdated material). Some compilations of materials may not be updated with the most recent and accurate information. Updating them often relies on manual review or verification by skilled (or unskilled) personnel, making them potentially unreliable and/or outdated.
[0003] There is thus a need for methods of obtaining relevant and updated biomedical information in an efficient and timely manner.
[0004] Aspects of the disclosure provide systems and methods for generating and operating a biomedical knowledge graph from a plurality of disparate sources in a context-based graphical structure. Queries for biomedical information may be adapted and used to search the knowledge graphs in a highly efficient and targeted manner for obtaining biomedical information. The sources of information may include periodicals, clinical trial results, biomedical compendiums, news articles, and other sources, including online sources.
[0005] A biomedical knowledge graph is generated by first accessing source material and extracting it (e.g., using a natural language processor (NLP)), based on which medical data entities and relationships between the entities are established. Entities may include certain diseases, therapies, and tests, for example, and a relationship can be defined by an association between entities. A relationship, for example, may include a test entity used to diagnose a particular disease entity or a therapy to treat the disease. Multiple relationships between multiple entities may be extracted. In some embodiments, these entities and relationships are established and/or verified utilizing machine-learning analysis for parsing and extracting information from multiple sources and validating their relevance and accuracy based on the analysis.
[0006] In some embodiments, after obtaining data from one or more sources, predetermined identifiers or patterns in the data are identified, such as through the use of an identifier index or named entity resolution (NER) module, to determine entities and biomedical entity types. Clusters of biomedical entity types (or themes) are established (e.g., by machine learning) for particular types of biomedical entities to which they are assigned. Context types for the biomedical entities are identified based on the assigned cluster and/or by analyzing aspects of the data from which the entities were extracted. Based on the identified context type and entity type of the biomedical data, an entry or record is added into a biomedical knowledge graph according to a predefined schema. A context may include an identification of a biomarker associated with a disease, a gene sequence, and/or a hypothesis within a medical publication, for example.
[0007] In some embodiments, queries for biomedical information or answers from the biomedical knowledge graph are facilitated through a query engine configured to interpret structured or unstructured queries (e.g., natural language questions, standard query language-based queries). In some embodiments, the query is converted into a structured query form based on the predefined schema of the biomedical knowledge graph for rapid access and retrieval of relevant information. Conversion may include use of a natural language processor (e.g., named entity resolution) to correlate portions of a query with entities and entity types of the biomedical knowledge graph, along with the relationships inquired about.
[0008] Query results or answers can include a traversable graph of entities and their relationships pertaining to the query. Results can also include tabulated results with corresponding information about each particular result and its elements (e.g., treatment for a particular disease and success rate). Links or summaries of the sources of the information (e.g., particular clinical trials) may be embedded or accessible with generated results. Reports including statistical analysis of the results may also be generated.
[0009] Various objects and advantages of the disclosure will be apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings, in which like reference characters refer to like parts throughout, and in which:
[0010] FIG. 1 is an illustrative diagram of processes for constructing a biomedical knowledge graph from disparate data sources according to some embodiments.
[0011] FIG. 2A is an illustrative diagram of a process for constructing a biomedical knowledge graph from a generalized knowledge graph according to some embodiments.
[0012] FIG. 2B is an exemplary list of biomedical entity types from a generalized knowledge graph.
[0013] FIG. 2C is an exemplary list of relationship types between entities extracted from a generalized knowledge graph according to some embodiments.
[0014] FIG. 3 is an illustrative diagram of a process for extracting biomedical data from a general data source according to some embodiments.
[0015] FIG. 4 is an illustrative diagram of a process for extracting biomedical data from clinical trial sources according to some embodiments.
[0016] FIG. 5 is a representative schema for a biomedical knowledge graph according to some embodiments.
[0017] FIG. 6 is a process for executing a query of a biomedical knowledge graph according to some embodiments.
[0018] FIG. 7A is a graphical user interface for building a query according to some embodiments.
[0019] FIG. 7B is executable query code generated from the query built from the interface of FIG. 7A.
[0020] FIG. 8A is query output from a biomedical knowledge graph in a navigable knowledge graph representation according to some embodiments.
[0021] FIG. 8B is a query output from a biomedical knowledge graph in a tabular representation according to some embodiments.
[0022] FIG. 9A is a trend output from a biomedical knowledge graph according to some embodiments.
[0023] FIG. 9B is a graphical representation of predicted responses for a particular disease or condition in response to particular therapeutic compounds according to some embodiments.
[0024] FIG. 10A is an exemplary process for training and utilizing a machine learning system for predicting alternative paths between therapeutic indicators using a biomedical knowledge graph according to some embodiments.
[0025] FIG. 10B is an exemplary output of paths from a sequence variant to a drug found within a biomedical knowledge graph according to some embodiments.
[0026] FIG. 10C is an exemplary output of an alternative path based on the machine learning process of FIG. 10A.
[0027] FIG. 11 is an illustrative diagram of computing devices and processing components of a system for generating and utilizing biomedical knowledge graphs according to some embodiments.
Detailed Description
[0028] Aspects of the disclosure include methods and systems for generating and querying biomedical knowledge graphs constructed from numerous disparate data resources. The resources from which the biomedical knowledge graphs are generated include existing generalized knowledge graphs and biomedical-specific resources such as, for example, clinical trial data and other published materials. Some embodiments include systems and methods for extracting data from these resources and building/updating/augmenting biomedical knowledge graphs according to a uniform schema using methods adapted to the type and form of resource from which the data is extracted. Some embodiments include methods for querying the biomedical knowledge graphs in order to obtain results to reflect optimally up-to-date, relevant, and accurate medical information.
[0029] FIG. 1 is an illustrative diagram of processes for constructing a biomedical knowledge graph from disparate data sources according to some embodiments. A biomedical knowledge graph 170 is constructed by extracting from existing biomedical data that may include a knowledge graph 150 of generalized data (e.g., including but not limited to biomedical data), biomedical publications 130 such as journal articles, and clinical trials data 100A (e.g., summarized from a publication) and clinical trials relational data 100B (e.g., raw data records).
[0030] In some embodiments, a biomedical knowledge graph represents nodes of biomedical entities connected by contextual relationships between the entities. For example, a node may represent a disease, condition, therapy, molecule, etc., and a relationship may be a treatment or therapy with a molecule for a disease or condition. These nodes and relationships may be represented according to a pre-defined schema or structure.
[0031] A generalized (or generic) knowledge graph 180 (e.g., which can include any graph-structured representation of data) is accessed and parsed at 185 to identify portions of the knowledge graph related to biomedical information. The identified portions are represented as biomedical subset graphs 145. For example, a generalized knowledge graph may include Wikidata, DBpedia (https://www.dbpedia.org/), and/or many others.
[0032] In order to extract biomedical data from a generalized knowledge graph 150, some data is identified by particular identifiers or predicates (e.g., alphanumeric IDs) already known or classified as representing biomedical information (e.g., particular diseases, therapeutic compounds, and/or types thereof). In some embodiments, data is identified as biomedical information utilizing a named entity resolution (NER) and/or machine learning component such as further described herein.
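As a minimal sketch of the identifier-based filtering described above, the following Python snippet keeps only those statements of a generalized graph whose subject is typed with an identifier already known to denote biomedical information. The triple layout, the `instance_of` predicate name, and every identifier shown (e.g., `T:disease`, `E1`) are hypothetical placeholders chosen for illustration and are not identifiers of any actual knowledge graph.

```python
from typing import Iterable, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical type identifiers assumed to be pre-classified as biomedical.
BIOMEDICAL_TYPE_IDS = {"T:disease", "T:drug", "T:gene"}


def biomedical_subset(triples: Iterable[Triple]) -> list[Triple]:
    """Return the triples whose subject has a known biomedical type."""
    triples = list(triples)
    # Pass 1: collect subjects explicitly typed with a biomedical identifier.
    biomedical_subjects = {
        t.subject
        for t in triples
        if t.predicate == "instance_of" and t.obj in BIOMEDICAL_TYPE_IDS
    }
    # Pass 2: keep every statement made about those subjects.
    return [t for t in triples if t.subject in biomedical_subjects]


generalized_graph = [
    Triple("E1", "instance_of", "T:disease"),
    Triple("E1", "treated_by", "E2"),
    Triple("E2", "instance_of", "T:drug"),
    Triple("E3", "instance_of", "T:city"),
    Triple("E3", "located_in", "E4"),
]
subset = biomedical_subset(generalized_graph)
```

In this sketch the statements about `E3` (a non-biomedical entity) are dropped, while all statements about `E1` and `E2` are retained. Entities lacking an explicit type would need the NER and/or machine learning classification mentioned above, which is not shown here.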
[0033] In some embodiments, the biomedical subsets 145 are (re)structured utilizing a predefined schema and/or ontology consistent across the biomedical knowledge graph 170. In some embodiments, subsets 145 of a generalized knowledge graph form the basis of the biomedical knowledge graph 170, which may then be augmented using data extracted from other sources such as biomedical publications 130, clinical trial data 100A and 100B, and information/updates of generalized knowledge graphs such as further described herein.
[0034] In order to extract biomedical data from textual/graphical biomedical publications or records 130 that are typically unstructured, the text/graphics of the publications or records 130 are parsed, such as by utilizing an NLP/NER, image processing module, and/or machine learning system. Triples (or other structured forms) of the extracted data are generated at 135. Some examples of entity resolution methods include NCBI Gene and Spark NLP. Other examples of deriving biomedical relationships from text include Percha B, Altman RB. “A global network of biomedical relationships derived from text.” Bioinformatics. 2018 Aug 1;34(15):2614-2624. The extracted/structured biomedical data is incorporated into a (subset) biomedical knowledge graph 140 consistent with the predefined schema/ontology of the biomedical knowledge graph 170. Nodes and relationships of the biomedical knowledge graph may be stored within a computer database 155A and indexed within source index 155B.
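The generation of triples from unstructured text can be illustrated with a deliberately simplified sketch. Here a dictionary lookup and regular expressions stand in for the NLP/NER module; this is a substitute technique used only for illustration, not the extraction method of the disclosure. The identifier index, the relationship patterns, and all entity IDs are hypothetical.

```python
import re

# Hypothetical identifier index: surface form -> (entity ID, entity type).
IDENTIFIER_INDEX = {
    "erlotinib": ("DRUG:001", "therapy"),
    "lung cancer": ("DIS:001", "disease"),
    "egfr test": ("TEST:001", "test"),
}

# Hypothetical relationship patterns: regular expression -> relationship type.
RELATION_PATTERNS = {
    r"\btreats\b": "treats",
    r"\bdiagnoses\b": "diagnoses",
}


def extract_triples(sentence: str) -> list[tuple[str, str, str]]:
    """Extract (subject ID, relationship, object ID) triples from one sentence."""
    text = sentence.lower()
    triples = []
    for pattern, relation in RELATION_PATTERNS.items():
        match = re.search(pattern, text)
        if match is None:
            continue
        left, right = text[: match.start()], text[match.end():]
        # An indexed term before the verb is a subject; one after it is an object.
        subjects = [eid for term, (eid, _) in IDENTIFIER_INDEX.items() if term in left]
        objects = [eid for term, (eid, _) in IDENTIFIER_INDEX.items() if term in right]
        triples.extend((s, relation, o) for s in subjects for o in objects)
    return triples


triples = extract_triples("Erlotinib treats lung cancer.")
```

A sentence containing no indexed term or no known relationship pattern yields an empty list, so nothing is added to the graph for it.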
[0035] In some embodiments, data for the knowledge graph 170 is obtained or enhanced from clinical trials and/or other trials, medical images, diagnostic test data, and/or other historical or analytical data sources 100A and 100B (e.g., FDA submissions). In some embodiments, clinical trials data may be in textual form in 100A (e.g., from a report or journal publication) and/or in a data format in 100B (e.g., a relational table of results). Once the data is extracted, it can be collated into normalized datasets 110.
[0036] The normalized data sets 110 are then translated/converted into normalized knowledge graphs 120. The normalized knowledge graphs 120 are then used to augment biomedical knowledge graph 170. For example, the relationship and correlation between entities in the knowledge graph (e.g., a compound and treatment of a condition) may be established, reinforced, or discounted by the data.
[0037] In some embodiments, these various data sources are periodically monitored (e.g., “scraped”) to determine if new/updated data is available for augmenting/updating biomedical knowledge graph 170. For example, data that is periodically identified in these data sources can be compared to the entities and relationships in the knowledge graph 170 to determine if an update is needed. In some embodiments, information about the original source (e.g., website URL, citation) from which the data originated is stored with the biomedical knowledge graph and made accessible in results of queries performed on the knowledge graph that contain entities and relationships pertaining to the search results.
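The comparison of newly identified data against the existing graph, together with the retention of source information, might be sketched as follows. The in-memory dictionary representation, the function name `apply_scrape`, and the example URLs are assumptions made for illustration only.

```python
Triple = tuple[str, str, str]


def apply_scrape(graph: dict[Triple, set[str]], scraped: dict[Triple, str]) -> list[Triple]:
    """Merge scraped triples into the graph and return the newly added ones.

    `graph` maps each triple to the set of source citations supporting it;
    `scraped` maps each freshly extracted triple to the source it came from.
    """
    added = []
    for triple, source in scraped.items():
        if triple not in graph:
            graph[triple] = set()
            added.append(triple)
        # Record provenance even for relationships that already exist.
        graph[triple].add(source)
    return added


graph = {("DRUG:001", "treats", "DIS:001"): {"https://example.org/a"}}
scraped = {
    ("DRUG:001", "treats", "DIS:001"): "https://example.org/b",
    ("TEST:001", "diagnoses", "DIS:001"): "https://example.org/c",
}
added = apply_scrape(graph, scraped)
```

An empty `added` list indicates that the monitored source contributed no new relationships, although its citation is still attached to the relationships it corroborates.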
[0038] Once biomedical knowledge graph 170 has been generated, queries can be performed utilizing the knowledge graph to rapidly obtain requested biomedical information and/or to perform analysis (e.g., machine learning model generation as described in reference to FIGs. 10A-10C) for predicting biomedical relationships and/or reporting trends. A query engine 175 may be utilized to perform queries such as received through an external interface (e.g., a GUI as further described in reference to FIGs. 7A and 7B). Query engine 175 is configured to perform queries particularly adapted to biomedical knowledge graphs as described further herein.
[0039] FIG. 2A is an illustrative diagram of a process for constructing a biomedical knowledge graph from a generalized knowledge graph. A generalized knowledge graph 200 may include information regarding a wide array of data types and sources. For example, WikiData provides a knowledge graph with a large breadth of data pertaining to history, social topics, politics, and science, including biomedical information. Some portions of the generalized knowledge graph 210 may be identified as pertaining to biomedical information. Identification may be performed utilizing identifiers or predicates associated with the generalized knowledge graph that identify entities and relationships by type or context (e.g., particular disease, treatment, or a type of biomedical information).
[0040] FIG. 2B is an exemplary list of biomedical entity types from a generalized knowledge graph and FIG. 2C is an exemplary list of relationship types between entities extracted from a generalized knowledge graph according to some embodiments. As indicated above, these types and/or relationships may be explicitly identified and represented as biomedical-related. In some embodiments, the types or contexts (and/or the data itself) of a generalized knowledge graph are analyzed (e.g., using NLP/NER) and classified as biomedical information (or not) based on the analysis. In some embodiments, the relationships include one or more of positively regulating or negatively regulating biological factors. In some embodiments, the relationships include at least one of an action via a receptor, indirect alteration of an effect of an endogenous agonist, the inhibition of transport processes, enzyme inhibition, enzymatic action or activation of enzymatic activity, chelation, osmosis, and/or anesthesia.
[0041] After portions of the generalized knowledge graph are identified as biomedical information, these portions are incorporated into a biomedical knowledge graph 220. The incorporated portions may be used as a basis or foundation for a biomedical knowledge graph and/or used to augment an existing biomedical knowledge graph 220. In some embodiments, the schema of the biomedical knowledge graph 220 maintains the original schema of the generalized knowledge graph 200 or, in some embodiments, incorporated data is adapted/converted to another predetermined schema. For example, information from the original generalized knowledge graph 200 may not have utilized a particular predicate or identifier for a type of biomedical type, relationship, and/or context and, based on analyzing the data, it is classified with a particular biomedical identifier and/or context according to a schema/index of the biomedical knowledge graph 220.
[0042] FIG. 3 is an illustrative diagram of a process for extracting biomedical data from a general data source according to some embodiments. General data source 310 may include periodicals, website content, medical records or reports, and/or other sources in which the data may not be structured in a predetermined way. This general data may be analyzed at 320 and extracted, such as by utilizing an NLP (e.g., NER), to identify biomedical information. The analyzing and extraction may include identifying and extracting triples (e.g., entities and relationships) from one or more sentences of unstructured text. Entities, relationships, and contexts are determined based on the analyzed data. The extracted entities, relationships, and/or contexts are then incorporated into the biomedical knowledge graph 330. Incorporation can include adding, updating, or removing portions of the biomedical knowledge graph 330 based on analyzing and/or comparing the extracted data with data in the biomedical knowledge graph 330.
[0043] FIG. 4 is an illustrative diagram of a process for extracting biomedical data from clinical trial sources according to some embodiments. Clinical trials sources 410 provide information about clinical trials including, for example, government reporting sites, academic centers, journal publications, internal/proprietary testing records, and other sources. In some cases, the format of the trials data may include summaries, statistics, and/or raw data. Information based on summaries and statistics may be obtained in a similar way as that described in reference to general data sources and translated into entities and relationships for a clinical trials graph 420. Raw data may also be extracted, after which statistics are calculated on the raw data and used, for example, to establish entities and relationships for the clinical trials knowledge graph 420.
[0044] Clinical trials graph 420 is then incorporated into (e.g., adding, augmenting) biomedical knowledge graph 430. Certain entities and relationships, for example, can be added, removed, or updated based on comparing them with those of biomedical knowledge graph 430.
[0045] In some embodiments, a weight is determined and associated with clinical trials data 410 before it is incorporated. For example, the strength of a relationship may be determined based on a determined reliability or accuracy of the associated data and associated with it in the clinical trials graph 420. The weight or strength of identified relationships may then, for example, be represented in results or analysis (e.g., trends) generated in response to queries (e.g., by query engine 175 of FIG. 1).
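One way a relationship strength could be derived from the reliability of its supporting data is sketched below. The disclosure does not specify a formula; the noisy-OR combination used here, the `Evidence` record, and the reliability scores are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Evidence:
    source: str
    reliability: float  # assumed score in [0, 1]; its derivation is not specified here


def relationship_strength(evidence: list[Evidence]) -> float:
    """Combine per-source reliabilities with a noisy-OR rule (an assumed choice)."""
    # The relationship is treated as unsupported only if every source is wrong.
    return 1.0 - prod(1.0 - e.reliability for e in evidence)


strength = relationship_strength(
    [Evidence("trial A", 0.5), Evidence("trial B", 0.5)]
)
```

Under this assumed rule, corroborating sources reinforce a relationship, and a relationship with no supporting evidence has a strength of zero.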
[0046] FIG. 5 is a representative schema for a biomedical knowledge graph according to some embodiments. Numerous exemplary biomedical entity types 510 and relationship types 520 are represented. When populating a biomedical knowledge graph, elements or entities of extracted data (e.g., a disease, gene) are correlated with one or more biomedical entity types, and recorded as nodes by their type(s). Relationships with other entities are identified by relationship type (e.g., gene expressions) and recorded as connections between nodes. In some embodiments, relationships are identified based on determining a particular context in which the data is presented (e.g., type of clinical trial, diseased or healthy patients, stage of cancer, geographic location) and can be enhanced with machine learning techniques. In some embodiments, entities are classified as a gene, sequence, anatomic structure, chemical substance, disease, and/or phenotypic feature.
[0047] FIG. 6 is a process for executing a query of a biomedical knowledge graph according to some embodiments. At 610, a structured or unstructured query (or question) is received (e.g., “what are the top treatments for squamous-cell carcinoma of the lung?”). In some embodiments, the query is translated and structured into a format (if necessary) for performing a lookup utilizing a query index search 620 and index of terms 625. At 630, results from the search are generated and may include results pertaining to many possible contexts 630A1, 630A2, ..., 630AN (e.g., stage of cancer). At 640, the result(s) are analyzed and verified for conformity with particular quality criteria (e.g., utilizing a generative adversarial network (GAN)). Based on the analyzing, conforming/verified results 650A, ..., 650M are presented as answers answer(1), ..., answer(m). The answers may be transmitted to a recipient (e.g., across a computer network) and displayed to a user (e.g., in a graphical user interface).
[0048] FIG. 7A is a graphical user interface for building a query according to some embodiments. The interface provides fields for entering search terms (e.g., at 710, 715, and 730), relationships between the terms (e.g., at 720), and qualifiers (e.g., at 735) pertaining to the terms and relationships for building a query. In some embodiments, fields for terms, relationships, and/or qualifiers are configured with pre-populated and selectable entries or options.
[0049] FIG. 7B is executable query code generated from the query built from the interface of FIG. 7A. Based on the entries and/or selections entered in the interface of FIG. 7A, query code for executing a biomedical graph query is generated such as illustrated. The code may be generated according to a particular query format or language such as those known to one of ordinary skill in the art (e.g., SQL, GraphQL, XQuery, JSONiq, etc.).
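The generation of query code from interface entries can be sketched in Python using SQL as the target language. This is not the code of FIG. 7B; the `edges` table, its column names, the function `build_query`, and the sample rows are hypothetical and serve only to show how a term, a relationship, and an optional qualifier might be assembled into a parameterized query.

```python
import sqlite3
from typing import Optional


def build_query(subject_term: str, relationship: str, qualifier: Optional[str] = None):
    """Return a parameterized SQL string and its parameters for an edge lookup."""
    sql = "SELECT object FROM edges WHERE subject = ? AND relationship = ?"
    params = [subject_term, relationship]
    if qualifier is not None:
        # The qualifier narrows results to a particular context.
        sql += " AND context = ?"
        params.append(qualifier)
    return sql, params


# Hypothetical edge table standing in for the stored knowledge graph.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edges (subject TEXT, relationship TEXT, object TEXT, context TEXT)"
)
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?, ?)",
    [
        ("lung cancer", "treated_by", "drug A", "stage IV"),
        ("lung cancer", "treated_by", "drug B", "stage I"),
        ("lung cancer", "diagnosed_by", "test C", "stage IV"),
    ],
)

sql, params = build_query("lung cancer", "treated_by", "stage IV")
rows = conn.execute(sql, params).fetchall()
```

Passing the user's entries as bound parameters rather than interpolating them into the query string keeps free-text field values from altering the structure of the generated query.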
[0050] A query may be received in textual form. The query is then parsed (e.g., utilizing an NLP/NER) and analyzed to identify biomedical entities and relationships to be searched in connection with the query. The identified entities and relationships can be identified/correlated with respect to those of the biomedical knowledge graph. The biomedical knowledge graph is searched (e.g., by way of an index of terms) for the identified entities relating to the queried relationships (e.g., a cluster of relationships) between the entities. The search may be narrowed to relationships between the entities that best correspond to the query (e.g., a context identified/resolved from the query text). Results may be further evaluated for conformance with expected standards (e.g., based on a trained adversarial network). The results may be translated/converted into a particular format (e.g., sentence form) and/or used to generate statistical and/or trend analysis (e.g., as illustrated in FIGs. 9A and 9B).
[0051] FIG. 8A is query output from a biomedical knowledge graph in a navigable knowledge graph format according to some embodiments. Numerous results may be generated in response to some queries, particularly broad queries. In order to permit easier navigation of these results, they are generated in a knowledge graph format including the potentially numerous result entities and relationships between them. Because the queried source is originally in a knowledge graph form, generating results in similar form is relatively fast and efficient. In some embodiments, links or citations to original sources (e.g., publications, clinical trial data, images) may be accessible or navigable within a graph-based interface as they pertain to results/relationships identified from the queried biomedical knowledge graph.
[0052] FIG. 8B is a query output from a biomedical knowledge graph in a tabular representation according to some embodiments. A set of results may also be presented in a table or spreadsheet form such as in rows and columns of various result fields. That way, the results themselves may be more easily organized, analyzed, and/or searched. For example, charts or plots may be generated based on tabulated results such as illustrated in FIGs. 9A and 9B.
[0053] FIG. 9A is a trend output from a biomedical knowledge graph according to some embodiments. Based on analyzing results of one or more queries of a biomedical knowledge graph, trends in the result data are generated and presented. For example, a chart of the relative proportions of patients successfully responding to particular therapies can be presented with respect to each other.
[0054] FIG. 9B is a graphical representation of predicted responses for a particular disease or condition in response to particular therapeutic compounds. Predicted responses may be based on machine learning models trained with data accessible in a biomedical knowledge graph. For example, particular compounds may be predicted to access therapeutic pathways or provide outcomes based on historical use of the compound for the treatment of other diseases, similarities in chemical structure to other compounds with positive outcomes, and/or related treatments of similar conditions.
[0055] FIG. 10A is an exemplary process for training and utilizing a machine learning system for predicting alternative paths between therapeutic indicators using a biomedical knowledge graph according to some embodiments. FIG. 10B is an exemplary output of paths from a sequence variant to a drug found within a biomedical knowledge graph according to some embodiments. FIG. 10C is an exemplary output of an alternative path based on the machine learning process of FIG. 10A. The aspects 1010 of the process include obtaining therapeutic indicator pairs from the biomedical knowledge graph and determining all possible alternative paths between the pairs. At 1020, paths between the pairs are selected (e.g., based on a type of path) and the biomedical knowledge graph data is further searched to identify all pairs on the selected paths. At 1030, criteria are used to narrow the identified pairs to those most likely to have a positive therapeutic impact. At 1040, a portion (e.g., 70%) of the narrowed set is used to train a machine learning model to generate responses to queries, generate analytics, and/or clinical predictions (e.g., predicting effective therapies, alternative paths), for example, while another portion (e.g., 30%) is used to test hypotheses generated by the model.
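Two of the steps above, enumerating the paths between a pair and dividing a narrowed set into training and test portions, can be sketched as follows. The adjacency-list graph, the node names, the hop limit, and the fixed random seed are assumptions for illustration; the model training itself is not shown.

```python
import random


def simple_paths(graph, start, goal, max_hops=4):
    """Yield every cycle-free path from start to goal with at most max_hops edges."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            yield path
            continue
        if len(path) - 1 >= max_hops:
            continue
        for neighbor in graph.get(node, ()):
            if neighbor not in path:  # never revisit a node on the same path
                stack.append((neighbor, path + [neighbor]))


def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle items reproducibly and split them into training and test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical directed graph linking a sequence variant to a drug.
graph = {
    "variant": ["gene"],
    "gene": ["protein", "pathway"],
    "protein": ["drug"],
    "pathway": ["drug"],
}
paths = sorted(simple_paths(graph, "variant", "drug"))
train, test = train_test_split(range(10))
```

In this toy graph the variant reaches the drug through either the protein or the pathway, which corresponds to an original path and one alternative path. The hop limit is an added safeguard, since exhaustive path enumeration grows quickly on densely connected graphs.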
[0056] The above-described embodiments of the present disclosure are presented for purposes of illustration and not of limitation, and the present disclosure is limited only by the claims which follow. Furthermore, it should be noted that the features and limitations described in any one embodiment may be applied to any other embodiment herein, and flowcharts or examples relating to one embodiment may be combined with any other embodiment in a suitable manner, done in different orders, or done in parallel. In addition, the systems and methods described herein may be performed in real time. It should also be noted that the systems and/or methods described above may be applied to, or used in accordance with, other systems and/or methods.
[0057] FIG. 11 is an illustrative diagram of computing devices and processing components of a system for generating and utilizing biomedical knowledge graphs according to some embodiments. Any of the computer systems mentioned herein may utilize any suitable number of subsystems. Examples of such subsystems are shown in FIG. 11 in computer system 10. In some embodiments, a computer system includes a single computer apparatus, where the subsystems can be the components of the computer apparatus. In other embodiments, a computer system can include multiple computer apparatuses, each being a subsystem, with internal components. A computer system can include desktop and laptop computers, tablets, mobile phones and other mobile devices. In some embodiments, a cloud infrastructure (e.g., Amazon Web Services), a graphics processing unit (GPU), etc., can be used to implement the disclosed techniques.
[0058] The subsystems shown in FIG. 11 are interconnected via a system bus 75. Additional subsystems such as a printer 74, keyboard 78, storage device(s) 79, monitor 76, which is coupled to display adapter 82, and others are shown. Peripherals and input/output (I/O) devices, which couple to I/O controller 71, can be connected to the computer system by any number of means known in the art such as input/output (I/O) port 77 (e.g., USB, FireWire®). For example, I/O port 77 or external interface 81 (e.g., Ethernet, Wi-Fi, etc.) can be used to connect computer system 10 to a wide area network such as the Internet, a mouse input device, or a scanner. The interconnection via system bus 75 allows the central processor 73 to communicate with each subsystem and to control the execution of a plurality of instructions from system memory 72 or the storage device(s) 79 (e.g., a fixed disk, such as a hard drive, or optical disk), as well as the exchange of information between subsystems. The system memory 72 and/or the storage device(s) 79 may embody a computer readable medium. Another subsystem is a data collection device 85, such as a camera, microphone, accelerometer, and the like. Any of the data mentioned herein can be output from one component to another component and can be output to the user.
[0059] A computer system can include a plurality of the same components or subsystems, e.g., connected together by external interface 81 or by an internal interface. In some embodiments, computer systems, subsystems, or apparatuses can communicate over a network. In such instances, one computer can be considered a client and another computer a server, where each can be part of a same computer system. A client and a server can each include multiple systems, subsystems, or components.
[0060] Aspects of embodiments can be implemented in the form of control logic using hardware (e.g. an application specific integrated circuit or field programmable gate array) and/or using computer software with a generally programmable processor in a modular or integrated manner. As used herein, a processor includes a single-core processor, multi-core processor on a same integrated chip, or multiple processing units on a single circuit board or networked. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will know and appreciate other ways and/or methods to implement embodiments of the present invention using hardware and a combination of hardware and software.
[0061] Any of the software components or functions described in this application may be implemented as software code to be executed by a processor using any suitable computer language such as, for example, Java, C, C++, C#, Objective-C, Swift, or a scripting language such as Perl or Python using, for example, conventional or object-oriented techniques. The software code may be stored as a series of instructions or commands on a computer readable medium for storage and/or transmission. A suitable non-transitory computer readable medium can include random access memory (RAM), a read only memory (ROM), a magnetic medium such as a hard drive or a floppy disk, or an optical medium such as a compact disk (CD) or DVD (digital versatile disk), flash memory, and the like. The computer readable medium may be any combination of such storage or transmission devices.
[0062] Such programs may also be encoded and transmitted using carrier signals adapted for transmission via wired, optical, and/or wireless networks conforming to a variety of protocols, including the Internet. As such, a computer readable medium may be created using a data signal encoded with such programs. Computer readable media encoded with the program code may be packaged with a compatible device or provided separately from other devices (e.g., via Internet download). Any such computer readable medium may reside on or within a single computer product (e.g. a hard drive, a CD, or an entire computer system), and may be present on or within different computer products within a system or network. A computer system may include a monitor, printer, or other suitable display for providing any of the results mentioned herein to a user.
[0063] Any of the methods described herein may be totally or partially performed with a computer system including one or more processors, which can be configured to perform the steps. Thus, embodiments can be directed to computer systems configured to perform the steps of any of the methods described herein, potentially with different components performing a respective step or a respective group of steps. Although presented as numbered steps, steps of methods herein can be performed at a same time or in a different order. Additionally, portions of these steps may be used with portions of other steps from other methods. Also, all or portions of a step may be optional. Additionally, any of the steps of any of the methods can be performed with modules, units, circuits, or other means for performing these steps.
[0064] The specific details of particular embodiments may be combined in any suitable manner without departing from the spirit and scope of embodiments of the invention. However, other embodiments of the invention may be directed to specific embodiments relating to each individual aspect, or specific combinations of these individual aspects.
[0065] The above description of example embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form described, and many modifications and variations are possible in light of the teaching above. A recitation of "a", "an" or "the" is intended to mean "one or more" unless specifically indicated to the contrary. The use of "or" is intended to mean an "inclusive or," and not an "exclusive or," unless specifically indicated to the contrary. Reference to a "first" component does not necessarily require that a second component be provided. Moreover, reference to a "first" or a "second" component does not limit the referenced component to a particular location unless expressly stated.
[0066] All patents, patent applications, publications, and descriptions mentioned herein are incorporated by reference in their entirety. None is admitted to be prior art.

Claims

What is Claimed is:
1. A computer-implemented method for structuring and retrieving biomedical data, the method comprising:
    obtaining biomedical data from one or more data sources;
    extracting, from the biomedical data, a plurality of biomedical entities and biomedical relationships between the entities;
    translating the plurality of entities and relationships according to a predefined schema for a biomedical knowledge graph, the translating comprising:
    determining biomedical entities and entity types in the data by searching for predetermined identifiers or patterns in the data;
    based on the determined biomedical type of the entity, assigning each biomedical entity to a cluster of biomedical entity types;
    identifying a context for each of the identified biomedical entities based on the assigned cluster and based on elements of an expression of the biomedical data within which the entity is expressed;
    based on the identified context and type of the biomedical entities, incorporating records of nodes and connections between nodes into the knowledge graph, the nodes representing biomedical entities and the connections representing biomedical relationships between the entities structured according to the predefined schema;
    receiving a query for biomedical information;
    converting the query into a structured query expression based on the predefined schema;
    generating a query result of biomedical information based on searching through the knowledge graph using the structured query expression.
2. The method of claim 1 wherein the one or more data sources comprises clinical trials data, journal publications, and/or a generalized knowledge graph.
3. The method of claim 1 wherein determining biomedical entities and entity types comprises utilizing a named entity resolution (NER) module.
4. The method of claim 1 wherein identifying a context comprises utilizing a natural language processing (NLP) module.
5. The method of claim 1 wherein the schema comprises entities classified as one of a gene, sequence, anatomic structure, chemical substance, disease, and phenotypic feature.
6. The method of claim 1 wherein the schema comprises relationship types among entities classified as at least one of positively regulating or negatively regulating biological factors.
7. The method of claim 1 wherein the context comprises at least one of a gene sequence, an identification of a biomarker associated with a disease, or a hypothesis within a medical publication.
8. The method of claim 3 wherein the relationships comprise at least one of an action via a receptor, indirect alteration of an effect of an endogenous agonist, an inhibition of transport processes, enzyme inhibition, enzymatic action or activation of enzymatic activity, chelation, osmosis, and/or anesthesia.
9. The method of claim 1 further comprising predicting and reporting biomedical relationships among the biomedical entities using a machine learning model that receives data from the biomedical knowledge graph as input.
10. The method of claim 1 wherein converting the query into a structured query expression comprises utilizing a machine learning based NLP model optimized for structuring the query for searching the biomedical knowledge graph with the predefined schema.
11. A biomedical knowledge graph system, the system comprising:
    a computer database of records, the records comprising nodes of biomedical entities and connections between the entities representing biomedical relationships;
    one or more processors programmed and configured to:
    extract data from a plurality of data sources;
    determine biomedical entities and relationships between the entities based on analyzing the data, wherein analyzing the data comprises searching for predetermined identifiers or patterns in the data;
    based on the determined biomedical entities, assign each biomedical entity to a cluster of biomedical entity types;
    identify a context for each of the identified biomedical entities based on the assigned cluster and based on elements of the expression of the biomedical data within which the entity is expressed;
    based on the identified context and type of the biomedical entity, incorporating records of nodes and connections between nodes into the knowledge graph, the nodes representing biomedical entities and the connections representing biomedical relationships between the entities structured according to a predefined schema.
12. The biomedical knowledge graph system of claim 11, wherein the one or more data sources comprises clinical trials data, journal publications, and/or a generalized knowledge graph.
13. The biomedical knowledge graph system of claim 11 wherein the one or more processors are programmed and configured to: receive a query for biomedical information; convert the query into a structured query expression based on the predefined schema; determine a query result of biomedical information based on searching through the knowledge graph using the structured query expression; generate a query result of biomedical information.
14. The biomedical knowledge graph system of claim 13 wherein converting the query into a structured query expression comprises utilizing a machine learning based NLP model optimized for structuring the query for searching the biomedical knowledge graph with the predefined schema.
15. The biomedical knowledge graph system of claim 11 wherein determining biomedical entities and entity types comprises utilizing a named entity resolution (NER) module.
16. The biomedical knowledge graph system of claim 11 wherein determining biomedical entities and entity types comprises utilizing a named entity resolution (NER) module, and wherein identifying a context comprises utilizing a natural language processing (NLP) module.
17. The biomedical knowledge graph system of claim 11 wherein the schema comprises entities classified as one of a gene, sequence, anatomic structure, chemical substance, disease, and phenotypic feature.
18. The biomedical knowledge graph system of claim 11 wherein the schema comprises relationship types among entities classified as at least one of positively regulating or negatively regulating biological factors.
19. The biomedical knowledge graph system of claim 11 wherein the context comprises at least one of an identification of a biomarker associated with a disease, a gene sequence, or a hypothesis within a medical publication.
20. The biomedical knowledge graph system of claim 11 wherein the relationships comprise at least one of an action via a receptor, indirect alteration of an effect of an endogenous agonist, an inhibition of transport processes, enzyme inhibition, enzymatic action or activation of enzymatic activity, chelation, osmosis, and/or anesthesia.
21. The biomedical knowledge graph system of claim 11 wherein the one or more processors are programmed and configured to: predict and report biomedical relationships among the biomedical entities using a machine learning model trained with data from the biomedical knowledge graph.
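By way of illustration only, the overall flow recited in claim 1 (pattern-based entity detection, assignment to a cluster, incorporation of nodes and connections, conversion of a query into a structured expression, and search) can be sketched in Python as follows. Everything named here is a hypothetical stand-in: the regular expressions in `ENTITY_PATTERNS`, the cluster labels in `TYPE_TO_CLUSTER`, the cue verbs in `RELATION_CUES`, the toy sentences, and the helper functions. Context identification is reduced to matching a cue verb within the same sentence, and the query conversion is a fixed pattern rather than the machine learning based NLP model of claim 10. The sketch does not limit or interpret the claims.

```python
import re

# Hypothetical "predetermined identifiers or patterns" for three entity types.
ENTITY_PATTERNS = {
    "gene": re.compile(r"\b(?:BRAF|EGFR|KRAS)\b"),
    "chemical_substance": re.compile(r"\b[a-z]+nib\b"),
    "disease": re.compile(r"\b(?:melanoma|carcinoma)\b"),
}
# Hypothetical clusters of entity types.
TYPE_TO_CLUSTER = {
    "gene": "molecular",
    "chemical_substance": "therapeutic",
    "disease": "clinical",
}
# Hypothetical cue verbs mapped to schema relationship types.
RELATION_CUES = {
    "inhibits": "negatively_regulates",
    "activates": "positively_regulates",
}


def extract_entities(text):
    """Return {entity_text: entity_type} for every pattern match in the text."""
    found = {}
    for entity_type, pattern in ENTITY_PATTERNS.items():
        for match in pattern.finditer(text):
            found[match.group(0)] = entity_type
    return found


def build_graph(sentences):
    """Build node records and (source, relation, target) edges from sentences."""
    nodes, edges = {}, set()
    for sentence in sentences:
        entities = extract_entities(sentence)
        for name, entity_type in entities.items():
            nodes[name] = {"type": entity_type, "cluster": TYPE_TO_CLUSTER[entity_type]}
        for cue, relation in RELATION_CUES.items():
            m = re.search(rf"(\w+) {cue} (\w+)", sentence)
            if m and m.group(1) in entities and m.group(2) in entities:
                edges.add((m.group(1), relation, m.group(2)))
    return nodes, edges


def to_structured_query(question):
    """Convert e.g. 'What inhibits BRAF?' to a dict; return None if not understood."""
    m = re.match(r"What (\w+) (\w+)\?$", question.strip())
    if not m or m.group(1) not in RELATION_CUES:
        return None
    return {"relation": RELATION_CUES[m.group(1)], "target": m.group(2)}


def run_query(edges, query):
    """Return the sorted sources of edges matching the structured query."""
    return sorted(
        src for src, rel, dst in edges
        if rel == query["relation"] and dst == query["target"]
    )


# Toy input sentences written for this sketch.
sentences = ["vemurafenib inhibits BRAF in melanoma", "EGFR activates KRAS"]
nodes, edges = build_graph(sentences)
query = to_structured_query("What inhibits BRAF?")
answer = run_query(edges, query)
print(answer)
```

A production system would replace the regular expressions with curated vocabularies and trained NER/NLP modules, and the in-memory sets with a graph database queried through its own query language.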
PCT/US2023/070822 2022-07-26 2023-07-24 Biomedical knowledge graph Ceased WO2024026259A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US19/037,080 US20250246318A1 (en) 2022-07-26 2025-01-25 Biomedical knowledge graph

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202263369470P 2022-07-26 2022-07-26
US63/369,470 2022-07-26
US202263379010P 2022-10-11 2022-10-11
US63/379,010 2022-10-11

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US19/037,080 Continuation-In-Part US20250246318A1 (en) 2022-07-26 2025-01-25 Biomedical knowledge graph

Publications (1)

Publication Number Publication Date
WO2024026259A1 true WO2024026259A1 (en) 2024-02-01

Family

ID=87571639

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/070822 Ceased WO2024026259A1 (en) 2022-07-26 2023-07-24 Biomedical knowledge graph

Country Status (1)

Country Link
WO (1) WO2024026259A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117995426A (en) * 2024-04-07 2024-05-07 北京惠每云科技有限公司 Medical knowledge graph construction method and device, electronic equipment and storage medium
CN119545610A (en) * 2024-11-04 2025-02-28 黄山学院 An intelligent lighting control method and system for a smart nursing home

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190392074A1 (en) * 2018-06-21 2019-12-26 LeapAnalysis Inc. Scalable capturing, modeling and reasoning over complex types of data for high level analysis applications

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DÖRPINGHAUS JENS ET AL: "Context mining and graph queries on giant biomedical knowledge graphs", KNOWLEDGE AND INFORMATION SYSTEMS, SPRINGER VERLAG,LONDON, GB, vol. 64, no. 5, 29 March 2022 (2022-03-29), pages 1239 - 1262, XP037824025, ISSN: 0219-1377, [retrieved on 20220329], DOI: 10.1007/S10115-022-01668-7 *
PERCHA B., ALTMAN RB.: "A global network of biomedical relationships derived from text", BIOINFORMATICS, vol. 34, no. 15, 1 August 2018 (2018-08-01), pages 2614 - 2624

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 23754975; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: pct application non-entry in european phase. Ref document number: 23754975; Country of ref document: EP; Kind code of ref document: A1