WO2021030193A1 - System and method for classifying genomic data - Google Patents
- Publication number
- WO2021030193A1 (PCT/US2020/045421)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- model
- biological
- dataspace
- models
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
Definitions
- the field of the invention is systems and methods for classifying biological data using an ensemble of tissue models to predict the origin of neoplastic tissue.
- the numbers expressing quantities of ingredients, properties such as concentration, reaction conditions, and so forth, used to describe and claim certain embodiments of the inventive subject matter are to be understood as being modified in some instances by the term “about.” Accordingly, in some embodiments, the numerical parameters set forth in the written description and attached claims are approximations that can vary depending upon the desired properties sought to be obtained by a particular embodiment. In some embodiments, the numerical parameters should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of some embodiments of the inventive subject matter are approximations, the numerical values set forth in the specific examples are reported as precisely as practicable. The numerical values presented in some embodiments of the inventive subject matter may contain certain errors necessarily resulting from the standard deviation found in their respective testing measurements.
- Cancer is a disease of altered tissue growth regulation, and treatments for different types of cancers are often determined by a presumed origin of the tumor. However, it can be difficult to determine the presumed origin (or origins) of a tumor and improved methods and systems are needed to predict the origin(s).
- embodiments of the present disclosure relate to data classification methods and systems.
- One such non-limiting embodiment of the present invention is a method that includes receiving a neoplastic tissue sample and preparing a biological data profile of the neoplastic tissue sample.
- the biological data profile is input into a model dataspace comprising models having two or more dimensions and comprising data characteristic of multiple neoplastic tissue types.
- the method further includes comparing the biological data profile with each model within the model dataspace to determine a best fit of a model in the model dataspace with the biological data profile. The best fit can be communicated to a user device.
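As a rough illustration of the comparison step, the sketch below scores a biological data profile against every model in a small model dataspace and reports the best fit. The centroid-style models, the negative Euclidean distance score, and the names `score`, `best_fit`, and `dataspace` are assumptions made for illustration only; the disclosure does not specify this scoring rule.

```python
import math


def score(profile, centroid):
    """Higher is better: negative Euclidean distance to the model centroid."""
    return -math.dist(profile, centroid)


def best_fit(profile, dataspace):
    """Compare the profile with each model and return (best_label, all_scores)."""
    scores = {label: score(profile, centroid) for label, centroid in dataspace.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical three-dimensional models, one per neoplastic tissue type.
dataspace = {
    "colorectal": (1.0, 0.0, 0.0),
    "brain": (0.0, 1.0, 0.0),
    "breast_basal": (0.0, 0.0, 1.0),
}

label, scores = best_fit((0.9, 0.2, 0.1), dataspace)
print(label, scores)
```

Returning every score alongside the winning label keeps the runner-up fits available, which may be useful when more than one origin is plausible.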
- Another embodiment of the invention is a system that includes a prediction server for receiving input for processing biological data.
- the server comprises a microprocessor and a computer-readable medium coupled thereto, and the microprocessor receives instructions from the computer-readable medium.
- the microprocessor is programmed to prepare a biological data profile of a neoplastic tissue sample and to input the biological data profile into a model dataspace having models having two or more dimensions and including data characteristic of multiple neoplastic tissue types.
- the microprocessor is also programmed to compare the biological data profile with each model within the model dataspace to determine a best fit of a model in the model dataspace with the biological data profile.
- the microprocessor is further programmed to communicate the best fit to a user device.
- a further embodiment of the invention is a method for treatment of a patient having neoplastic tissue, such as a tumor or other tissue growth.
- Medical professionals working with patients having such conditions need to identify the nature of the tissue to make an accurate diagnosis and determine an appropriate course of treatment. For example, a medical professional must determine whether the tissue growth is malignant or benign, and for example, in the case of metastatic cancer, the site of origin of the cancer. Correctly identifying the nature of the tissue is critical to correctly diagnosing the condition of the patient and recommending an appropriate treatment.
- the method includes analyzing a neoplastic tissue sample from the patient to obtain biological sample data which is input into a model dataspace having models having two or more dimensions and comprising data characteristic of multiple neoplastic tissue types.
- the method further includes identifying the neoplastic tissue sample as the neoplastic tissue type with which the biological sample data of the neoplastic tissue sample has the best fit in the model dataspace and treating the patient with a treatment suitable to the identified neoplastic tissue type.
- the transcriptional profile data is maintained in a normalized data space.
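One possible way to place transcriptional profiles in a normalized data space is sketched below. The disclosure does not say which normalization is used; the log2(x + 1) transform followed by a per-gene z-score, and the function names, are assumptions chosen because they are common for expression data.

```python
import math
import statistics


def log_transform(counts):
    """Compress the dynamic range of raw expression values with log2(x + 1)."""
    return [math.log2(c + 1.0) for c in counts]


def zscore_per_gene(samples):
    """Scale each gene (column) to zero mean and unit variance across samples."""
    n_genes = len(samples[0])
    columns = [[s[g] for s in samples] for g in range(n_genes)]
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for genes that never vary.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(s[g] - means[g]) / stdevs[g] for g in range(n_genes)]
        for s in samples
    ]


# Hypothetical raw counts: 3 samples (rows) x 3 genes (columns).
raw = [[0, 7, 3], [3, 7, 15], [15, 7, 63]]
logged = [log_transform(row) for row in raw]
normalized = zscore_per_gene(logged)
print(normalized)
```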
- the models have five or fewer dimensions, such as having two or three dimensions.
- the model dataspace is built by a support vector machine (SVM).
- the model dataspace can be prepared by T-distributed Stochastic Neighbor Embedding (t-SNE).
- some embodiments include the biological data profile comprising RNA sequence data.
- two or more of the models can be built using one or more support vector machines. Such embodiments can further include retraining one or more of the models by comparing training data and the training data can include a set of biological training data, wherein a subset of the biological training data is used in the retraining.
- the subset can be a set having a best fit of the subset of biological training data determined using at least one of the support vector machines.
- the subset can be a set of most frequently varying genes.
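A minimal sketch of selecting such a subset follows. It assumes that "most frequently varying" can be approximated by ranking genes on their variance across samples; that reading, the function name `most_varying_genes`, and the gene names are illustrative assumptions rather than details from the disclosure.

```python
import statistics


def most_varying_genes(expression, gene_names, top_k):
    """Return the top_k gene names ranked by variance across samples."""
    variances = {
        name: statistics.pvariance([sample[i] for sample in expression])
        for i, name in enumerate(gene_names)
    }
    ranked = sorted(gene_names, key=lambda name: variances[name], reverse=True)
    return ranked[:top_k]


# Hypothetical normalized expression values: rows are samples, columns are genes.
gene_names = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
expression = [
    [5.0, 1.0, 0.0, 2.0],
    [5.1, 4.0, 9.0, 2.0],
    [4.9, 7.0, 0.5, 2.0],
]

selected = most_varying_genes(expression, gene_names, top_k=2)
print(selected)
```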
- each of the expressions “at least one of A, B and C,” “at least one of A, B, or C,” “one or more of A, B, and C,” “one or more of A, B, or C” and “A, B, and/or C” means A alone, B alone,
- A, B, and C in the above expressions refers to an element, such as X, Y, and Z, or class of elements, such as X1-Xn, Y1-Ym, and Z1-Zo
- the phrase is intended to refer to a single element selected from X, Y, and Z, a combination of elements selected from the same class (e.g., X1 and X2) as well as a combination of elements selected from two or more classes (e.g., Y1 and Zo).
- Fig. 1 is a block diagram view of a system in accordance with at least some embodiments of the present disclosure.
- Fig. 2 is a block diagram view of additional details of a system in accordance with at least some embodiments of the present disclosure.
- Fig. 3 is a flow chart view of a method of building a prediction model in accordance with at least some embodiments of the present disclosure.
- Fig. 4 is a flow chart view of a method of determining a tissue prediction in accordance with at least some embodiments of the present disclosure.
- Fig. 5 is a flow chart view of a method of retraining a model in accordance with at least some embodiments of the present disclosure.
- Fig. 6 is a schematic that outlines the process for the validation and performance estimation of provided methodology and tools.
- Figs. 7A, 7B, and 7C are Venn diagrams that show the overlap of tumors with cancer type and/or ICD10 annotations.
- Figs. 8A and 8B show the cancer types used for training.
- Fig. 8A shows cancer types in the FFPE set.
- Fig. 8B shows cancer types in the TCGA set.
- Fig. 9 shows the cancer types that were used for training and cancer types that were not used.
- the cancer types to the left of the dashed line were used, while those to the right were not used.
- the lower bars represent the tumors from the FFPE set, and the upper bars represent tumors from TCGA. Above the bars, the bottom number is the number of FFPE samples, and the top number is the number of TCGA samples.
- Fig. 10 shows the categories of cancer types on which the model predicts. Those categories that do not have enough validation samples are marked with an asterisk. Beside the bars, the left number is the number of FFPE samples, and the right number is the number of TCGA samples.
- Fig. 11 shows the confusion matrix summarizing the results for all predictions (both high and low confidence).
- the true labels are on the x-axis and the predicted labels are on the y-axis.
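For readers unfamiliar with the layout, a confusion matrix with true labels on the x-axis and predicted labels on the y-axis can be tallied as sketched below. The tissue names and label lists are made-up illustration data, not results from the disclosure.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels, tissues):
    """Rows are predicted labels (y-axis); columns are true labels (x-axis)."""
    counts = Counter(zip(predicted_labels, true_labels))
    return [[counts[(pred, true)] for true in tissues] for pred in tissues]


tissues = ["brain", "breast", "colorectal"]
true_labels = ["brain", "brain", "breast", "colorectal", "colorectal", "colorectal"]
predicted = ["brain", "breast", "breast", "colorectal", "colorectal", "brain"]

matrix = confusion_matrix(true_labels, predicted, tissues)
for tissue, row in zip(tissues, matrix):
    print(f"{tissue:>10}", row)
```

Correct predictions fall on the diagonal, so off-diagonal entries in a column show which tissues a given true type is mistaken for.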
- Figs. 12A and 12B show the Per-Tissue Accuracy (Fig. 12A) and PPV (Fig. 12B) of FFPE samples. The points in the plots represent point estimates of each metric.
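The two per-tissue metrics can be computed from paired true and predicted labels as sketched below. The sketch assumes that per-tissue accuracy means the fraction of samples of a given true tissue that were predicted correctly, and that PPV is the fraction of predictions of a tissue that were correct; the first definition is an assumption, and the label lists are made-up illustration data.

```python
def per_tissue_metrics(true_labels, predicted_labels):
    """Return {tissue: (accuracy, ppv)}; a metric is None when undefined."""
    pairs = list(zip(true_labels, predicted_labels))
    metrics = {}
    for tissue in sorted(set(true_labels) | set(predicted_labels)):
        true_pos = sum(1 for t, p in pairs if t == tissue and p == tissue)
        n_true = sum(1 for t, _ in pairs if t == tissue)
        n_pred = sum(1 for _, p in pairs if p == tissue)
        accuracy = true_pos / n_true if n_true else None
        ppv = true_pos / n_pred if n_pred else None
        metrics[tissue] = (accuracy, ppv)
    return metrics


true_labels = ["brain", "brain", "breast", "colorectal", "colorectal", "colorectal"]
predicted = ["brain", "breast", "breast", "colorectal", "colorectal", "brain"]

metrics = per_tissue_metrics(true_labels, predicted)
for tissue, (accuracy, ppv) in metrics.items():
    print(tissue, accuracy, ppv)
```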
- Fig. 13 shows the confusion matrix summarizing the results for all predictions (both high and low confidence). The true labels are on the x-axis and the predicted labels are on the y-axis.
- Figs. 14A and 14B show the Per-Tissue Accuracy (Fig. 14A) and PPV (Fig. 14B) of TCGA samples. The points in the plots represent point estimates of each metric. Confidence intervals indicate 95th percentile binomial distribution confidence interval.
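A binomial confidence interval around a per-tissue point estimate can be computed in several ways, and the disclosure does not say which one was used. The sketch below uses the Wilson score interval as one reasonable assumption; the function name `wilson_interval` is hypothetical.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, trials, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials))
    ) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(50, 100)
print(round(low, 4), round(high, 4))
```

Unlike the simple normal approximation, the Wilson interval stays within [0, 1] (up to floating-point rounding) and behaves sensibly for tissues with few validation samples.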
- Fig. 16 shows the confusion matrix for the comparison of true labels to the predicted labels.
- the true labels are on the x-axis and the predicted labels are on the y-axis.
- Fig. 17 shows the summary of the comparison of the predicted labels to true labels after applying the prediction model to 2075 TCGA validation samples.
- the true labels are on the x-axis and the predicted labels are on the y-axis.
- Fig. 18 shows the confusion matrix for the comparison of true labels to the predicted labels for certain FFPE samples.
- the true labels are on the x-axis and the predicted labels are on the y-axis.
- Figs. 19A, 19B, and 19C show a successful prediction by the model of Example 1.
- Fig. 19A shows the individual samples grouped by similarity, which separates into tissue of origin clusters as indicated by labelled tissue types. The patient’s tumor, in the Colorectal cluster, is denoted with a star, and the most similar tumors are denoted with large circles, as indicated in the inset box.
- Fig. 19B shows the molecularly most similar tumors in TCGA dataset (clinical samples are not displayed on the report for privacy reasons).
- Fig. 19C shows the distributions of true and false positives and negatives, with the patient’s tumor’s score indicated by the dashed line.
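The idea of highlighting the tumors most similar to a patient's tumor can be sketched as a nearest-neighbor lookup in the embedded space. The Euclidean distance, the two-dimensional coordinates, and the sample identifiers below are illustrative assumptions; the disclosure does not state how similarity is measured for the figure.

```python
import math


def most_similar(patient_point, reference_points, k=3):
    """Return the k reference samples closest to the patient's tumor."""
    ranked = sorted(
        reference_points,
        key=lambda ref: math.dist(patient_point, ref["coords"]),
    )
    return ranked[:k]


# Hypothetical reference tumors with 2-D embedding coordinates.
reference = [
    {"id": "ref-1", "tissue": "colorectal", "coords": (1.0, 1.0)},
    {"id": "ref-2", "tissue": "colorectal", "coords": (1.2, 0.9)},
    {"id": "ref-3", "tissue": "brain", "coords": (8.0, 7.5)},
    {"id": "ref-4", "tissue": "breast_basal", "coords": (-5.0, 4.0)},
]

neighbors = most_similar((1.1, 1.1), reference, k=2)
print([n["id"] for n in neighbors])
```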
- Fig. 20 shows a successful prediction by the model of Example 1.
- the individual samples are grouped by similarity, which separates into tissue of origin clusters as indicated by labelled tissue types.
- the patient’s tumor, in the Brain cluster, is denoted with a star, and the most similar tumors are denoted with large circles, as indicated in the inset box.
- Fig. 21 shows a prediction by the model of Example 1.
- the individual samples are grouped by similarity, which separates into tissue of origin clusters as indicated by labelled tissue types.
- the patient’s tumor, in the Breast Basal cluster, is denoted with a star, and the most similar tumors are denoted with large circles, as indicated in the inset box.
- any language directed to a computer should be read to include any suitable combination of computing devices, including servers, interfaces, systems, databases, agents, peers, engines, controllers, modules, or other types of computing devices operating individually or collectively.
- the computing devices comprise a processor configured to execute software instructions stored on a tangible, non-transitory computer readable storage medium (e.g., hard drive, FPGA, PLA, solid state drive, RAM, flash, ROM, etc.).
- the software instructions configure or program the computing device to provide the roles, responsibilities, or other functionality as discussed below with respect to the disclosed apparatus.
- the disclosed technologies can be embodied as a computer program product that includes a non-transitory computer readable medium storing the software instructions that cause a processor to execute the disclosed steps associated with implementations of computer-based algorithms, processes, methods, or other instructions.
- the various servers, systems, databases, or interfaces exchange data using standardized protocols or algorithms, possibly based on HTTP, HTTPS, AES, public-private key exchanges, web service APIs, known financial transaction protocols, or other electronic information exchanging methods.
- Data exchanges among devices can be conducted over a packet-switched network, the Internet, LAN, WAN, VPN, or other type of packet switched network; a circuit switched network; cell switched network; or other type of network.
- the disclosed techniques provide many advantageous technical effects including the analysis of biological characteristics to create various models of biological tissue types, where the models can predict origins of cancer within the body.
- the models may be within a model dataspace allowing multi-dimensional matching of data within one or more models.
- the computer modeling of the data may advantageously provide the ability to analyze and use data in methods and systems that were not previously available.
- advantageous technical effects include the ability of systems and methods of the invention to more accurately predict one or more origins of neoplastic tissue in a biological sample from a patient and to more easily visualize the relationship between cancers of known origin and a sample tissue of a patient. This knowledge allows the patient to be more effectively treated because the effectiveness of cancer treatments can be dependent on the origin of cancer within the body.
- the focus of the disclosed inventive subject matter is to enable construction or configuration of a computing device to operate on vast quantities of digital data, beyond the capabilities of a human.
- the digital data represents the biological characteristics of a patient or patient tissue
- the digital data is a representation of one or more digital models of the biological characteristics of a patient or patient tissue, not the biological characteristics of a patient or patient tissue themselves.
- inventive subject matter provides many example embodiments of the inventive subject matter. Although each embodiment represents a single combination of inventive elements, the inventive subject matter is considered to include all possible combinations of the disclosed elements. Thus if one embodiment comprises elements A, B, and C, and a second embodiment comprises elements B and D, then the inventive subject matter is also considered to include other remaining combinations of A, B, C, or D, even if not explicitly disclosed.
- Coupled to is intended to include both direct coupling (in which two elements that are coupled to each other contact each other) and indirect coupling (in which at least one additional element is located between the two elements). Therefore, the terms “coupled to” and “coupled with” are used synonymously.
- illustrative systems 100, 200 will be described in accordance with at least some embodiments of the present disclosure.
- the systems 100, 200 may include one or more computing devices operating in cooperation with one another to provide classification of biological data using an ensemble of tissue models.
- the components of the system 100, 200 may be utilized to facilitate one, some, or all of the methods described herein or portions thereof without departing from the scope of the present disclosure.
- a server is depicted as including particular components or instruction sets, it should be appreciated that embodiments of the present disclosure are not so limited. For instance, although a single server may be provided with all of the instruction sets depicted and described in the server of Fig. 1, various instruction sets may reside in multiple servers. Alternatively, different instruction sets may exist other than those depicted in Fig. 1.
- the systems 100, 200 are shown to include a communication network 104 that facilitates machine-to-machine communications between server 116 and one or more other devices.
- the system 100 is shown to include a prediction server 116.
- the system 200 is shown to include a prediction server 116 that communicates with a client device 204.
- the communication network 104 may comprise any type of known communication medium or collection of communication media and may use any type of protocols to transport messages between endpoints.
- the communication network 104 may include wired and/or wireless communication technologies.
- the Internet is an example of the communication network 104 that constitutes an Internet Protocol (IP) network consisting of many computers, computing networks, and other communication devices located all over the world, which are connected through many telephone systems and other means.
- other examples of the communication network 104 include, without limitation, a standard Plain Old Telephone System (POTS), an Integrated Services Digital Network (ISDN), the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Session Initiation Protocol (SIP) network, a Voice over Internet Protocol (VoIP) network, a cellular network, and any other type of packet-switched or circuit-switched network known in the art.
- the communication network 104 need not be limited to any one network type, and instead may be comprised of a number of different networks and/or network types.
- the communication network 104 may comprise a number of different communication media such as coaxial cable, copper
- the client device 204 may correspond to any type of computing resource that includes a processor, computer memory, and a user interface.
- the client device 204 may also include one or more network interfaces that connect the client device 204 to the communication network 104 and enable the client device 204 to send/receive packets via the communication network 104.
- Non-limiting examples of client devices 204 include personal computers, laptops, mobile phones, smart phones, tablets, etc.
- the client device 204 is configured to be used by and/or carried by a user 208. As will be discussed in further detail herein, the user 208 may utilize a client device 204 to receive and/or view various outputs of the prediction server 116.
- the prediction server 116 or components thereof may be provided as a single server or in a cloud-computing environment.
- the prediction server 116 may be configured to execute one or multiple different types of instruction sets.
- the prediction server 116 may be configured to execute instruction sets in connection with processing patient data (i.e., biological sample data) received from a patient data source 156 and transforming the patient data into biological sample data that is useable by the prediction server 116.
- the biological sample data received from the patient data source 156 may include data relating to a biological tissue and in particular, neoplastic tissue, including tissue exhibiting dysplasia or hyperplasia, benign tumors and malignant tumors.
- the prediction server 116 may be configured to classify the biological sample data using one or more data space(s) in which the prediction server 116 is configured to process data. In this way, the prediction server 116 can transform the biological sample data into data that comprises a format necessary for further processing by the prediction server 116.
- Biological sample data can refer to genes; nucleic acid molecules (DNA or RNA), including sequence information; RNA polymerase levels and/or activity; RNA processing; proteins, including amino acid sequence information and/or three-dimensional structure and/or post-translational modifications; organelles; cells; cellular structures; cell signaling (including chemical and receptor signaling); cell cycle information; organs; and organisms.
- Biological sample data may be obtained from The Cancer Genome Atlas (TCGA).
- Biological sample data can include information regarding different states of a biological or chemical entity, for example, information regarding an unmodified protein as compared to phosphorylated protein or a free base form of a drug as compared to a salt of the drug.
- Biological sample data can also include any “omics” data, including genomics, transcriptomics, proteomics or metabolomics.
- the prediction server 116 is shown to include a processor 120, memory 124, and network interface 128.
- the prediction server 116 is also shown to include a database interface 152, which may be provided as a physical set of database links and drivers. Alternatively or additionally, the database interface 152 may be provided as one or more instruction sets in memory 124 that enable the processor 120 of the prediction server 116 to interact with the databases 156, 157, 159.
- the network interface 128 provides the server 116 with the ability to send and receive communication packets over the communication network 104.
- the network interface 128 may be provided as a network interface card (NIC), a network port, drivers for the same, and the like. Communications between the components of the server 116 and other devices connected to the communication network 104 may all flow through the network interface 128.
- the processor 120 may correspond to one or many computer processing devices.
- the processor 120 may be provided as silicon, as a Field Programmable Gate Array (FPGA), an Application-Specific Integrated Circuit (ASIC), any other type of Integrated Circuit (IC) chip, a collection of IC chips, or the like.
- the processor 120 may be provided as a microprocessor, Central Processing Unit (CPU), or plurality of microprocessors that are configured to execute the instruction sets stored in memory 124. Upon executing the instruction sets stored in memory 124, the processor 120 enables various functions of the prediction server 116.
- the memory 124 may include any type of computer memory device or collection of computer memory devices. Non-limiting examples of memory 124 include Random Access Memory (RAM), Read Only Memory (ROM), flash memory, Electrically Erasable Programmable ROM (EEPROM), Dynamic RAM (DRAM), etc.
- the memory 124 may be configured to store the instruction sets depicted in addition to temporarily storing data for the processor 120 to execute various types of routines or functions. Although not depicted, the memory 124 may include instructions that enable the processor 120 to store or retrieve data from the databases 156, 157, 159.
- the memory 124 may include instructions that enable the prediction server 116 to process various types of data (for example, use training data from the training data database 157 to create (e.g., train) or retrain prediction models 142, use patient data from the patient data database 156 to provide predictions using the prediction models 142, etc.).
- the illustrative instruction sets that may be stored in memory 124 include, without limitation, data organization instructions 136, an inference engine 144, a training engine 146, a prediction engine 134, arbitration instructions 148, and verification instructions 147.
- the patient data source 156 stores biological sample data of patient tissue samples.
- the patient data database 156 can store RNA sequencing data and transcription/expression data, mRNA data, DNA data, and protein data, among others.
- the patient sample data corresponds to textual data.
- the training data source 157 (also referred to herein as the training data database) stores biological data used to train or retrain the prediction models 142.
- the training data database 157 stores biological sample data.
- Such data can include, for example, RNA expression data, RNA expression labels, clinical data, and The Cancer Genome Atlas (TCGA) data, among others.
- the inference engine 144 when executed by the processor 120, enables the prediction server 116 to scan and analyze the biological sample data (e.g., received from the patient data source 156 and/or the training data source 157) and, if necessary, manipulate the data or obtain additional biological data.
- the inference engine 144 may obtain RNA sequencing data and/or one or more RNA expression (e.g., transcription) profiles and/or expression levels related to various genes within biological sample data.
- the expression profiles and/or expression levels may be inferred for each gene in a sample (where the sample may be either sample data from the training data in training data database 157, or sample data from the patient data database 156).
- the inference engine 144 may access a data source to obtain the genetic material of other cells in the tissue sample or a different sample from a patient in order to obtain a full transcriptional profile of the biological sample data.
- the prediction server 116 may access a data source or instructions to generate (e.g., convert the data into) data types that are necessary for use with various processes of the prediction server 116, such as for use with the prediction models 142.
- the inference engine 144 enables the prediction server 116 to measure the relative activity of previously identified target genes within the biological sample data by performing expression profiling of the sample data.
- the inference engine 144 may be configured to automatically scan the text of the biological sample data and extract relevant data from within the biological sample data.
- the biological data and inferred expression levels may be stored on the training data source 157 (e.g., for training sample data) and on the patient data source 156 (e.g., for patient sample data).
- the inferred expression data may be compared to other training data (e.g., previous clinical data and TCGA data (the cancer genome atlas, also referred to as “genome atlas” herein)) in order to build the prediction models 142 (e.g., model A 142a through model N 142n).
- the training engine 146 when executed by the processor 120, may enable the prediction server 116 to train the prediction models 142 by comparing various training data with expression profiles.
- the training data and the expression profiles may be obtained from the training data database 157. Any amounts or types of training data may be compared with any of the expression profiles and/or inferred expression levels to create the prediction models 142.
- biological data from the training data database 157 is used to build several support vector machines (SVMs) for classification of biological sample data. Then, the training engine 146 compares data (e.g., from the training data database 157) against each of the SVMs to determine which genes should be used to build each of the prediction models 142. In some embodiments, the most frequently varying genes may be used to build the prediction models 142.
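To make the idea of building several SVMs concrete, the sketch below trains one bias-free linear SVM per tissue type in a one-vs-rest arrangement, using plain subgradient descent on the hinge loss. The one-vs-rest layout, the optimizer, the hyperparameters, and the toy two-gene data are all assumptions made for illustration; the disclosure does not describe the SVM configuration, and a production system would more likely rely on an established SVM library.

```python
def train_linear_svm(samples, labels, epochs=50, learning_rate=0.1, reg=0.01):
    """Fit a bias-free linear SVM (hinge loss, L2 penalty); labels are +1 or -1."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * sum(w * xi for w, xi in zip(weights, x))
            # L2 regularization shrinks the weights on every step.
            weights = [w * (1.0 - learning_rate * reg) for w in weights]
            if margin < 1.0:
                # Margin violated: move the weights toward classifying x correctly.
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
    return weights


def train_one_vs_rest(samples, tissues):
    """Train one SVM per tissue type, treating all other tissues as negatives."""
    models = {}
    for tissue in sorted(set(tissues)):
        labels = [1 if t == tissue else -1 for t in tissues]
        models[tissue] = train_linear_svm(samples, labels)
    return models


def decision_scores(models, profile):
    """Signed score of the profile under each tissue model."""
    return {
        tissue: sum(w * xi for w, xi in zip(weights, profile))
        for tissue, weights in models.items()
    }


# Hypothetical, centered two-gene expression profiles.
samples = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-3.0, -1.0)]
tissues = ["colorectal", "colorectal", "brain", "brain"]

models = train_one_vs_rest(samples, tissues)
scores = decision_scores(models, (2.5, 0.5))
best = max(scores, key=scores.get)
print(best, scores)
```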
- the training engine 146 when executed by the processor 120, may also be used to retrain any of the prediction models 142. Any criteria or data may be used to determine that retraining should be done. Retraining any of the prediction models 142 may be done using batches of data (e.g., from the training data database 157) or using hard user input.
- the retraining instructions may be configured to perform various tasks related to retraining the models, including but not limited to: determining whether retraining the models is necessary, suggesting retraining the models, providing suggested updates to the models, and retraining the models, if necessary.
- the prediction models 142 may be based on different tissue types. In various embodiments there may be a single prediction model for each tissue type. For example, there may be twenty-five different tissue types and each of these twenty-five types may have a corresponding prediction model within the prediction models 142. In some embodiments, the prediction models 142 may be three-dimensional models occupying a single data space. In some embodiments, some of the prediction models 142 may occupy one data space while others occupy one or more different data spaces. Although the individual prediction models contain separate data (e.g., based on different tissue types), the data from the models may overlap within the data space. In various embodiments, a data space may be processed using visualization techniques to improve the visualization of the data.
- a T-distributed Stochastic Neighbor Embedding (t-SNE) technique may be applied to one or more of the data spaces to improve visualization of the one or more prediction models 142 by reducing the dimensionality of the modeled data.
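For reference, the standard t-SNE formulation (as published by van der Maaten and Hinton, not taken from this disclosure) can be summarized as follows. Here $x_i$ are the $N$ high-dimensional data points, $y_i$ their low-dimensional counterparts, and $\sigma_i$ a per-point bandwidth set from a chosen perplexity.

```latex
% Pairwise similarities in the original high-dimensional space
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Pairwise similarities in the low-dimensional embedding (Student t, one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Objective minimized over the embedding coordinates y_i
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Minimizing $C$ keeps points that are close in the original space close in the embedding, which is why samples of the same tissue type tend to appear as visible clusters in two or three dimensions.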
- the prediction engine 134 when executed by the processor 120, may enable the prediction server 116 to compare various patient data 156 (e.g., genetic data from one or more tissue samples) with one or more of the prediction models 142 (e.g., model A 142a through model N 142n) to provide predictions for the patient data.
- some or all of the patient data 156 is input to each of the models 142 to obtain prediction results (also referred to herein as prediction data), which the prediction engine 134 stores in the prediction data database 159.
- the prediction engine 134 obtains the expression data (e.g., transcription data or expression profiles from the genetic data from the patient data source 156, which may be inferred data provided by the inference engine 144) for one or more genes in a tissue sample and the prediction engine 134 compares the expression data to one or more of the prediction models 142.
- the prediction engine 134 matches the expression data from a tissue sample from the patient data database 156 to each of the prediction models 142 (i.e., each of model A 142a through model N 142n).
- the prediction engine 134 may obtain predictions for the samples from the comparisons of expression levels with the prediction models 142, and store the predictions on the prediction data database 159.
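The flow described above (score one expression profile against every per-tissue model, then store the results) can be sketched as follows. This is a minimal illustration only: the linear scorers, the gene names `GENE_A` and `GENE_B`, the weights, and the helper `predict_tissue` are hypothetical stand-ins, not the prediction models 142 themselves.

```python
from typing import Callable, Dict, Tuple

Profile = Dict[str, float]

def make_linear_model(weights: Dict[str, float], bias: float) -> Callable[[Profile], float]:
    """Build a hypothetical per-tissue scorer (a stand-in for a trained model)."""
    def score(profile: Profile) -> float:
        return bias + sum(w * profile.get(gene, 0.0) for gene, w in weights.items())
    return score

# One model per tissue type; the weights are made up for illustration.
models = {
    "colon": make_linear_model({"GENE_A": 1.0, "GENE_B": -0.5}, 0.0),
    "lung": make_linear_model({"GENE_B": 1.0, "GENE_A": -0.5}, 0.0),
}

def predict_tissue(profile: Profile, models) -> Tuple[Dict[str, float], str]:
    """Score the profile against every model and return all scores plus the best match."""
    scores = {tissue: model(profile) for tissue, model in models.items()}
    best = max(scores, key=scores.get)
    return scores, best

prediction_store = {}  # stands in for the prediction data database 159
profile = {"GENE_A": 8.0, "GENE_B": 1.0}
scores, best = predict_tissue(profile, models)
prediction_store["sample-1"] = {"scores": scores, "best": best}
```

Keeping every per-tissue score, rather than only the winner, leaves room for reporting several candidate tissue types when the top scores are close.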
- the prediction models 142 can advantageously be used to predict one or more origins of cancer cells based on expression data (e.g., transcriptional profiles) of a biological sample.
- the verification instructions 147 include instructions to verify the models 142 and/or to verify other data.
- the verification instructions 147 can use any type of verification methods (e.g., models, programs, etc.) to perform the verifications.
- the verification instructions 147 can use an iterative modeling process with portions of data (e.g., training data 157) to determine an accuracy of the prediction models 142.
- a five fold vector prediction may be used to verify that an SVM model is a preferred type of model to train the prediction models 142.
- the model type may be used to build each of model A 142a through model N 142n in the prediction models 142 (e.g., the model type can be used to build a model for each tissue type).
- Models include, but are not limited to, random forest, k-nearest neighbor, neural network, and RANSAC, among others. Models other than basic machine learning models may be used. Verification data (including but not limited to mean accuracy, positive predictive value, and false discovery rate) may be monitored to determine how well any of the prediction models 142 are performing; if any of the models 142 should be retrained, the prediction server 116 can retrain them using training instructions 146. In some embodiments, one or more of the multiple data models are updated automatically in response to a confidence score meeting or exceeding the predetermined confidence threshold.
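As a rough sketch of how such verification data might be monitored, the snippet below computes overall accuracy together with positive predictive value (PPV) and false discovery rate (FDR) per predicted class, and flags when retraining might be considered. The function names and the 0.95 thresholds in `needs_retraining` are illustrative assumptions, not criteria stated for retraining.

```python
from collections import Counter

def per_class_metrics(true_labels, predicted_labels):
    """Return overall accuracy plus PPV and FDR for each predicted class."""
    pairs = list(zip(true_labels, predicted_labels))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    predicted_counts = Counter(p for _, p in pairs)
    true_positive = Counter(p for t, p in pairs if t == p)
    metrics = {}
    for label, n_pred in predicted_counts.items():
        ppv = true_positive[label] / n_pred      # correct calls / all calls of this label
        metrics[label] = {"ppv": ppv, "fdr": 1.0 - ppv}
    return accuracy, metrics

def needs_retraining(accuracy, metrics, min_accuracy=0.95, min_ppv=0.95):
    """Flag retraining when accuracy or any class PPV drops below an assumed threshold."""
    return accuracy < min_accuracy or any(m["ppv"] < min_ppv for m in metrics.values())

accuracy, metrics = per_class_metrics(["a", "a", "b", "b"], ["a", "b", "b", "b"])
retrain = needs_retraining(accuracy, metrics)
```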
- the data organization instructions 136 may organize data used by and generated by the prediction server 116.
- the data organization instructions 136 may be configured to organize the data output by the training engine 146 for eventual storage as model A 142a through model N 142n in prediction models 142.
- the data organization instructions 136 may enable the prediction server 116 to organize the model data based on the data outputs of the training engine 146.
- the data organization instructions 136 organize and classify the data, or portions of the data.
- the data organization instructions 136 may be configured to organize the various data inputs based on genomics classifications and/or labeling.
- genomics classifications include, without limitation, shared/common pathways, cell communication behaviors, and/or cellular network behaviors.
- the data organization instructions 136 may be configured to organize the data output by the inference engine 144 for eventual storage as training data within the training data database 157 and/or transcriptional data within the patient data database 156. For instance, the data organization instructions 136 may enable the server 116 to organize the sample data based on inferences drawn by the inference engine 144.
- the arbitration instructions 148 may be configured to resolve conflicts within the instruction sets of the prediction server 116.
- the arbitration instructions 148 may be configured to resolve conflicts between inferences generated by the inference engine 144 and/or between conclusions drawn by the training engine 146.
- the arbitration instructions 148 may also enable the prediction server 116 to adhere to a predetermined policy or philosophy in connection with resolving such inference conflicts.
- these predetermined policies or philosophies may be applied to newly-generated inferences as well as inferences that were previously generated by the inference engine 144 and stored in connection with prediction models 142, prediction data 159, and/or training data 157.
- the prediction server 116 may also have one or more of its instruction sets (e.g., the inference engine 144) executed as a neural network or similar type of artificial intelligence data structure.
- these neural networks, such as an intelligent inference engine 144, may be capable of being dynamically trained and updated based on outputs of the prediction server 116.
- one or more models used by an intelligent inference engine 144 may be constantly analyzed for possible improvements thereto. Such analysis may be done internally or by an external neural network that is specifically designed to train other neural networks.
- the data organization instructions 136 may be executed as a neural network whose coefficients between nodes are constantly updated in accordance with desired updates to the data organization for any of the data associated with the prediction server 116. For instance, if a particular normalized data space is initially used by the data organization instructions 136, but there is a desire to try a second, different, normalized data space that focuses on different biological information (e.g., shared pathways as compared to cellular communication behaviors), then the data organization instructions 136 may be reconfigured (e.g., offline rather than reconfiguring online with live data) to determine if using a different normalized data space is useful, provides certain benefits, or makes the overall system work less efficiently.
- the data organization instructions 136 may be updated within the prediction server 116 to begin applying the new normalized data space to further organizations of the transcriptional profile data.
- a user may interact with the prediction server 116 via a communication network 104 and a client device 204.
- the communication network 104 facilitates machine-to-machine communications between one or more servers (e.g., prediction server 116) and/or one or more client devices (e.g., client device 204).
- With reference to Figs. 3-5, various methods of operating the systems 100, 200 or components therein will be described. It should be appreciated that any of the following methods may be performed in part or in total by any of the components depicted and described in connection with Figs. 1 or 2.
- step 304 data source(s)
- step 308 the expression levels of RNA are inferred to obtain RNA expression profiles.
- the inference engine 144 may access a data source to obtain the genetic material of cells in one or more tissue samples in order to obtain a full transcriptional profile of the sample(s).
- After the full transcriptional profile(s) have been obtained, they are compared to the training data to build a prediction model at step 312. At step 316, the prediction data model is stored in a database. The steps of Fig. 3 may be duplicated and/or repeated for various tissue types to obtain prediction models for each tissue type.
- the methods of Fig. 4 may be applied after the prediction models are obtained, e.g., using the methods of Fig. 3.
- the methods begin at step 404, where at least one transcriptional profile of the tissue sample is received.
- the transcriptional profile is input into each of the prediction models (e.g., the prediction models obtained in Fig. 3) to obtain predictions for each tissue type.
- the prediction server 116 may compare the transcriptional profile with each of the SVMs obtained in Fig. 3 to obtain a result of the comparison.
- the comparison results are confirmed.
- this may be done using five fold vector prediction, where the mean accuracy, positive predictive value, false discovery rate, etc., of the results of each comparison are calculated to determine a best match of the comparisons with the SVMs.
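The text calls this step "five fold vector prediction"; the sketch below reads it as ordinary five-fold cross-validation, which is an assumption. To stay self-contained it uses a nearest-centroid classifier as a stand-in for the SVMs, and all function names are hypothetical.

```python
def five_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, test_idx) pairs; every sample is held out exactly once."""
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

def fit_centroids(X, y):
    """Mean vector per label (stand-in for training one model per tissue type)."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for d, value in enumerate(row):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in vec] for lab, vec in sums.items()}

def predict(centroids, row):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda lab: dist(centroids[lab]))

def cross_validated_accuracy(X, y, n_folds=5):
    """Mean accuracy over held-out samples across all folds."""
    correct = 0
    for train, test in five_fold_indices(len(X), n_folds):
        centroids = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(centroids, X[i]) == y[i] for i in test)
    return correct / len(X)

# Toy data: class "a" clusters near (0, 0), class "b" near (10, 10).
X = [[0, 0], [10, 10], [1, 0], [10, 9], [0, 1], [9, 10], [1, 1], [9, 9], [0.5, 0.5], [9.5, 9.5]]
y = ["a", "b"] * 5
mean_accuracy = cross_validated_accuracy(X, y)
```

The same held-out predictions could feed PPV and false discovery rate calculations, so that all three metrics come from samples the model did not see during fitting.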
- the result of the tissue prediction is obtained by determining the best match from step 412.
- the result of the tissue prediction may be a single predicted tissue type, or multiple predicted tissue types.
- retraining a model will be described in accordance with at least some embodiments of the present disclosure.
- the methods begin with receiving raw input data at step 504.
- the model is modified based on retraining data at step 512.
- the modified model is verified.
- the retrained model may be stored in a database in step 520. It should be appreciated that any combination of prediction processes depicted and described herein can be performed without departing from the scope of the present disclosure. Alternatively or additionally, any number of other prediction processes can be developed by combining various portions or sub-steps of the described prediction processes without departing from the scope of the present disclosure.
- This Example describes studies to validate methods of predicting the tissue of origin of a given single tumor sample (n-of-one case) based on RNA sequencing data.
- CUP Carcinomas of Unknown Primary
- CUP Carcinomas of Uncertain Primary
- OPT Occult Primary Tumors
- This Example provides methodology and software tools for computer-assisted site of origin diagnosis to:
- FIG. 6 is a schematic that outlines aspects of the current invention, including the process for the validation and performance estimation of provided methodology and tools.
- Nant clinical - set of clinical tumor samples sequenced and processed by Nantomics, LLC of Culver City, California.
- HUGO - HUGO Gene Nomenclature Committee is a committee of the Human Genome Organisation that sets the standards for human gene nomenclature. HUGO gene symbols are gene symbols standardized and approved by this committee.
- RNA sequencing data from combined cohort of FFPE and FF samples was used.
- FFPE samples came from non-metastatic (as defined in 2.3.1) Nant clinical FI cohort and FF samples came from non-metastatic TCGA tumor cohort.
- TCGA data was downloaded in raw sequencing format from GDC and processed to produce TPM estimates per gene.
- Tumor submissions have several entries that can be used to label a tumor, for example a text-based cancer type and an optional ICD10 code.
- An aim was to remove complications from annotation from metastatic tumors, which might have been entered by either the primary or secondary anatomic site.
- the cancer type, pathology and ICD10 description fields were examined for the words “metastatic”, “metastasis”, or “secondary”, and samples containing any of them were marked as metastatic. All metastatic samples were set aside for the clinical curator to go through clinical reports and fill in these annotations, for use in other validation.
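A minimal sketch of this keyword screen is shown below. The record layout and the field names `cancer_type`, `pathology`, and `icd10_description` are assumptions made for illustration; only the three keywords come from the text.

```python
import re

# The three keywords named in the text, matched as whole words, case-insensitively.
METASTATIC_TERMS = re.compile(r"\b(metastatic|metastasis|secondary)\b", re.IGNORECASE)

def is_metastatic(record):
    """True if any examined annotation field mentions one of the keywords."""
    fields = ("cancer_type", "pathology", "icd10_description")  # hypothetical field names
    return any(METASTATIC_TERMS.search(record.get(f) or "") for f in fields)

def split_cohort(records):
    """Separate records into (non_metastatic, set_aside_for_curator)."""
    non_metastatic, set_aside = [], []
    for rec in records:
        (set_aside if is_metastatic(rec) else non_metastatic).append(rec)
    return non_metastatic, set_aside

records = [
    {"cancer_type": "Breast", "pathology": None},
    {"cancer_type": "Colon", "pathology": "Metastatic adenocarcinoma"},
    {"cancer_type": "Lung", "icd10_description": "Secondary malignant neoplasm"},
]
non_metastatic, set_aside = split_cohort(records)
```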
- Some tumors had cancer type annotation as Oral and Throat Cancers (Including Thyroid). Each of these was assigned to either Head and Neck (C00-C14 ICD10), Thyroid (C73 ICD10), or flagged as needing a review by the clinical curator if the ICD10 code matched neither.
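The reassignment rule can be written compactly. In this sketch the function name and the `NEEDS_REVIEW` marker are hypothetical; the code ranges (C00-C14 and C73) are the ones given in the text.

```python
import re

def assign_oral_throat_category(icd10_code):
    """Map an ICD10 code to 'Head and Neck', 'Thyroid', or 'NEEDS_REVIEW'."""
    match = re.match(r"^C(\d{2})", (icd10_code or "").strip().upper())
    if match:
        site = int(match.group(1))
        if 0 <= site <= 14:      # C00-C14
            return "Head and Neck"
        if site == 73:           # C73
            return "Thyroid"
    return "NEEDS_REVIEW"        # flagged for the clinical curator
```

For example, `assign_oral_throat_category("C10.9")` falls in the C00-C14 range, while a missing or unrelated code is routed to review rather than guessed.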
- the prediction model was trained on RNA sequencing data from combined cohort of non-metastatic Nant clinical FI FFPE samples and non-metastatic TCGA tumor FF samples. TPM expression quantifications obtained by running RSEM bioinformatics tool were used. Normalization techniques were applied to the TPM values, as described in the methodology description, to make TCGA samples more comparable to FFPE samples and to avoid batch effects in the final dataset.
- the training dataset includes 8,110 TCGA samples and 559 FFPE samples. The cancer types are shown in Figs. 8A and 8B.
- the 29 categories shown in Fig. 9 are composed of varying numbers of FFPE and TCGA samples as shown in the barplot of Fig. 9 (upper bars - FFPE samples, lower bars - TCGA samples).
- CCSI Clinical and Laboratory Standards Institute
- Ribosomal RNA was degraded, and stranded cDNA was created with the Kapa Stranded RNA-seq Kit with RiboErase.
- TPM quantifications for protein coding genes were extracted from RSEM output files by computing sum of all TPM quantifications for all transcripts per HUGO symbol for those symbols that had at least one NM_ transcript
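One reading of this aggregation step is sketched below: TPM values of all transcripts are summed per gene symbol, and only symbols with at least one `NM_` transcript are kept. The tuple-based input is a hypothetical simplification of RSEM's transcript-level output, and the identifiers in the example are made up.

```python
from collections import defaultdict

def gene_level_tpm(transcripts):
    """Sum transcript TPMs per gene symbol, keeping symbols with >= 1 NM_ transcript.

    `transcripts` is an iterable of (transcript_id, gene_symbol, tpm) tuples,
    an assumed simplification of the real quantification files.
    """
    totals = defaultdict(float)
    has_nm = set()
    for transcript_id, symbol, tpm in transcripts:
        totals[symbol] += tpm
        if transcript_id.startswith("NM_"):
            has_nm.add(symbol)
    return {symbol: total for symbol, total in totals.items() if symbol in has_nm}

rows = [
    ("NM_000001", "GENE_A", 2.0),
    ("NR_000002", "GENE_A", 1.5),
    ("NR_000003", "GENE_B", 4.0),  # no NM_ transcript, so GENE_B is dropped
]
gene_tpm = gene_level_tpm(rows)
```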
- the first TCGA dataset was quantile normalized by mapping per-gene quantiles between TCGA and FFPE data, with exclusion of at-zero expression from these distributions.
- the second TCGA dataset was normalized by using the first dataset's distributions to compute quantiles and then mapping those quantiles.
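A per-gene quantile mapping of this kind might look like the sketch below. It is an illustration under stated assumptions: the empirical-quantile and index-rounding choices are mine, and `build_mapper` is a hypothetical helper. Zero values are excluded from both distributions and passed through unchanged, as the text describes.

```python
from bisect import bisect_right

def build_mapper(source_values, reference_values):
    """Return a function mapping a value to the reference value at the same quantile.

    Built per gene: `source_values` come from the first TCGA dataset and
    `reference_values` from the FFPE data. Zeros are excluded from both.
    """
    src = sorted(v for v in source_values if v > 0)
    ref = sorted(v for v in reference_values if v > 0)

    def map_value(x):
        if x <= 0 or not src or not ref:
            return x                                  # zeros pass through unchanged
        q = bisect_right(src, x) / len(src)           # empirical quantile of x in source
        idx = min(len(ref) - 1, max(0, round(q * len(ref)) - 1))
        return ref[idx]

    return map_value

tcga_first = [0, 1, 2, 3, 4]
ffpe = [0, 10, 20, 30, 40]
mapper = build_mapper(tcga_first, ffpe)
normalized_first = [mapper(v) for v in tcga_first]
# The second dataset reuses the first dataset's distribution to compute quantiles.
normalized_second = [mapper(v) for v in [0, 2.5, 100]]
```

Reusing the mapper built from the first dataset means the second dataset is placed on the same scale without re-estimating the distributions.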
- tumor types have molecularly divergent subtypes. Information about the subtype is diagnostically, prognostically, and clinically important. Therefore, extra steps were taken to introduce subtype labels for the following tumor types: breast (basal or non-basal), esophageal (adenocarcinoma or squamous), and lung (adenocarcinoma or squamous). These subtypes were readily available for TCGA samples, as they were assigned by pathologists during data collection. These subtypes were not always available for FFPE clinical data. Therefore, a computational step was developed to predict tumor subtypes on samples for these three tumor types.
- centroid expression vector was computed (mean expression for each gene across that subtype cohort of samples).
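The centroid computation is a per-gene mean across the subtype cohort. The sketch below also shows one plausible way to use the centroids, assigning a sample to the subtype whose centroid has the highest Pearson correlation with it; that assignment rule is an assumption, as are the toy expression values.

```python
def centroid(samples):
    """Mean expression per gene across a cohort of equal-length expression vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def pearson(u, v):
    """Pearson correlation between two equal-length, non-constant vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5

def predict_subtype(sample, centroids):
    """Pick the subtype whose centroid correlates best with the sample (assumed rule)."""
    return max(centroids, key=lambda name: pearson(sample, centroids[name]))

centroids = {
    "basal": centroid([[10, 1, 1], [12, 1, 3]]),
    "non-basal": centroid([[1, 10, 2], [3, 12, 2]]),
}
subtype = predict_subtype([9, 2, 1], centroids)
```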
- Lymphoma samples were of Non-Hodgkin’s subtype. There were not enough samples to separate this subtype from other subtypes of Lymphoma. Therefore, all Lymphoma samples were combined into a single category labeled “Lymphoma”.
- Adrenal and Pheochromocytoma and Paraganglioma tumor types are molecularly similar and were not well separated by the prediction model. Therefore, these two tumor types were combined into a single category labeled “Adrenal/PCPG”.
- SVM Support Vector Machine
- Tissue category labels (27 unique categories) for each of these samples were used as training labels.
- a high confidence tissue category was any tissue that had a final accuracy of at least 95% and PPV of at least 95%.
- a low confidence tissue category was any tissue that had a final accuracy of below 95% or PPV of below 95%.
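The two definitions amount to a single threshold test on both metrics. In the sketch below the function name and the per-tissue values are hypothetical; the 95% cut-offs are those stated above.

```python
def confidence_category(accuracy, ppv, threshold=0.95):
    """'high' when both accuracy and PPV are at least the threshold, otherwise 'low'."""
    return "high" if accuracy >= threshold and ppv >= threshold else "low"

# Hypothetical per-tissue (accuracy, PPV) pairs, for illustration only.
per_tissue = {"tissue_a": (0.99, 0.97), "tissue_b": (0.96, 0.90)}
categories = {t: confidence_category(a, p) for t, (a, p) in per_tissue.items()}
```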
- 4.12.3.1.2.1.1. Between tumor purity and prediction evaluation metrics. For this analysis, computationally derived tumor purity values, obtained with a method based on allele frequencies, were used.
- 4.12.3.1.2.1.2. Between transcript integrity number (TIN), a post-sequencing computationally derived proxy for a sample’s RNA quality, and prediction evaluation metrics.
- TIN transcript integrity number
- Fig. 11 shows the confusion matrix summarizing the results for all predictions (both high and low confidence).
- the true labels are on the x-axis and the predicted labels are on the y-axis.
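For readers who want to reproduce this layout, the sketch below tabulates a confusion matrix with predicted labels as rows (y-axis) and true labels as columns (x-axis). The function name and the toy labels are illustrative assumptions.

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels):
    """Return (labels, matrix) where matrix[row][col] counts row = predicted, col = true."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(predicted_labels, true_labels))
    matrix = [[counts[(pred, true)] for true in labels] for pred in labels]
    return labels, matrix

labels, matrix = confusion_matrix(
    ["colon", "colon", "lung"],      # true labels (columns)
    ["colon", "pancreas", "lung"],   # predicted labels (rows)
)
```

Correct predictions fall on the diagonal; an off-diagonal count in the "pancreas" row and "colon" column would correspond to a colon sample predicted as pancreas.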
- the first sample incorrectly predicted was a colorectal liver met, which was predicted to come from pancreas. In this case neither site of origin, nor site of metastasis was predicted. It is, however, possible that this sample was initially mis-annotated and molecular-derived site is a more correct one.
- the second incorrectly predicted sample was a cystic carcinoma metastasized to lymph nodes, predicted to come from lung with a low confidence score.
- Fig. 13 shows the confusion matrix summarizing the results for all predictions (both high and low confidence). The true labels are on the x-axis and the predicted labels are on the y-axis.
- Per-tissue accuracy (A) and PPV (B) were analyzed, with the results shown in Figs. 14A and 14B. The points in the plots represent point estimates of each metric. Confidence intervals indicate 95th percentile binomial distribution confidence interval.
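The text does not say which binomial interval was used. As one common choice, the sketch below computes a Wilson score interval for a proportion such as per-tissue accuracy or PPV; the choice of the Wilson method and the function name are assumptions.

```python
import math

def wilson_interval(successes, trials, z=1.959964):
    """Wilson score interval for a binomial proportion (z of about 1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)

low, high = wilson_interval(50, 100)
```

Unlike the simple normal approximation, this interval stays within [0, 1] and remains informative when the point estimate is close to 100%, which is the typical situation for high-confidence tissue categories.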
- Out of 318 metastatic TCGA samples in the training dataset, 317 were predicted correctly and 1 non-basal breast sample was predicted to come from ovarian tissue.
- Figure 17 shows the summary of the comparison of the predicted labels to true labels after applying the prediction model to 2075 TCGA validation samples.
- the true labels are on the x-axis and the predicted labels are on the y-axis.
- the prediction model was run on FFPE samples annotated by a tissue type that was either not present in one of the model outputs or for which not enough validation samples were available, in order to test edge cases of when the sample’s site of origin was not present in the training data.
- the comparison of true labels to the predicted labels is summarized in the confusion matrix shown in Fig. 18. The true labels are on the x-axis and the predicted labels are on the y-axis.
- Figure 19 shows the results of a successful prediction of the model described in Example 1.
- the process of Example 1 successfully identified metastatic colon adenocarcinoma as being colon cancer.
- Figure 19A shows the locations of molecular signature groupings of many cancers, with the patient’s tumor depicted with a star and the most similar molecular signatures depicted as circles.
- Figure 19B shows the correlations and certain details for the most similar molecular signatures in the TCGA dataset, or those depicted as circles in Fig. 19A.
- Figure 19C shows the distributions of true positives and negatives and false positives and negatives from the process of Example 1. The dashed line indicates the score of the patient’s tumor, demonstrating that it was above any previously observed false positive prediction scores.
- Figure 20 shows the results of a successful prediction of the model described in Example 1.
- Figure 20 shows the locations of molecular signature groupings of many cancers, with the patient’s tumor depicted with a star and the most similar molecular signatures depicted as circles.
- the process of Example 1 successfully identified pediatric glioma, which was not in the training dataset, as similar to adult brain cancers.
- This Example demonstrates that the processes of the current invention are capable of identifying tumors that are unknown to the prediction models of the invention. The implications of this include the models’ abilities to encounter new and even unknown (to the model and/or to medicine in general) tumors and identify their origin and similarity to other tumors.
- those abilities would facilitate diagnosis, for example in identifying that a tumor has metastasized, and treatment, for example by helping with the selection of drugs that work on similar tumors.
- These aspects of the invention can, unexpectedly, outperform traditional diagnosis and evaluation, for example by doctors, as the processes of the invention can identify novel and rare tumors that can be difficult or impossible to diagnose by other means.
- Figure 21 shows the results of a prediction for which the model matches the predicted tumor type to a different tumor type.
- the model matched the tumor with basal breast cancer, while the tumor had been annotated as an adenoid cystic carcinoma.
- Adenoid cystic carcinomas are rare cancers and were not in the training dataset.
- the model here demonstrates that it can make connections between tumor types that can inform treatment and lead to better outcomes, superior to what traditional methods might accomplish.
- the molecular similarity between the adenoid cystic carcinoma and basal breast cancer indicates that certain treatments for basal breast cancer could be used for the treatment of the patient’s adenoid cystic carcinoma. This surprising finding demonstrates that the model can outperform other methods of diagnosis and inform treatment options that are appropriate for tumors in a manner superior to traditional diagnosis.
Abstract
The invention relates to a system and method for predicting the origin of a neoplastic tissue sample from a patient, to aid treatment of the patient by accurately informing a medical professional of the origin of the tissue so that an appropriate treatment can be offered to the patient. The method generally comprises preparing a biological data profile of the neoplastic tissue sample and comparing it with models in a model data space to determine the best fit of a model to the biological data profile.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201962885996P | 2019-08-13 | 2019-08-13 | |
| US62/885,996 | 2019-08-13 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021030193A1 true WO2021030193A1 (fr) | 2021-02-18 |
Family
ID=74569802
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2020/045421 Ceased WO2021030193A1 (fr) | System and method for classifying genomic data | 2019-08-13 | 2020-08-07 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2021030193A1 (fr) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116386043A (zh) * | 2023-03-27 | 2023-07-04 | 北京市神经外科研究所 | Method and system for rapidly marking glioma regions in brain nerve medical images |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20170051281A1 (en) * | 2014-02-19 | 2017-02-23 | The Trustees Of Columbia University In The City Of New York | Method and composition for diagnosis or treatment of aggressive prostate cancer |
| WO2019018374A1 (fr) * | 2017-07-17 | 2019-01-24 | University Of Pittsburgh-Of The Commonwealth System Of Higher Education | Diagnostic and prognostic test for multiple cancer types based on transcript profiling |
| US10340031B2 (en) * | 2017-06-13 | 2019-07-02 | Bostongene Corporation | Systems and methods for identifying cancer treatments from normalized biomarker scores |
Non-Patent Citations (2)
| Title |
|---|
| JOAO C. GUIMARAES, MIHAELA ZAVOLAN: "Patterns of ribosomal protein expression specify normal and malignant human cells", GENOME BIOLOGY, vol. 17, no. 1, 1 December 2016 (2016-12-01), XP055566778, DOI: 10.1186/s13059-016-1104-z * |
| RAM AJORE, DAVID RAISER, MARIE MCCONKEY, MAGNUS JÖUD, BERND BOIDOL, BRENTON MAR, GORDON SAKSENA, DAVID M WEINSTOCK, SCOTT ARMSTRON: "Deletion of ribosomal protein genes is a common vulnerability in human cancer, especially in concert with TP53 mutations", EMBO MOLECULAR MEDICINE (ONLINE), WILEY - V C H VERLAG GMBH & CO. KGAA, DE, vol. 9, no. 4, 1 April 2017 (2017-04-01), DE, pages 498 - 507, XP055566779, ISSN: 1757-4684, DOI: 10.15252/emmm.201606660 * |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20853000; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 20853000; Country of ref document: EP; Kind code of ref document: A1 |