WO2024186669A1 - Ancestry-adjusted polygenic risk score (PRS) models and model pipeline - Google Patents
Ancestry-adjusted polygenic risk score (PRS) models and model pipeline
- Publication number
- WO2024186669A1 (PCT/US2024/018190)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- ancestry
- data
- model
- prs
- arrays
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/40—Population genetics; Linkage disequilibrium
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
Definitions
- the present disclosure provides systems, software, and methods for generating and evaluating polygenic risk score models.
- the systems, software, and methods of the present disclosure may have applications in the areas of polygenic risk score (PRS) models for predicting disease state, and other phenotypes, of individuals based on genetic data corresponding to the individuals.
- the systems, software, and methods of the present disclosure may have applications in the areas of genetics, precision medicine, and personalized medicine.
- PRS polygenic risk score
- Early diagnosis and prevention of chronic modern diseases, including type 2 diabetes (T2D) and hypertension, may significantly improve patient outcomes. Early diagnosis may allow for better allocation of intervention strategies that are effective at reducing the risk of disease progression. According to medical practitioners, screening is often insufficient because chronic diseases tend to progress slowly until they manifest clinically later in life.
- PRS may be generated by operating a trained PRS model on genetic data corresponding to an individual.
- PRS represent a measure of the individual's overall genetic liability to a trait or disease.
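As a minimal illustration of how a score can be produced from genetic data, the sketch below uses the standard additive formulation, in which a PRS is the weighted sum of an individual's effect-allele dosages. The source does not spell out this formula, so treat it as an assumption; the function name, the toy dosages, and the weights are all hypothetical.

```python
import numpy as np

def polygenic_risk_score(dosages, weights):
    """Additive PRS: for each individual, sum over variants of dosage * weight.

    dosages: (n_individuals, n_variants) effect-allele counts in [0, 2]
    weights: (n_variants,) per-variant effect sizes (hypothetical values here)
    """
    dosages = np.asarray(dosages, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if dosages.shape[1] != weights.shape[0]:
        raise ValueError("dosages and weights must cover the same variants")
    return dosages @ weights

# Toy data: two individuals, three variants (illustrative only).
toy_dosages = np.array([[0, 1, 2],
                        [2, 0, 1]])
toy_weights = np.array([0.10, -0.05, 0.20])
toy_scores = polygenic_risk_score(toy_dosages, toy_weights)
print(toy_scores)
```

In practice the weights would come from a trained PRS model rather than being set by hand.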
- the rapid development of PRS stresses the importance of accurately assessing the ancestry makeup of participants in biomedical studies to avoid potential selection biases.
- current PRS models may mostly be built from participants of a single ancestry (typically European) and may not generalize to other ancestral groups.
- there is a need for a PRS modelling system that generates predictive tools trained on heterogeneous datasets that are able to predict susceptibility using historical data available outside of clinical and research settings. Further, there is a need for a PRS modelling system that takes individual ancestry into account when generating PRS. These predictive tools may be effective for identifying individuals at risk of various undesirable outcomes, including risk of certain diseases.
- a major challenge to enabling precision health at a global scale is the bias between those who enroll in state sponsored genomic research and those suffering from chronic disease.
- the present disclosure provides a PRS programming pipeline that enables information contained in direct to consumer (DTC) databases of genetic and phenotype data to be used to generate PRS models.
- DTC direct to consumer
- DTC databases include information corresponding to groups of individuals having diverse ancestries as well as individuals having unique, identified, specific ancestry.
- the PRS programming pipeline generates PRS models that provide improved ancestry-based performance according to the technology disclosed herein.
- the present disclosure provides a computer-implemented method for generating a trained ancestry-adjusted polygenic risk score (PRS) model, comprising: (a) processing, using a genetic primary component analysis (PCA) model, genotype data and phenotype data corresponding to one or more populations of individuals to generate genetic primary components (PCs) corresponding to the genotype data and the phenotype data; (b) processing, using a genome-wide association study (GWAS) model, the genotype data and the phenotype data, thereby generating GWAS results, wherein the GWAS model comprises covariables comprising at least one of the genetic PCs; (c) obtaining estimates of global ancestry corresponding to individuals of the one or more populations of individuals; and (d) training a PRS model using at least the GWAS results and the estimates of global ancestry as covariates, thereby generating the trained ancestry-adjusted PRS model.
- PCA genetic primary component analysis
- (a) further comprises using reference genomic data from a population with known ancestry or ancestry distribution.
- the covariables comprise at least 10 of the genetic PCs.
- (c) further comprises receiving the estimates of global ancestry from a computer system.
- (c) further comprises processing the genotype data using a trained global ancestry model, thereby determining the estimates of global ancestry.
- the trained global ancestry model is trained using reference genomic data from a population with known ancestry or ancestry distribution.
- the trained global ancestry model is trained using an instance of a Neural ADMIXTURE model.
- the method further comprises training a global ancestry model using reference genomic data from a population with known ancestry or ancestry distribution, thereby generating the trained global ancestry model.
- training the global ancestry model further comprises training an instance of a Neural ADMIXTURE model.
- the genotype data and the phenotype data comprise a merged array.
- the method further comprises: receiving one or more arrays of genotype data and phenotype data, each of the one or more arrays corresponding to a population of individuals comprising a direct-to-consumer (DTC) platform cohort, each of the one or more arrays comprising a plurality of variants; performing one or more quality control (QC) operations on the genotype data and the phenotype data to identify individuals and variants for removal from the one or more arrays; removing, from the one or more arrays, genotype data and phenotype data corresponding to the identified individuals and variants; performing imputation on the genotype data, thereby harmonizing each of the one or more arrays, wherein each of the one or more harmonized arrays comprises the same plurality of variants as each other of the one or more arrays; and combining the one or more harmonized arrays to generate the merged array.
- performing the imputation on the genotype data further comprises using reference genomic data from a population with known ancestry or ancestry distribution.
- the method further comprises identifying individuals and variants for removal when genotype data and phenotype data corresponding to the individuals and variants fail to meet a QC threshold.
- the method further comprises storing the ancestry-adjusted PRS model in a repository.
- the present disclosure provides a computer system comprising one or more computer processors and computer memory coupled thereto, wherein the computer memory comprises machine-executable code that, upon execution by the one or more computer processors, implements a method for harmonizing data corresponding to one or more direct-to-consumer (DTC) platforms, the method comprising: receiving one or more arrays of genotype data and phenotype data, each of the one or more arrays corresponding to a population of individuals comprising a DTC platform cohort, each of the one or more arrays comprising a plurality of variants; performing one or more quality control (QC) operations on the genotype data and the phenotype data to identify individuals and variants for removal from the one or more arrays; removing, from the one or more arrays, genotype data and phenotype data corresponding to the identified individuals and variants; performing imputation on the genotype data using reference genomic data from a population with known ancestry or ancestry distribution, thereby harmonizing the one or more
- the method further comprises using the merged array to generate a PRS model.
- the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for generating a trained ancestry-adjusted polygenic risk score (PRS) model, the method comprising: (a) processing, using a genetic primary component analysis (PCA) model, genotype data and phenotype data corresponding to one or more populations of individuals to generate genetic primary components (PCs) corresponding to the genotype data and the phenotype data; (b) processing, using a genome-wide association study (GWAS) model, the genotype data and the phenotype data, thereby generating GWAS results, wherein the GWAS model comprises covariables comprising at least one of the genetic PCs; (c) obtaining estimates of global ancestry corresponding to individuals of the one or more populations of individuals; and (d) training a PRS model using at least the GWAS results and the estimates of global ancestry as
- (a) further comprises using reference genomic data from a population with known ancestry or ancestry distribution.
- the covariables comprise at least 10 of the genetic PCs.
- (c) further comprises receiving the estimates of global ancestry from a computer system.
- (c) further comprises processing the genotype data using a trained global ancestry model, thereby determining the estimates of global ancestry.
- the trained global ancestry model is trained using reference genomic data from a population with known ancestry or ancestry distribution.
- the trained global ancestry model is trained using an instance of a Neural ADMIXTURE model.
- the method further comprises training a global ancestry model using reference genomic data from a population with known ancestry or ancestry distribution, thereby generating the trained global ancestry model.
- training the global ancestry model further comprises training an instance of a Neural ADMIXTURE model.
- the genotype data and the phenotype data comprise a merged array.
- the method further comprises: receiving one or more arrays of genotype data and phenotype data, each of the one or more arrays corresponding to a population of individuals comprising a direct-to-consumer (DTC) platform cohort, each of the one or more arrays comprising a plurality of variants; performing one or more quality control (QC) operations on the genotype data and the phenotype data to identify individuals and variants for removal from the one or more arrays; removing, from the one or more arrays, genotype data and phenotype data corresponding to the identified individuals and variants; performing imputation on the genotype data, thereby harmonizing each of the one or more arrays, wherein each of the one or more harmonized arrays comprises the same plurality of variants as each other of the one or more arrays; and combining the one or more harmonized arrays to generate the merged array.
- performing the imputation on the genotype data further comprises using reference genomic data from a population with known ancestry or ancestry distribution.
- the method further comprises identifying individuals and variants for removal when genotype data and phenotype data corresponding to the individuals and variants fail to meet a QC threshold.
- the method further comprises storing the ancestry-adjusted PRS model in a repository.
- the present disclosure provides a non-transitory computer readable medium for harmonizing data corresponding to one or more direct-to-consumer (DTC) platforms, comprising, with one or more processors: receiving one or more arrays of genotype data and phenotype data, each of the one or more arrays corresponding to a population of individuals comprising a DTC platform cohort, each of the one or more arrays comprising a plurality of variants; performing one or more quality control (QC) operations on the genotype data and the phenotype data to identify individuals and variants for removal from the one or more arrays; removing, from the one or more arrays, genotype data and phenotype data corresponding to the identified individuals and variants; performing imputation on the genotype data using reference genomic data from a population with known ancestry or ancestry distribution, thereby harmonizing the one or more arrays, wherein each of the one or more harmonized arrays comprises the same plurality of variants as each other of the one or
- the method further comprises using the merged array to generate a PRS model.
- the present disclosure provides a computer-implemented method for generating a trained ancestry-adjusted polygenic risk score (PRS) model, comprising: (a) processing genotype data and phenotype data corresponding to populations of individuals to generate genetic primary components (PCs) and genome-wide association study (GWAS) results using covariables comprising at least one of the genetic PCs; and (b) training a PRS model using at least the GWAS results and estimates of global ancestry corresponding to individuals of the populations as covariates, thereby generating the trained ancestry-adjusted PRS model.
- the present disclosure provides a computer-implemented method for generating a trained ancestry-adjusted polygenic risk score (PRS) model, comprising generating genome-wide association study (GWAS) results from populations of individuals, and training a PRS model using at least the GWAS results and estimates of global ancestry corresponding to individuals of the populations.
- PRS ancestry-adjusted polygenic risk score
- the present disclosure provides a computer-implemented method comprising processing test genotype data of a test subject with an ancestry-adjusted polygenic risk score (PRS) model to determine an ancestry-adjusted polygenic risk score of the test subject.
- the ancestry-adjusted PRS model is obtained at least in part by: (a) processing, using a genetic primary component analysis (PCA) model, genotype data and phenotype data corresponding to one or more populations of individuals to generate genetic primary components (PCs) corresponding to the genotype data and the phenotype data; (b) processing, using a genome-wide association study (GWAS) model, the genotype data and the phenotype data, thereby generating GWAS results, wherein the GWAS model comprises covariables comprising at least one of the genetic PCs; (c) obtaining estimates of global ancestry corresponding to individuals of the one or more populations of individuals; and (d) training a PRS model using at least the GWAS results and the estimates of global ancestry as covariates, thereby generating the trained ancestry-adjusted PRS model.
- GWAS genome-wide association study
- Another aspect of the present disclosure provides a non-transitory computer readable medium comprising machine executable code that, upon execution by one or more computer processors, implements any of the methods above or elsewhere herein.
- Another aspect of the present disclosure provides a system comprising one or more computer processors and computer memory coupled thereto.
- the computer memory comprises machine executable code that, upon execution by the one or more computer processors, implements any of the methods above or elsewhere herein.
- Figure 1 is an example schematic diagram of a PRS system.
- Figure 2 illustrates an example of a process flow depicting an operating method.
- Figure 3A is an example schematic diagram illustrating a first embodiment of a PRS pipeline.
- Figure 3B is an example schematic diagram illustrating a second embodiment of a PRS pipeline.
- Figure 4 depicts data plots illustrating results of an example implementation of the process flow of Figure 2.
- Figure 5A depicts tabulated data illustrating performance metrics corresponding to the example implementation of the process flow of Figure 2.
- Figure 5B depicts data plots illustrating performance metrics corresponding to the example implementation of the process flow of Figure 2.
- Figure 6 depicts data plots illustrating comparison of results of an example implementation of the process flow of Figure 2 with results of other PRS models.
- Figure 7A depicts a correlation plot between two PGS models for gout.
- Figure 7B depicts a QQ-plot between the two PGS models for gout of Figure 7A.
- Figure 7C depicts a correlation plot between the two PGS models for gout of Figure
- Figure 8 depicts correlation plots and agreement matrices between two pairs of PGS models for liver enzymes.
- Figure 9 depicts a schematic of a workflow to evaluate existing European-focused PRS.
- Figure 10 depicts an example of risk levels provided by PRS models.
- Figures 11A-11C depict the proportion of individuals per ancestry for (a) UKBB and (b) the Galatea Bio collection; (c) ancestry deconvolution for the Galatea Bio collection generated using GB proprietary software.
- Figure 12 shows a computer system 1201 that is programmed or otherwise configured to perform operations of the methods.
- PLINK 2 generally refers to a program used to identify genetically related individuals (by applying the KING algorithm), to perform primary component analysis (PCA) to identify global ancestry per individual, and to perform genome-wide association studies (GWAS) [48].
- Beagle generally refers to a program used to perform imputation on genotype data and arrays [49].
- Neural ADMIXTURE generally refers to a trained neural network used to infer global ancestry per individual.
- 1000 Genomes Project Consortium generally refers to a provider of genotype data for a reference population of known ancestry, which is used in performing PCA and imputation and in training a Neural ADMIXTURE model [57].
- BASIL algorithm generally refers to the Batch Screening Iterative LASSO (BASIL) algorithm used to generate PRS models [50].
- pROC package generally refers to the pROC package in R, which may be used to evaluate PRS models using the area under the receiver operating characteristic (ROC) curve (AUC) [58].
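The AUC metric mentioned above can be illustrated with a short sketch. The source performs this evaluation with pROC in R; the Python function below is only a stand-in that computes the same quantity through its rank interpretation (the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one half). The function name and the toy labels and scores are hypothetical.

```python
import numpy as np

def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise case/control comparisons."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    cases, controls = scores[labels], scores[~labels]
    if cases.size == 0 or controls.size == 0:
        raise ValueError("need at least one case and one control")
    # Compare every case score with every control score.
    diff = cases[:, None] - controls[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)

# Toy example: two controls (0) and two cases (1) with hypothetical PRS values.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

The pairwise comparison is quadratic in sample size, which is fine for illustration but not for biobank-scale cohorts.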
- PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559-75, which is incorporated by reference in its entirety.
- BEAGLE: an application programming interface and high-performance computing library for statistical phylogenetics. Syst Biol. 2012;61(1):170-3, which is incorporated by reference in its entirety.
- Neural ADMIXTURE: rapid population clustering with autoencoders; 2021. bioRxiv, which is incorporated by reference in its entirety.
- the present disclosure provides a novel genotype-phenotype platform that integrates heterogeneous data sources and applies learnings to common chronic disease conditions including, for example, Type 2 diabetes (T2D) and hypertension.
- T2D Type 2 diabetes
- the present disclosure provides a PRS modelling system, or PRS system for short, that uses data from DTC platforms to invert research-based PRS models that are developed using genome sequencing and phenotypic data that is acquired as part of a designed research study.
- Quality control (QC) mechanisms of the present disclosure may enable traditional GWAS and PRS analyses based on DTC data.
- Inferred genetic ancestry information corresponding to individuals is used by the PRS system to adjust PRS models, thereby generating ancestry-adjusted PRS models.
- the genetics of ancestry-adjusted PRS models generated by the PRS system are supported by their ability to replicate known variants from publicly available independent GWAS studies.
- the genetics of many disease states may be studied extensively in controlled datasets, and various polygenic risk scores (PRS) may be developed.
- the present disclosure enables predictive tools for both phenotypes (T2D and hypertension), trained with heterogeneous genotypic and phenotypic data, e.g., DTC data, generated outside of the clinical environment.
- Methods of the present disclosure recapitulate prior findings using various techniques with fidelity.
- the methods and systems of the present disclosure may be used to leverage DTC genetic repositories to identify individuals at risk of debilitating diseases based on their unique genetic landscape so that informed, timely clinical interventions can be incorporated.
- Ancestry-adjusted PRS models generated by the PRS system of the present disclosure can replicate previous findings and deliver enhanced discovery and single-variant resolution of causal T2D and hypertension risk and protection alleles. Additionally, the results provided by the present disclosure have confirmed the potential impact of DTC resources on mechanistic insights and clinical translation efforts.
- Ancestry-adjusted PRS models of the present disclosure may provide improvements over current PRS models by defining and taking into account the genetic ancestry of participants.
- genetic ancestry is embedded as a covariate in the ancestry-adjusted PRS models.
- genetic ancestry is embedded by scaling PRS as part of a post-processing step. In this manner, the present disclosure provides more accurate models than the traditional approach of filtering to only European-based PRS models.
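One common way to scale scores by ancestry in a post-processing step is to regress the raw scores on ancestry covariates and standardize the residuals. The source does not specify its scaling procedure, so the sketch below is an assumption rather than the disclosed method; the function name and the simulated data are hypothetical.

```python
import numpy as np

def ancestry_adjust(raw_prs, ancestry):
    """Remove the linear ancestry trend from raw PRS, then standardize.

    raw_prs:  (n,) unadjusted scores
    ancestry: (n, k) ancestry covariates, e.g. genetic PCs or ancestry
              proportions (assumed inputs; lstsq tolerates collinear columns)
    """
    raw_prs = np.asarray(raw_prs, dtype=float)
    ancestry = np.asarray(ancestry, dtype=float)
    design = np.column_stack([np.ones(raw_prs.size), ancestry])
    coef, *_ = np.linalg.lstsq(design, raw_prs, rcond=None)
    residuals = raw_prs - design @ coef
    return residuals / residuals.std(ddof=1)

# Simulated example: raw scores drift with the first ancestry covariate.
rng = np.random.default_rng(0)
toy_ancestry = rng.normal(size=(200, 2))
toy_raw = 2.0 * toy_ancestry[:, 0] + rng.normal(size=200)
toy_adjusted = ancestry_adjust(toy_raw, toy_ancestry)
```

After adjustment the scores have zero mean and unit variance and are uncorrelated with the ancestry covariates, so a given score means the same thing regardless of where an individual sits in ancestry space.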
- the present disclosure provides a validated DTC framework for validating and extending PRS models based on DTC data.
- This enables cost-effective approaches of enrolling understudied populations in complex disease genetics.
- An example embodiment of the DTC framework incorporates DTC data included in the Biobank Metaanalysis Network.
- the DTC framework provides a “straight to mobile instead of landlines” opportunity. This is advantageous for enabling low- and middle-income countries to harness advances in genomics for the study of their own populations where state-sponsored capacity for performing current methods is limited.
- the PRS system is capable of generating ancestry-adjusted PRS models for one or more disease states or other phenotypes from heterogeneous datasets, for example from datasets housing a combination of genetic data and self-reported information from one or more DTC genetics companies.
- the ancestry-adjusted PRS models are able to identify subsets of users at substantially increased risk of presenting one or more disease states or other phenotypes of interest. This is particularly advantageous because it indicates that the ever-increasing availability of genetic data from DTC providers, most of it not annotated for traits of clinical relevance, can be leveraged by the present disclosure to generate predictive tools capable of improving diagnosis and prevention of diseases with genetic determinants.
- DTC platforms can offer a wide range of information about personal wellness, ancestry, physical characteristics, and traits. Advances in genomic research may lead the DTC genomics industry to flourish and to deliver accurate yet easy-to-interpret genomic results. Strict privacy policies may prohibit many companies from sharing customers’ data without their consent. These platforms can serve as informative repositories giving actionable insights that aid traditional clinical approaches. Recruiting subjects for various complex phenotypes via online surveys may open up multiple avenues to complement conventional research and clinical strategies. DTC platforms also provide convenience along with a wider reach to recruit participants from various locations. They overcome barriers ranging from single-point data collection centers to language restrictions, thus allowing the aggregation of data from places with different ancestries and demographics. Democratizing access to these genetic platforms and prediction tools may boost progress in precision medicine.
- a PRS system may be able to replicate reported results with a very fast turnaround time.
- the participation of individual customers with the PRS system platform allows for the generation of a rich dataset that enables the creation of ancestry-adjusted PRS models according to the subject technology.
- the comparable predictive performance of the ancestry-adjusted PRS models enables the PRS system to quickly generate more ancestry-adjusted PRS models.
- the ancestry-adjusted DTC models replicate the findings seen in academic and government-funded biobanks.
- the clinical actionability of current PRS models has yet to be determined through pragmatic trials involving real-world data.
- Ancestry-adjusted PRS models according to the present disclosure provide a novel source of information that can shed light on this important issue and improve early diagnosis and prevention, bringing precision medicine at scale to all.
- Figure 1 shows an example PRS system 100.
- the PRS system 100 is configured to generate PRS models, including PRS models that are scaled based on ancestry.
- the PRS system 100 is configured to operate one or more ancestry models to generate estimates of ancestry for individuals.
- the PRS system 100 is also configured to operate one or more PRS models to generate PRS results and to scale the PRS results based on the ancestry estimates.
- the PRS system 100 is in communication with an external data source 200.
- the external data source 200 represents one or more external data sources, for example one or more DTC platforms or repositories of data provided by the one or more DTC platforms, one or more sources of vcf files, one or more PRS model catalogs, and one or more other data stores, for example a data store comprising genotype data corresponding to one or more populations of known ancestry.
- a data receiving interface 110 receives data from the data source 200, for example DTC data from one or more DTC platforms, and can communicate the data to a QC module 115 and to an internal data store 120.
- the QC module 115 is configured to perform one or more QC operations on the received data to generate QC’d data.
- the QC module 115 can detect and eliminate outliers in DTC data and can screen the DTC data for replicated individuals, related individuals, and individuals for whom data necessary for further processing steps is missing.
- the QC module 115 communicates QC’d data to an imputation module 125 and to an ancestry module 145.
- the QC module 115 may also store QC’d data in the data store 120.
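The kind of screening performed by the QC module can be sketched with simple call-rate and minor-allele-frequency filters. The source does not state which QC operations or thresholds are used, so the function name, the threshold values, and the toy matrix below are all hypothetical; duplicate and relatedness checks (e.g. with the KING algorithm) are not covered here.

```python
import numpy as np

def qc_filter(genotypes, min_variant_call_rate=0.95,
              min_sample_call_rate=0.95, min_maf=0.01):
    """Return boolean masks (keep_samples, keep_variants).

    genotypes: (n_samples, n_variants) allele counts 0/1/2, np.nan = missing.
    """
    g = np.asarray(genotypes, dtype=float)
    called = ~np.isnan(g)

    # Variant-level filters: call rate, then minor allele frequency.
    keep_variants = called.mean(axis=0) >= min_variant_call_rate
    with np.errstate(invalid="ignore", divide="ignore"):
        freq = np.nansum(g, axis=0) / (2 * called.sum(axis=0))
    maf = np.nan_to_num(np.minimum(freq, 1 - freq), nan=0.0)
    keep_variants &= maf >= min_maf

    # Sample-level filter: call rate across the variants that survived.
    keep_samples = called[:, keep_variants].mean(axis=1) >= min_sample_call_rate
    return keep_samples, keep_variants

nan = np.nan
toy_genotypes = np.array([[0,   1,   0, 2],
                          [1,   2,   0, nan],
                          [2,   1,   0, nan],
                          [nan, nan, 0, nan]])
toy_keep_samples, toy_keep_variants = qc_filter(
    toy_genotypes, min_variant_call_rate=0.7, min_sample_call_rate=0.5)
qc_genotypes = toy_genotypes[toy_keep_samples][:, toy_keep_variants]
```

Filtering variants before samples means a sample is not penalized for missingness at variants that would be dropped anyway.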
- the imputation module 125 performs imputation on genotype data, for example on QC’d DTC data, to infer unobserved genotypes in the data. For example, if a genotyping array received from a first DTC platform includes one or more variants that are missing from a genotyping array received from a second DTC platform, the imputation module 125 may impute values for the missing variants in the second genotyping array to harmonize the first and second genotyping arrays. The imputation module 125 may further merge imputed data sets, for example multiple imputed data sets based on DTC data received from multiple different DTC platforms, to generate merged genetic data, e.g., merged DTC data.
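The harmonize-then-merge step can be sketched as follows. Real imputation would use a reference panel and a dedicated tool such as Beagle; here a naive per-variant mean fill stands in for it purely to show how arrays with different variant sets end up with the same columns. The function name, variant IDs, and toy data are hypothetical.

```python
import numpy as np

def harmonize_and_merge(arrays):
    """Align platform arrays on the union of their variants and stack them.

    arrays: list of (variant_ids, genotype_matrix) pairs, one per platform.
    Missing entries are filled with the mean dosage observed for that
    variant, a placeholder for reference-based imputation.
    """
    all_variants = sorted(set().union(*(ids for ids, _ in arrays)))
    column = {variant: j for j, variant in enumerate(all_variants)}

    blocks = []
    for ids, genotypes in arrays:
        genotypes = np.asarray(genotypes, dtype=float)
        block = np.full((genotypes.shape[0], len(all_variants)), np.nan)
        block[:, [column[v] for v in ids]] = genotypes
        blocks.append(block)
    merged = np.vstack(blocks)

    # Placeholder imputation: per-variant mean of the observed dosages.
    means = np.nanmean(merged, axis=0)
    missing = np.isnan(merged)
    merged[missing] = np.take(means, np.nonzero(missing)[1])
    return all_variants, merged

platform_a = (["rs1", "rs2"], [[0, 2], [2, 2]])
platform_b = (["rs2", "rs3"], [[1, 0], [1, 2]])
merged_variants, merged_array = harmonize_and_merge([platform_a, platform_b])
```

After the call, every row of `merged_array` covers the same variants, which is the property the merged array needs for downstream GWAS.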
- the ancestry module 145 operates one or more models or methods on received data, for example on DTC data, to generate ancestry-related data.
- the ancestry module 145 can operate a principal component analysis (PCA) model to generate a set of principal components (PCs) corresponding to ancestry of a dataset (e.g., genetic PCs).
- the ancestry module 145 can operate a global ancestry model, for example a trained Neural ADMIXTURE model, on genetic data to generate individual ancestry estimates.
- the PRS system 100 includes a GWAS module 130 that performs genome wide association studies (GWAS) on genetic data, for example on merged DTC data, to generate GWAS results.
- the GWAS module 130 receives genetic PCs from the ancestry module 145 and uses them as covariables in a GWAS model.
- the PRS system 100 includes an output module 150, which can include, for example, one or more of a display device, a data communication interface, a data store, and a processor for further processing of data, for example to enable display on the display device.
- the GWAS module 130 communicates GWAS results to the output module for display on a display device thereof.
- a PRS generation module 135 receives GWAS results from the GWAS module 130 as well as ancestry information from the ancestry module 145, for example individual ancestry estimates.
- the PRS generation module creates one or more PRS models based on the GWAS results and the ancestry information.
- the PRS generation module 135 can communicate PRS models to the output module 150, for example for sharing with one or more external systems or model repositories.
- the PRS system 100 includes a PRS execution module 140.
- the PRS execution module 140 is configured to receive or load a PRS model and operate the PRS model on individual genotype data to generate PRS.
- the PRS execution module 140 receives and operates PRS models generated by the PRS system 100 from the PRS generation module 135.
- the PRS execution module 140 loads and operates an external PRS model from the data store 120, or directly from the data receiving interface 110.
- the external PRS model includes, for example, a PRS model from a PRS model catalog.
- the PRS execution module 140 can also receive individual ancestry estimates from the ancestry module 145 and scale PRS generated by using a PRS model with the individual ancestry estimates.
- the PRS system 100 includes a user interface (UI) 160 to enable a user to enter information to configure one or more aspects of the PRS system.
- the PRS system 100 further includes at least one processor 170 and associated memory for storing and operating the various modules and data stores. It is noted that embodiments of the PRS system 100 can include multiple processors 170, one or more individual computing devices, multiple UIs 160, and one or more of each of the various modules and data stores.
- Figure 2 shows an example method for generating PRS models based on DTC data. The method can be performed by a PRS system of the present disclosure.
- the PRS system receives, from a DTC source, DTC data.
- the DTC data include genetic data as well as self-reported answers to questions asked of participants corresponding to the genetic data, for example using a questionnaire or online survey.
- questions include, for example, questions about general conditions like diabetes, blood pressure, lipid profile, and medication intake, and questions about age, sex, weight, and height.
- the PRS system performs one or more quality control (QC) operations on the DTC data to generate QC’d DTC data.
- the QC operations can include, for example, QC checks, outlier detections, and corresponding data correction and/or pruning of the received DTC data.
- Non-limiting examples of QC operations that may be performed by the PRS system include one or more of: Excluding participants for whom genotype data is not available;
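The screening described above can be sketched as a simple filter over participant records. This is a minimal sketch under stated assumptions: the record layout, the `REQUIRED_FIELDS` tuple, and the function name are hypothetical, and outlier detection and relatedness screening are not shown.

```python
# Fields assumed to be required for later processing (an illustrative choice).
REQUIRED_FIELDS = ("genotype", "age", "sex")


def screen_participants(records):
    """Drop replicated participant IDs and participants with missing required data."""
    seen_ids = set()
    kept = []
    for record in records:
        if record["id"] in seen_ids:
            continue  # replicated individual: keep only the first usable record
        if any(record.get(field) is None for field in REQUIRED_FIELDS):
            continue  # genotype or other required data not available
        seen_ids.add(record["id"])
        kept.append(record)
    return kept
```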
- the PRS system parses the QC’d DTC data to generate case and control groups for PRS model development.
- Case groups include individuals whose data indicate they may have a particular condition for which a particular PRS model is being generated.
- the system may include participants who report having high blood pressure, participants who are taking blood pressure medications, and/or participants reporting lab results that may be indicative of hypertension.
- Control groups include participants who did not report managing a health condition relevant to the particular PRS model, and those who did not report managing any health condition. Control groups can also include participants who report lab results or other information that is indicative of the participant not having the relevant condition.
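A minimal sketch of the case and control assignment for a hypertension model follows. The field names are hypothetical, and the 140 mmHg systolic cutoff is only an illustrative stand-in for "lab results that may be indicative of hypertension"; the source does not specify a threshold.

```python
SYSTOLIC_CUTOFF_MMHG = 140  # illustrative assumption, not taken from the source


def assign_group(participant):
    """Return "case" or "control" for a hypothetical hypertension PRS model."""
    reports_condition = participant.get("reports_high_bp", False)
    takes_medication = participant.get("takes_bp_medication", False)
    systolic = participant.get("systolic_mmhg")  # None when no lab result was reported
    abnormal_lab = systolic is not None and systolic >= SYSTOLIC_CUTOFF_MMHG
    if reports_condition or takes_medication or abnormal_lab:
        return "case"
    return "control"
```

A participant who reports nothing relevant falls into the control group, which matches the description of controls as those who did not report managing the relevant condition.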
- the PRS system generates case and control groups prior to performing some or all of the QC operations.
- the PRS system performs imputation to generate values for missing genomic information in the QC’d DTC datasets, thereby generating imputed DTC datasets, i.e. datasets that include the QC’d DTC data and imputed data.
- the PRS system performs imputation using reference genomic data from a population with known ancestry or ancestry distribution and using an imputation method, for example the Beagle method.
- DTC genetic data may be generated using one or more of multiple array types and a database of DTC genetic data may include data generated by multiple array types, each of which may produce a dataset with different characteristics.
- the PRS system uses imputation across array types (up-imputing) to harmonize these diverse datasets.
- the PRS system merges the imputed DTC data to generate merged DTC data.
- the PRS system merges multiple imputed DTC datasets, each of which may correspond to a different array type, to generate the merged DTC data.
- the PRS system performs additional QC operations on the merged DTC data.
- the PRS system includes in the merged DTC data only imputed markers having a call rate greater than a threshold value, for example greater than 0.95.
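The call-rate filter can be sketched as follows, taking call rate to be the fraction of individuals with a non-missing genotype at a marker. The data layout (a dict from marker ID to a list of genotype calls, with `None` for a missing call) and the function names are assumptions made for illustration.

```python
def call_rate(calls):
    """Fraction of non-missing genotype calls for one marker."""
    if not calls:
        return 0.0
    return sum(1 for call in calls if call is not None) / len(calls)


def filter_markers(marker_calls, threshold=0.95):
    """Keep only markers whose call rate is strictly greater than the threshold."""
    return {
        marker: calls
        for marker, calls in marker_calls.items()
        if call_rate(calls) > threshold
    }
```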
- the PRS system identifies individual ancestry of participants included in the DTC dataset.
- the PRS system performs principal component analysis (PCA) to identify global ancestry per individual in the merged DTC data, using reference genomic data from a population with known ancestry or ancestry distribution.
- the PRS system performs PCA using a PCA method, for example PLINK 2, thereby generating a set of genetic principal components (PCs).
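To make the notion of a genetic PC concrete, the sketch below computes per-individual scores on the first principal component of a small genotype dosage matrix using power iteration. This is a plain-Python stand-in chosen for illustration and is not how PLINK 2 computes PCs; the function name and the toy matrix are hypothetical, and projection onto a reference population is not shown.

```python
import math


def first_pc(genotypes, iterations=100):
    """Scores of each individual on the first PC of a dosage matrix (rows = individuals)."""
    n, m = len(genotypes), len(genotypes[0])
    # Center each variant (column) so that the PC captures variation, not the mean.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Power iteration on X^T X converges to the leading loading vector.
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        scores = [sum(xi[j] * v[j] for j in range(m)) for xi in x]
        w = [sum(x[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # start vector orthogonal to the data, or no variation at all
        v = [c / norm for c in w]
    return [sum(xi[j] * v[j] for j in range(m)) for xi in x]


# Toy data: two groups of individuals with opposite allele patterns.
pc1 = first_pc([[0, 0, 2], [0, 0, 2], [2, 2, 0], [2, 2, 0]])
```

In the toy data the two groups receive scores of opposite sign, which is the kind of population structure that genetic PCs are meant to capture when used as covariates.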
- the PRS system selects variables to be included as covariables in a genome-wide association study (GWAS) model.
- the top ten genetic PCs, age, sex, and the merged DTC data are selected as covariables in the model.
- the PRS system operates the GWAS model to generate GWAS model output.
- GWAS model output may include genomic risk loci, e.g. genetic variants, for example SNPs, or blocks of correlated genetic variants that show a statistically significant association with a trait of interest.
- the GWAS is performed using an additive genetic model, for example using PLINK 2.
- the PRS system may, optionally, at operation 1052 validate the GWAS results that are generated at operation 1050.
- the PRS system may validate the GWAS results using one or more independent GWAS meta-analysis datasets, for example by comparing the p-values and the effect sizes for the variants assessed in the GWAS results that have identical chromosomal coordinates and alleles with the independent GWAS meta-analysis data sets.
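The matching step of this validation can be sketched as a join on chromosome, position, and both alleles. The record field names are hypothetical, and the direction-of-effect concordance computed here is only one possible way to compare effect sizes; the source does not name a specific comparison statistic.

```python
def variant_key(record):
    """Identity of a variant: chromosomal coordinates plus both alleles."""
    return (record["chrom"], record["pos"], record["effect_allele"], record["other_allele"])


def match_variants(internal, external):
    """Pair internal GWAS records with external records for the identical variant."""
    external_index = {variant_key(r): r for r in external}
    return [
        (r, external_index[variant_key(r)])
        for r in internal
        if variant_key(r) in external_index
    ]


def sign_concordance(pairs):
    """Fraction of matched variants whose effect sizes point in the same direction."""
    if not pairs:
        return None
    agree = sum(1 for a, b in pairs if (a["beta"] > 0) == (b["beta"] > 0))
    return agree / len(pairs)
```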
- the GWAS results are displayed by the PRS system, for example using a GWAS results display program.
- the results are communicated to a processor running the qqman package in R and the qqman is used to format the results for presentation on a display device, for example on a display screen of a computer system.
- the PRS system may also store the results in a local or external data store.
- the GWAS results are also communicated to a PRS model generation module or system for use in generating one or more PRS models.
- the PRS system determines estimates of global ancestry for use in adjusting PRS models for ancestry.
- the PRS system utilizes the results of global ancestry inference as a covariate in the training of the PRS models.
- the global ancestry inference is generated, by the PRS system or by an external system, using an ancestry inference method, for example using Neural ADMIXTURE, with reference genomic data from a population with known ancestry or ancestry distribution.
- the PRS system, or an external system, uses data from the 1000 Genomes Project Consortium for training a model in the supervised mode of Neural ADMIXTURE with default parameters.
- the PRS system generates one or more ancestry-adjusted PRS models based on the GWAS results generated at operation 1050 and, in some embodiments, using the results of global ancestry inference generated in operation 1065 as a covariate in the training of the PRS models.
- the PRS system generates the one or more ancestry-adjusted PRS models using a method, for example using a meta-algorithm (algorithms that learn from the output of other algorithms), e.g. using the Batch Screening Iterative LASSO (BASIL) algorithm.
- genetic ancestry is embedded by scaling PRS scores as part of a post-processing operation when the PRS models are operated to generate PRS scores.
- the PRS scores that are scaled may be generated using various PRS models or using ancestry-adjusted PRS models.
- the PRS system provides the ancestry-adjusted PRS models to one or more output locations, which can include, for example, data stores (internal or external), a PRS model repository, and a PRS execution engine, which operates a PRS model on genome data to generate PRS.
- the PRS model is processed through a PRS pipeline of the present disclosure.
- FIGS. 3A and 3B show two example embodiments of PRS pipelines 2000 and 2010 for processing PGS models. Both pipelines are configured to operate on inputs 2100, including a VCF file 2110 and a PGS model 2120, to generate PRS outputs including scores and bins 2510 and featured SNPs 2520.
- the VCF file 2110 typically includes genotype data, and more specifically variants such as SNPs, corresponding to a particular individual for whom a PRS is desired.
- a PGS model 2120 can include any model in PGS catalog format, for example a PGS model available from an external PGS catalog or an ancestry-adjusted PRS model generated according to the subject technology disclosure herein.
- each PRS pipeline 2000 and 2010 includes a run models engine 2200 which operates a PGS model 2120 on a VCF file 2110 to generate PRS output.
- Each run models engine 2200 includes an impute missing variants engine 2210, a filter VCF variants engine 2220, and a calculate PRS engine 2230.
- Each impute missing variants engine 2210 operates on genotype data, for example genotype data included in the VCF file 2110, to impute missing variants therein. Missing variants may include any variants that are included in the PGS model 2120 but not observed in the VCF file 2110.
- Each filter VCF variants engine 2220 operates on genotype data to filter out variants that are not useful as inputs to the PGS model 2120.
- the filter VCF variants engine 2220 performs filtering operations on imputed data generated by the impute missing variants engine 2210.
- the filter VCF variants engine 2220 performs filtering operations prior to imputation by the impute missing variants engine 2210.
- the filter VCF variants engine 2220 performs additional QC operations to remove outliers and other non-useful data.
- the filter VCF variants engine 2220 may perform one or more of the QC operations described in relation to operation 1015 (Perform QC on received data) of method 1000.
- Each calculate PRS engine 2230 operates on imputed and filtered VCF data to generate PRS outputs, for example PRS outputs 2500.
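The core of the run models engine can be sketched in the usual additive form, in which a PRS is the sum over model variants of effect weight times effect-allele dosage. The function names, variant identifiers, and dict-based data layout are hypothetical; parsing of VCF and PGS catalog files, imputation, and binning are not shown.

```python
def find_missing_variants(dosages, weights):
    """Variants included in the PGS model but not observed in the genotype data."""
    return sorted(variant for variant in weights if variant not in dosages)


def calculate_prs(dosages, weights):
    """Sum of effect weight x effect-allele dosage over variants present in both inputs."""
    return sum(
        weight * dosages[variant]
        for variant, weight in weights.items()
        if variant in dosages
    )


# Hypothetical inputs: dosages (0, 1, or 2 copies of the effect allele) and model weights.
dosages = {"rs1": 2, "rs2": 1}
weights = {"rs1": 0.5, "rs2": -0.25, "rs3": 0.1}
to_impute = find_missing_variants(dosages, weights)
score = calculate_prs(dosages, weights)
```

In this sketch a model variant that is absent from the genotype data is simply skipped in the sum, which is why the pipeline imputes missing variants before scoring.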
- PRS pipeline 2010 includes additional ancestry-related engines, including an ancestry prediction engine 2300 and an ancestry scaling engine 2400.
- the ancestry prediction engine 2300 generates global ancestry predictions 2310.
- the ancestry prediction engine 2300 operates a trained global ancestry prediction model on genotype data from a VCF file 2110 to generate the global ancestry predictions.
- the ancestry prediction engine 2300 operates the trained global ancestry prediction model on filtered and/or imputed genotype data generated by the run models engine 2200.
- the trained global ancestry prediction model includes an instance of a Neural ADMIXTURE model, trained as described elsewhere herein.
- the ancestry scaling engine 2400 receives PRS outputs, e.g. PRS outputs 2500, from the run models engine 2200 and global ancestry predictions 2310 from the ancestry prediction engine 2300.
- the ancestry scaling engine uses the global ancestry predictions 2310 to scale the PRS outputs, thereby generating ancestry-adjusted PRS outputs 2600 including scores and bins 2610 and featured SNPs 2620.
- the ancestry scaling engine operates an ancestry scaling ML model trained on ancestry and PRS data to scale the PRS outputs using the global ancestry predictions.
- a first embodiment of the ancestry scaling ML model is trained using results generated by a particular PRS model 2120, global ancestry information, and risk scores associated with the global ancestry information. The trained first embodiment of the ancestry scaling ML model is then used to scale risk scores generated by the particular PRS model according to global ancestry predictions 2310 corresponding to a subject individual for whom VCF file 2110 is received by the PRS pipeline.
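The idea of scaling a raw score by global ancestry can be illustrated with a much simpler stand-in than the trained ML model described above: standardizing the raw PRS against an ancestry-weighted reference mean and standard deviation. This is an assumption made purely for illustration and is not the disclosed scaling model; the function name, population labels, and reference statistics are hypothetical.

```python
def scale_prs(raw_score, ancestry_fractions, reference_stats):
    """Standardize a raw PRS against ancestry-weighted reference statistics.

    ancestry_fractions: population label -> estimated global ancestry fraction
    reference_stats:    population label -> (mean, standard deviation) of raw PRS
    """
    mean = sum(frac * reference_stats[pop][0] for pop, frac in ancestry_fractions.items())
    # Weighting the standard deviations linearly is a simplification for illustration.
    sd = sum(frac * reference_stats[pop][1] for pop, frac in ancestry_fractions.items())
    if sd <= 0:
        raise ValueError("reference standard deviation must be positive")
    return (raw_score - mean) / sd
```

For an individual estimated at 50% ancestry from each of two hypothetical populations with reference statistics (1.0, 1.0) and (3.0, 3.0), a raw score of 4.0 is scaled against a mean of 2.0 and a standard deviation of 2.0.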
- the PRS pipelines include embodiments of PRS system 100.
- the impute missing variants engine 2210 may include or correspond to an imputation module 125 of PRS system 100.
- the filter VCF variants engine 2220 may include or correspond to QC module 115; and calculate PRS engine 2230 may include or correspond to PRS execution module 140.
- Ancestry prediction engine 2300 and ancestry scaling engine 2400 may each include or correspond to the same or separate instances of ancestry module 145.
- Genotype data and/or phenotype data may be generated from a subject.
- the subject may be suspected of suffering from a disease or condition, such as Type 2 diabetes (T2D) or hypertension.
- the subject may be asymptomatic for the disease or condition.
- the disease or condition may not exhibit any symptoms and the subject may be unaware of the presence of disease or condition.
- the methods described herein may allow a disease or condition to be identified at an earlier stage than otherwise. The identification of the presence of the disease or condition at an earlier stage may allow a treatment option or recommendation to be determined at an earlier stage and may allow the subject to have an improved prognosis.
- Biological samples may be obtained or derived from the subject.
- the biological sample may comprise nucleic acids.
- the biological sample may be a cell-free deoxyribonucleic acid (cfDNA) sample or a cell-free ribonucleic acid (cfRNA) sample.
- the biological sample may comprise genomic DNA or germline DNA (gDNA).
- the nucleic acid may be a DNA (e.g. double-stranded DNA, single-stranded DNA, single-stranded DNA hairpins, cDNA, genomic DNA, germline DNA, circulating tumor DNA (ctDNA), cell-free DNA (cfDNA)), an RNA (e.g.
- the biological sample may be derived from or contain a biological fluid.
- the biological sample may be a plasma sample, a serum sample, a buffy coat sample, a peripheral blood mononuclear cell (PBMC) sample, a red blood cell sample, a urine sample, a saliva sample, or other body fluid sample.
- the biological sample may comprise or be a pleural fluid sample, peritoneal fluid sample, amniotic fluid sample, cerebrospinal fluid sample, lymphatic fluid sample, sweat sample, tear sample, semen sample, or any combination of biological fluid.
- the samples may comprise RNA and DNA.
- a sample may comprise DNA or RNA, and may be analyzed by various methods (e.g., for determining genotype and/or phenotype).
- the biological sample may be collected, obtained, or derived from the subject using a collection tube.
- the collection tube may be an ethylenediaminetetraacetic acid (EDTA) collection tube, a cell-free RNA collection tube, a cell-free deoxyribonucleic acid (DNA) collection tube, a CTC collection tube, or another blood collection tube.
- the collection tube may comprise additional reagents for stabilizing the nucleic acid molecules or blood cells.
- the collection tube may allow the nucleic acid or blood cells to remain stable so as to minimize degradation of the biological sample prior to assaying.
- the additional reagents may comprise buffer salts or chelators.
- the biological sample may be obtained or derived from a subject at various times.
- the biological sample may be obtained or derived from a subject prior to the subject receiving a therapy for a disease or condition.
- the biological sample may be obtained or derived from a subject during receiving a therapy for the disease or condition.
- the biological sample may be obtained or derived from a subject after receiving a therapy for the disease or condition.
- the biological sample may be collected over 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 or more time points.
- the time points may occur over a 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 55, 60 or more hour period.
- the time points may occur over a 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 55, 60 or more day period.
- the time points may occur over a 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 55, 60 or more week period.
- a clinical intervention or a therapy may be identified at least in part based on the identification of the presence, likelihood, or elevated risk of the disease or condition.
- the clinical intervention may be a plurality of clinical interventions.
- the clinical intervention may be selected from a plurality of clinical interventions.
- the clinical interventions may be administered to the subject.
- a sample may be obtained or derived from the subject so as to monitor the treatment. Additionally, by performing the methods or systems iteratively, therapies or clinical interventions may be updated based on the results of the methods.
- the monitoring of the treatment may include an assessment as well as a difference in assessment from a previously generated assessment.
- the difference in an assessment of the disease or condition in the subject among a plurality of time points (or samples) may be indicative of one or more clinical indications such as a diagnosis of the disease or condition, a prognosis of the disease or condition, or an efficacy or non-efficacy of a course of treatment for treating the disease or condition of said subject.
- the biological samples may be subjected to additional reactions or conditions prior to assaying.
- the biological sample may be subjected to conditions that are sufficient to isolate, enrich, or extract nucleic acids, such as DNA molecules or RNA molecules.
- the methods disclosed herein may comprise conducting one or more enrichment reactions on one or more nucleic acid molecules in a sample.
- the enrichment reactions may comprise contacting a sample with one or more beads or bead sets.
- the enrichment reactions may comprise one or more hybridization reactions.
- the enrichment reactions may comprise contacting a sample with one or more capture probes or bait molecules that hybridize to a nucleic acid molecule of the biological sample.
- the enrichment reaction may comprise differential amplification of a set of nucleic acid molecules.
- the enrichment reaction may enrich for a plurality of genetic loci or sequences corresponding to genetic loci.
- the enrichment reaction may enrich for sequences corresponding to certain genes.
- the enrichment reactions may comprise the use of primers or probes that may have complementarity to sequences (or sequences upstream or downstream) of a sequence that is to be enriched.
- a capture probe may comprise sequence complementarity to a set of genomic loci and allow the enrichment of the genomic loci.
- the enrichment reactions may comprise a plurality of probes or primers.
- a plurality of probes may comprise 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, or 180 different probes.
- the methods disclosed herein may comprise conducting one or more isolation or purification reactions on one or more nucleic acid molecules in a sample.
- the isolation or purification reactions may comprise contacting a sample with one or more beads or bead sets.
- the isolation or purification reaction may comprise one or more hybridization reactions, enrichment reactions, amplification reactions, sequencing reactions, or a combination thereof.
- the isolation or purification reaction may comprise the use of one or more separators.
- the one or more separators may comprise a magnetic separator.
- the isolation or purification reaction may comprise separating bead bound nucleic acid molecules from bead free nucleic acid molecules.
- the isolation or purification reaction may comprise separating capture probe hybridized nucleic acid molecules from capture probe free nucleic acid molecules.
- the isolation reactions may comprise removing or separating a group of nucleic acid molecules from another group of nucleic acids.
- the methods disclosed herein may comprise conducting extraction reactions on one or more nucleic acids in a biological sample.
- the extraction reactions may lyse cells or disrupt nucleic acid interactions with the cell such that the nucleic acids may be isolated, purified, enriched or subjected to other reactions.
- the methods disclosed herein may comprise amplification or extension reactions.
- the amplification reactions may comprise polymerase chain reaction.
- the amplification reaction may comprise PCR-based amplifications, non-PCR based amplifications, or a combination thereof.
- the one or more PCR-based amplifications may comprise PCR, qPCR, nested PCR, linear amplification, or a combination thereof.
- the one or more non-PCR based amplifications may comprise multiple displacement amplification (MDA), transcription-mediated amplification (TMA), nucleic acid sequence-based amplification (NASBA), strand displacement amplification (SDA), real-time SDA, rolling circle amplification, circle-to-circle amplification or a combination thereof.
- the amplification reactions may comprise an isothermal amplification.
- the method disclosed herein may comprise a barcoding reaction.
- a barcoding reaction may comprise the addition of a barcode or tag to the nucleic acid.
- the barcode may be a molecular barcode or a sample barcode.
- a barcode nucleic acid may comprise a barcode sequence which may be a degenerate n-mer. The sequence may be randomly generated or generated so as to synthesize a specific barcode sequence.
- the barcode nucleic acid may be added to a sample so as to label the nucleic acid molecules in the sample.
- the barcodes may be specific to a sample.
- a plurality of barcode nucleic acids may be added to a sample in which the barcode sequence is the same.
- nucleic acids originating from a same sample may have a same barcode sequence, which may allow a nucleic acid to be identified as belonging to a particular or given sample.
- a molecular barcode may also be used such that each molecule (or a plurality of molecules) in a same volume have a different molecular barcode.
- This barcode may be subjected to amplification such that all amplicons derived from a molecule have the same barcode. In this way, molecules originating from a same molecule may be identified.
- the sequence reads may be processed based on the barcode sequences.
- the processing may reduce errors or allow a molecule to be tracked.
- Barcode sequences may be appended or otherwise added or incorporated into a sequence by various reactions, for example an amplification, extension, or ligation reaction, and may be performed enzymatically using a nucleic acid polymerase or ligase.
- the ligation may be an overhang or blunt end ligation and the barcodes may comprise complementarity to nucleic acids to be barcoded. This complementarity may be a sequence derived from the sample from the subject or may be constant sequence generated via a reaction performed on the nucleic acids in the sample.
- the biological sample may comprise multiple components.
- the biological sample may be a whole blood sample.
- the biological sample may be subjected to reactions so as to separate or fractionate the biological sample.
- a whole blood sample may be fractionated and cell free nucleic acids may be obtained.
- the whole blood sample may be fractionated using centrifugation such that blood cells may be separated from the plasma (which may contain cell free nucleic acid).
- a sample may be subjected to multiple rounds of separation or fractionation.
- the nucleic acids may be subjected to sequencing reactions.
- the sequencing reactions may be used on DNA, RNA, or other nucleic acid molecules.
- Examples of sequencing reactions that may be used include capillary sequencing, next generation sequencing, Sanger sequencing, sequencing by synthesis, single molecule nanopore sequencing, sequencing by ligation, sequencing by hybridization, sequencing by nanopore current restriction, or a combination thereof.
- Sequencing by synthesis may comprise reversible terminator sequencing, processive single molecule sequencing, sequential nucleotide flow sequencing, or a combination thereof.
- Sequential nucleotide flow sequencing may comprise pyrosequencing, pH-mediated sequencing, semiconductor sequencing or a combination thereof.
- the sequencing reactions may comprise whole genome sequencing, whole exome sequencing, low-pass whole genome sequencing, targeted sequencing, methylation-aware sequencing, enzymatic methylation sequencing, bisulfite methylation sequencing.
- the sequencing reaction may be a transcriptome sequencing, mRNA-seq, totalRNA-seq, smallRNA-seq, exosome sequencing, or combinations thereof. Combinations of sequencing reactions may be used in the methods described elsewhere herein.
- a sample may be subjected to whole genome sequencing and whole transcriptome sequencing.
- where the samples comprise multiple types of nucleic acids (e.g. RNA and DNA), sequencing reactions specific to DNA or RNA may be used so as to obtain sequence reads relating to the nucleic acid type.
- the sequencing of nucleic acids may generate sequencing read data.
- the sequencing reads may be processed so as to generate data of improved quality.
- the sequencing reads may be generated with a quality score.
- the quality score may indicate an accuracy of a sequence read or a level of signal above a noise threshold for a given base call.
- the quality scores may be used for filtering sequencing reads. For example, sequencing reads may be removed that do not meet a particular quality score threshold.
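A minimal sketch of quality-based read filtering follows. It assumes reads carry FASTQ-style quality strings in the common Phred+33 encoding, where each character encodes a score Q and the estimated error probability of the base call is 10^(-Q/10). The mean-quality threshold of 20 and the function names are illustrative assumptions; the source does not specify a threshold or an encoding.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into per-base Phred scores (Phred+33 by default)."""
    return [ord(char) - offset for char in quality_string]


def passes_quality(quality_string, min_mean_q=20.0):
    """True if the read is non-empty and its mean Phred score meets the threshold."""
    scores = phred_scores(quality_string)
    return bool(scores) and sum(scores) / len(scores) >= min_mean_q


def filter_reads(reads, min_mean_q=20.0):
    """Keep (sequence, quality_string) pairs that meet the quality threshold."""
    return [(seq, qual) for seq, qual in reads if passes_quality(qual, min_mean_q)]
```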
- the sequencing reads may be processed so as to generate a consensus sequence or consensus base call.
- a given nucleic acid (or nucleic acid fragment) may be sequenced and errors in the sequence may be generated due to reactions prior or during sequencing. For example, amplification or PCR may generate error in amplicons such that the sequences are not identical to a parent sequence.
- error correction may include identifying sequence reads that do not corroborate with other sequences from a same sample or same original parent molecules.
- the use of barcodes may allow the identification of a same parent or sample.
- the sequence reads may be processed by performing single strand consensus calling or double stranded consensus calling, thereby reducing or suppressing error.
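Single strand consensus calling with molecular barcodes can be sketched as grouping reads by barcode and taking a per-position majority vote. This is a minimal sketch under stated assumptions: reads in a group are already aligned to the same start position, ties are broken by first occurrence, and quality scores and the second strand are ignored. The function name and the toy reads are hypothetical.

```python
from collections import Counter, defaultdict


def consensus_by_barcode(reads):
    """Build one consensus sequence per barcode from (barcode, sequence) pairs."""
    groups = defaultdict(list)
    for barcode, sequence in reads:
        groups[barcode].append(sequence)
    consensus = {}
    for barcode, sequences in groups.items():
        # Only positions covered by every read in the group are called.
        length = min(len(sequence) for sequence in sequences)
        consensus[barcode] = "".join(
            Counter(sequence[i] for sequence in sequences).most_common(1)[0][0]
            for i in range(length)
        )
    return consensus
```

An amplification error that appears in only one of three reads sharing a barcode is outvoted, so the consensus recovers the sequence of the parent molecule.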
- the methods and systems of the present disclosure may be implemented, at least in part, in hardware, software, firmware or any combination thereof.
- various aspects of the described techniques may be implemented within one or more processors, including one or more microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or any other equivalent integrated or discrete logic circuitry, as well as any combinations of such components.
- the term “processor” or “processing circuitry” may generally refer to any of the foregoing logic circuitry, alone or in combination with other logic circuitry, or any other equivalent circuitry.
- a control unit comprising hardware may also perform one or more of the techniques of this disclosure.
- Such hardware, software, and firmware may be implemented within the same device or within separate devices to support the various operations and functions described in this disclosure.
- any of the described units, modules or components may be implemented together or separately as discrete but interoperable logic devices. Depiction of different features as modules or units is intended to highlight different functional aspects and does not necessarily imply that such modules or units must be realized by separate hardware or software components. Rather, functionality associated with one or more modules or units may be performed by separate hardware or software components or integrated within common or separate hardware or software components.
- the methods and systems of the present disclosure may also be embodied or encoded in a computer-readable medium, such as a computer-readable storage medium, containing instructions.
- Computer-readable media may include non-transitory computer-readable storage media and transient communication media.
- Computer-readable storage media, which are tangible and non-transitory, may include random access memory (RAM), read only memory (ROM), programmable read only memory (PROM), erasable programmable read only memory (EPROM), electronically erasable programmable read only memory (EEPROM), flash memory, a hard disk, a CD-ROM, a floppy disk, a cassette, magnetic media, optical media, or other computer-readable storage media.
- FIG. 12 shows a computer system 1201 that is programmed or otherwise configured to perform operations of the methods, for example, generating trained ancestry-adjusted PRS models and harmonizing data.
- the computer system 1201 can regulate various aspects of methods and systems of the present disclosure, such as, for example, perform an algorithm, input training data, analyze data, store data in a repository, or output a result for the user.
- the computer system 1201 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device.
- the electronic device can be a mobile electronic device.
- the computer system 1201 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 1205, which can be a single core or multi core processor, or a plurality of processors for parallel processing.
- the computer system 1201 also includes memory or memory location 1210 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 1215 (e.g., hard disk), communication interface 1220 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 1225, such as cache, other memory, data storage and/or electronic display adapters.
- the memory 1210, storage unit 1215, interface 1220 and peripheral devices 1225 are in communication with the CPU 1205 through a communication bus (solid lines), such as a motherboard.
- the storage unit 1215 can be a data storage unit (or data repository) for storing data.
- the computer system 1201 can be operatively coupled to a computer network (“network”) 1230 with the aid of the communication interface 1220.
- the network 1230 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet.
- the network 1230 in some cases is a telecommunication and/or data network.
- the network 1230 can include one or more computer servers, which can enable distributed computing, such as cloud computing.
- the network 1230, in some cases with the aid of the computer system 1201, can implement a peer-to-peer network, which may enable devices coupled to the computer system 1201 to behave as a client or a server.
- the CPU 1205 can execute a sequence of machine-readable instructions, which can be embodied in a program or software.
- the instructions may be stored in a memory location, such as the memory 1210.
- the instructions can be directed to the CPU 1205, which can subsequently program or otherwise configure the CPU 1205 to implement methods of the present disclosure. Examples of operations performed by the CPU 1205 can include fetch, decode, execute, and writeback.
- the CPU 1205 can be part of a circuit, such as an integrated circuit.
- One or more other components of the system 1201 can be included in the circuit.
- the circuit is an application specific integrated circuit (ASIC).
- the storage unit 1215 can store files, such as drivers, libraries and saved programs.
- the storage unit 1215 can store user data, e.g., user preferences and user programs.
- the computer system 1201 in some cases can include one or more additional data storage units that are external to the computer system 1201, such as located on a remote server that is in communication with the computer system 1201 through an intranet or the Internet.
- the computer system 1201 can communicate with one or more remote computer systems through the network 1230.
- the computer system 1201 can communicate with a remote computer system of a user (e.g., a medical professional or patient).
- Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, smartphones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants.
- the user can access the computer system 1201 via the network 1230.
- Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 1201, such as, for example, on the memory 1210 or electronic storage unit 1215.
- the machine executable or machine readable code can be provided in the form of software.
- the code can be executed by the processor 1205.
- the code can be retrieved from the storage unit 1215 and stored on the memory 1210 for ready access by the processor 1205.
- the electronic storage unit 1215 can be precluded, and machine-executable instructions are stored on memory 1210.
- the code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime.
- the code can be supplied in a programming language that can be selected to enable the code to execute in a precompiled or as-compiled fashion.
- aspects of the systems and methods provided herein can be embodied in programming.
- Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium.
- Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk.
- “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server.
- another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links.
- Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings.
- Volatile storage media include dynamic memory, such as main memory of such a computer platform.
- Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system.
- Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications.
- Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data.
- Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
- the computer system 1201 can include or be in communication with an electronic display 1235 that comprises a user interface (UI) 1240 for providing, for example, an input of data (e.g., genotype data and/or phenotype data), or a visual output.
- Examples of UIs include, without limitation, a graphical user interface (GUI) and a web-based user interface.
- Methods and systems of the present disclosure can be implemented by way of one or more algorithms.
- An algorithm can be implemented by way of software upon execution by the central processing unit 1205.
- the algorithm can, for example, generate trained ancestry-adjusted PRS models and harmonize data.
- Example 1 Implementation of the PRS system for generation of Type 2 Diabetes (T2D) and Hypertension ancestry-adjusted PRS models
- genotype data were collected from a novel direct-to-consumer (DTC) platform.
- Adult participants from an international genetic platform were invited to participate. Participants uploaded their genotype data files and were invited to self-report their health status and metabolic traits. In particular, they were invited to answer general health questionnaires regarding cardiometabolic traits over a period of 6 months.
- Participants were drawn from a research database that offers a DTC genetic traits platform with hundreds of thousands of users globally. After uploading their genetic information, generated on other DTC platforms, users can be informed of their susceptibility to an extensive set of genetic traits. All participants created an account and agreed to a legal agreement and a consent on the use of their data. Upon signing up, participants were invited to take an online health survey. Participants were redirected to the survey once they gave online consent to be part of the research.
- the online survey included questions about general conditions such as diabetes, blood pressure, lipid profile, and medication intake. It also included COVID-19, influenza and common cold-related questions along with age, sex, weight, height, and pandemic behavior. Data were collected over a period of six months, from May 01, 2021 to October 06, 2021.
- Genotype data quality control, imputation, and GWAS
- Genotype-level data for each array were processed by applying identical quality control and imputation procedures. Briefly, variants with a call rate of < 95% and palindromic markers (A/T, G/C, MAF > 0.4) were excluded. An exact test for Hardy-Weinberg equilibrium was performed for individuals of the largest ancestral group ($p < 1 \times 10^{-12}$, globally).
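- As an illustration only, the variant-level filters described above can be sketched in Python. The `Variant` record, its field names, and the function `keep_variant` are hypothetical and not part of the disclosure. The Hardy-Weinberg cutoff is taken here as 1e-12, which is an assumption about the stated threshold, and palindromic markers are dropped only when their MAF exceeds 0.4.

```python
from dataclasses import dataclass

# Thresholds follow the text; HWE_P_MIN is an assumed reading of the HWE cutoff.
MIN_CALL_RATE = 0.95
PALINDROME_MAF = 0.4
HWE_P_MIN = 1e-12

# Allele pairs that read the same on both strands.
PALINDROMIC = {frozenset("AT"), frozenset("GC")}


@dataclass
class Variant:
    """Hypothetical per-variant summary produced by an upstream QC step."""
    vid: str
    a1: str
    a2: str
    call_rate: float
    maf: float
    hwe_p: float


def keep_variant(v: Variant) -> bool:
    """Return True if the variant passes all three filters."""
    if v.call_rate < MIN_CALL_RATE:
        return False  # too many missing genotype calls
    if frozenset((v.a1, v.a2)) in PALINDROMIC and v.maf > PALINDROME_MAF:
        return False  # strand-ambiguous marker with near-balanced alleles
    if v.hwe_p < HWE_P_MIN:
        return False  # strong departure from Hardy-Weinberg equilibrium
    return True
```

A palindromic marker with a low MAF is kept under this rule, because its strand can usually still be resolved from allele frequencies.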
- Individual quality control included genotype call rates > 97%, matching between gender identification and chromosomal sex, and no excess ancestry-adjusted heterozygosity. Duplicates and samples genetically related to other individuals in the cohort were detected and removed by applying the KING algorithm (--make-king, KING estimate > 0.177; PLINK 2). Principal component analysis was performed with PLINK 2 to identify global ancestry per individual, using the 1000 Genomes Project as the reference population.
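- The relatedness filter can be illustrated with a small Python sketch that takes precomputed pairwise kinship estimates and drops samples until no pair exceeds the 0.177 cutoff. The function name, the input layout, and the greedy rule (remove the sample with the most remaining relatives first) are assumptions made for illustration; this is not claimed to reproduce the procedure PLINK 2 applies internally.

```python
def remove_related(samples, kinship_pairs, cutoff=0.177):
    """Drop samples so that no remaining pair has kinship above `cutoff`.

    samples:       list of sample identifiers
    kinship_pairs: iterable of (sample_a, sample_b, kinship_estimate)
    """
    # Build an adjacency map containing only pairs above the cutoff.
    related = {}
    for a, b, k in kinship_pairs:
        if k > cutoff:
            related.setdefault(a, set()).add(b)
            related.setdefault(b, set()).add(a)

    removed = set()
    while True:
        # Count, for each sample still present, its relatives still present.
        degree = {
            s: len(neighbors - removed)
            for s, neighbors in related.items()
            if s not in removed
        }
        degree = {s: d for s, d in degree.items() if d > 0}
        if not degree:
            break
        # Greedy choice; sorting first makes tie-breaking deterministic.
        worst = max(sorted(degree), key=degree.get)
        removed.add(worst)

    return [s for s in samples if s not in removed]
```

Removing the most connected sample first tends to retain more individuals than dropping one member of each pair independently.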
- the Batch Screening Iterative LASSO (BASIL) algorithm is a meta-algorithm (an algorithm that learns from the output of other algorithms), which employs a Lasso algorithm and enhances its output with another layer for faster variable selection in ultra-high-dimensional problems. Similar to the Lasso algorithm, BASIL may be used to find a parameter vector p whose components are the coefficients for the independent variables of the linear regression that approximates the solution of the problem.
- BASIL solves the Lasso solution path in an iterative fashion, starting with a sequence of candidate parameters. From these candidate solutions, each iteration discards the ones that do not meet the requirements to be a suitable solution.
- the variables that are included in the final set for a viable solution are those that were screened and satisfy a desired threshold requirement, while the others are discarded (i.e., those whose coefficients in the parameter vector are set to 0). This process is repeated until the optimum parameter vector is found, which is the one that minimizes the Lasso objective (20).
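- A minimal sketch of a screen, fit, and check loop in the spirit of BASIL is given here for a single penalty value, assuming the objective (1/2n)·||y − Xβ||² + λ·||β||₁. It fits a Lasso on a small screened set of variables, then uses the Karush-Kuhn-Tucker condition for excluded variables (|xⱼᵀr|/n ≤ λ) to decide whether any must be added before refitting. The function names, the batch size, and the plain coordinate-descent solver are assumptions for illustration; this is not the published BASIL implementation, which works along a full path of penalty values.

```python
import numpy as np


def soft_threshold(z, t):
    """Scalar soft-thresholding operator used by the Lasso update."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float)  # residual for beta = 0 (astype returns a copy)
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]          # remove variable j's contribution
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]          # add the updated contribution back
    return beta


def basil_like_fit(X, y, lam, batch=5, tol=1e-8):
    """Fit on a screened subset, then grow it until the KKT check passes."""
    n, p = X.shape
    # Screening: start with the variables most correlated with the outcome.
    score = np.abs(X.T @ y) / n
    active = list(np.argsort(-score)[:batch])
    while True:
        beta = np.zeros(p)
        beta[active] = lasso_cd(X[:, active], y, lam)
        # KKT check for excluded variables: |x_j' r| / n must not exceed lam.
        grad = np.abs(X.T @ (y - X @ beta)) / n
        grad[active] = 0.0
        violators = np.flatnonzero(grad > lam + tol)
        if violators.size == 0:
            return beta
        # Add the worst violators to the screened set and refit.
        worst = violators[np.argsort(-grad[violators])[:batch]]
        active.extend(worst.tolist())
```

The loop always terminates, because the screened set can only grow and the check trivially passes once every variable is included.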
- the BASIL algorithm is guaranteed to find the exact solution, not only an approximation, via the Karush-Kuhn-Tucker condition (the first-order necessary condition for a solution to be optimal), which is verified at each iteration. This condition is necessary and sufficient to prove the exact solution.
- Genetic ancestry
- Figure 4 shows genome-wide association results for (A) type 2 diabetes and (B) hypertension. Top boxes show Q-Q plots, while bottom figures show Manhattan plots with two levels of significance: $p < 5 \times 10^{-8}$ (red line) and $p < 1 \times 10^{-6}$ (blue line).
- T2D variants displayed significant evidence of replication ($p < 0.05$) in this dataset.
- variants were identified that are closely associated with genes linked to type 2 diabetes susceptibility (e.g., CDKAL1, KCNQ1), as well as variants in the FTO locus linked with both BMI and T2D.
- 164 out of 272 variants were identified as showing identical effect direction to genome-significant findings in Europeans.
- in the hypertension dataset, ten hypertension genetic markers were replicated, and 230 out of 365 variants were identified as having identical effect direction.
- the GWAS was validated using independent GWAS meta-analysis datasets from Mahajan et al. 2018 (74,124 T2D cases, 824,006 controls) and Evangelou et al. 2018 (757,201 individuals).
- the p-values and the effect sizes were compared for the variants assessed in both studies that had identical chromosomal coordinates and alleles with the independent GWAS.
- the direction of the effect sizes (estimated as OR) was set to match the effect alleles in each study. It was observed that the effect sizes of the genome-wide significant variants in the independent GWAS were concordant in directionality in both our T2D and hypertension GWAS.
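- The direction-of-effect comparison can be sketched as follows. Effects are expressed as log odds ratios, flipped when the effect allele is swapped relative to the reference study, and then counted as concordant when the signs agree. All function names are hypothetical. The one-sided sign test is an addition for illustration and is not stated in the text as the method used.

```python
import math


def align_effect(log_or, effect_allele, other_allele,
                 ref_effect_allele, ref_other_allele):
    """Express log(OR) relative to the reference study's effect allele.

    Returns None when the allele pair does not match the reference.
    """
    if (effect_allele, other_allele) == (ref_effect_allele, ref_other_allele):
        return log_or
    if (effect_allele, other_allele) == (ref_other_allele, ref_effect_allele):
        return -log_or  # same variant, opposite effect allele
    return None


def concordance(log_or_a, log_or_b):
    """Count variants whose aligned effects share the same sign."""
    same = sum(1 for a, b in zip(log_or_a, log_or_b) if a * b > 0)
    return same, len(log_or_a)


def sign_test_p(k, n):
    """One-sided binomial tail P(X >= k) for X ~ Binomial(n, 0.5)."""
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n


# Counts reported in the text: 164 of 272 (T2D) and 230 of 365 (hypertension).
for label, k, n in [("T2D", 164, 272), ("hypertension", 230, 365)]:
    print(label, round(k / n, 3), sign_test_p(k, n))
```

Under the null hypothesis of no shared signal, about half of the variants would be expected to agree in direction by chance, which is what the sign test evaluates.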
- PRS models were built for each phenotype using the BASIL algorithm.
- the outcome variable was binary (presence or absence of diabetes or hypertension), as reported by participants.
- the genotype-only models reported a predictive performance (AUC) of 0.56 for both diabetes and hypertension and increased to 0.68 for the full model (genotype and covariates together).
- in the European-only models, the genotype-only models reported a predictive performance of 0.57 and 0.53, increasing to 0.69 and 0.66 in the full model, for diabetes and hypertension respectively.
- Tabulated AUC results are shown in Figure 5A.
- Figure 5B illustrates the area under the curve (AUC), comparing receiver operating characteristic (ROC) curves between two models: the full model and the European-only model. Results for A) type 2 diabetes (T2D) were 0.68 and 0.69, respectively, while for B) hypertension they were 0.68 and 0.66, respectively. After applying the DeLong method of ROC comparison, the models were not significantly different.
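- For reference, the AUC reported for each model equals the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The sketch below computes it that way; the function name is hypothetical, and the DeLong comparison between two ROC curves is not implemented here.

```python
import numpy as np


def auc(scores, labels):
    """Area under the ROC curve via pairwise case/control comparisons."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    cases, controls = scores[labels], scores[~labels]
    # Compare every case score against every control score.
    diff = cases[:, None] - controls[None, :]
    # A case ranked above a control counts 1, a tie counts 0.5.
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())
```

The pairwise form uses memory proportional to the number of cases times the number of controls, so a rank-based formula is preferable for large cohorts.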
- the European Bioinformatics Institute (EBI) PGS Catalog is an open resource of published polygenic scores (including variants, alleles, and weights).
- Those published PRS were investigated for T2D and hypertension. For those with reported AUC, the number of variants and the number of individuals whose data was used to train the model under various ancestry groups was obtained.
- the ancestry-adjusted PRS models are comparable to other PRS models.
- Figure 6 shows the comparison in reported AUC for those models in the EBI PGS catalog including models generated by the inventors in accordance with the subject technology. The average AUC between these models was 0.70.
- the small number of variables used by the ancestry-adjusted PRS models (125 for T2D and 666 for hypertension) makes them comparable to those described by Tanigawa et al. [43] using the BASIL algorithm. Likewise, the number of individuals whose data was used to train the models is modest in comparison with large academic and clinical databases.
- Figure 6 illustrates comparison of PRS published in the EBI PGS Catalog for T2D and Hypertension.
- the color of the bubble represents the population ancestry that was included to build the PRS model.
- the size of the bubble represents the number of variables (variants) that ended up in the model after training.
- the x-axis shows the number of individuals used to train the model.
- the y-axis shows the AUC results as reported in the EBI PGS Catalog.
- the horizontal line shows the average AUC across all models.
- PRS models were generated for T2D and hypertension from a heterogeneous dataset housing a combination of genetic data and self-reported information from a DTC genetics company.
- these PRS models are able to identify subsets of users at substantially increased risk of presenting T2D or hypertension. This finding is remarkable because it demonstrates that the ever-increasing availability of genetic data from DTC providers, most of it not annotated for traits of clinical relevance, can be leveraged to generate predictive tools able to improve diagnosis and prevention of diseases with genetic determinants.
- DTC platforms can offer a wide range of information about personal wellness, ancestry, physical characteristics, and traits. Advances in genomic research may lead the DTC genomics industry to flourish and to deliver accurate yet easy-to-interpret genomic results. Strict privacy policies of many companies may prohibit them from sharing customers’ data without their consent. These platforms can serve as informative repositories giving actionable insights that aid traditional clinical approaches. The approach of subject recruitment for various complex phenotypes via online surveys is opening up multiple avenues to complement conventional research and clinical strategies. DTC platforms also provide convenience along with a wider reach to recruit participants from various locations. They overcome barriers ranging from single-point data collection centers to language restrictions, thus allowing the aggregation of data from places with different ancestries and demographics. Democratizing access to these genetic platforms and prediction tools may boost progress in precision medicine.
- Federated learning approaches can further improve the possibility to increase the power of studies in DTC genomic analysis, and meta-analysis can be done in combination with academic and clinical datasets (including those from large consortiums).
- the DTC platform and research strategy of the present disclosure are capable of replicating the reported results with a very fast turnaround time.
- the participation of individual customers in the platform allowed the generation of a rich dataset that enabled the creation of PRS cardiometabolic models.
- the comparable predictive performance of the ancestry-adjusted PRS models also is a great indication of how the present disclosure can be leveraged to quickly contribute more PRS models to the larger scientific community.
- publicly available PRS models for T2D and hypertension have an AUC of 0.7 on average as shown in Figure 6. This is still a low accuracy, and it is even lower when compared to the small difference between the full and genotype-only models.
- T2D and hypertension are multifactorial diseases that are impacted by genetic and environmental determinants, including lifestyle factors like nutrition and exercise habits.
- PRS models may have limitations to provide accurate disease predictions, which compels the need to interpret these findings with caution, especially when they come from DTC genetic services.
- the clinical actionability of PRS models has yet to be determined through pragmatic trials involving real-world data.
- the present disclosure provides a novel source of information that can shed light on this important issue. Therefore, providing personalized information about T2D and hypertension predisposition is poised to improve early diagnosis and prevention bringing precision medicine at scale for all.
- Example 2 Ancestry-adjusted PRS models
- Figure 7A depicts a correlation plot between two PGS models for gout, PGS001248 and PGS002030, both of which can be found in the polygenic score (PGS) catalog pgscatalog.org.
- Figure 7B shows a Q-plot between the two models.
- Figure 7C shows a correlation plot between the two models with ancestry. Ancestry is indicated by shading of the points in the plot as well as by distribution curves per ancestry corresponding to each of the models.
- Figure 8 depicts correlation plots between PGS models for liver enzymes, including correlation plots between PGS000670 and PGS002157 and between PGS000668 and PGS002158, all of which can be found in the PGS catalog.
- Figure 8 also depicts agreement matrices between the pairs of PGS models.
- Example 3 Ancestry-adjusted PRS models in non-European populations
- PRS model accuracy is not portable across populations (e.g., because of different causal alleles, different effect sizes for causal alleles, local epistatic interactions, or differential imputation accuracy).
- the problem remains that existing models are not clinically effective for the vast majority of the world’s people.
- the methods and systems of the present disclosure are used to develop and implement accurate PRS models in non-European populations.
- methods and systems of the present disclosure are used to dissect and “deconvolute” an individual’s genetic ancestry.
- the methods and systems of the present disclosure use algorithms that can accurately and quickly estimate genetic ancestry across an individual’s chromosomes, and may be implemented in both a standalone API and a DNAnexus App.
- a dashboard may be used to allow for the organization, manipulation and visualization of ancestry results.
- “Ancestry deconvolution” or local ancestry inference (LAI) techniques have become an increasingly important tool for better understanding the genetics of complex human diseases.
- Neural ADMIXTURE is a neural network autoencoder that can perform soft clustering of genomic sequences while inferring their ancestry composition.
- This neural network adopts the theoretical framework of the widely used ADMIXTURE algorithm, but incorporates recent advances in deep learning to provide faster computational times, more accurate clustering, the capability to estimate the ancestry of new sequences after training with very high speed, and simultaneous prediction of clustering results using multiple cluster numbers.
- G-Nomix, our second algorithm, provides high-resolution ancestry predictions, where an ancestry label is predicted at each windowed region of the genetic sequence.
- the method makes use of multiple machine learning classifiers, including logistic regression, support vector machines with a novel string kernel, gradient boosting trees (e.g., XGBoost), conditional random fields, and convolutions, providing a two-stage process, with a multitude of classifiers providing an initial ancestry estimate within windowed regions of the chromosome, and a second classifier refining the initial predictions and correcting potential phasing errors.
- Figure 10 depicts risk levels provided by PRS models.
- the best-performing PRS models in European individuals are determined for a dozen phenotypes of high medical interest by comparing predictability (AUC or $R^2$) and enrichment of cases in the highest risk percentiles across different models and strategies in a harmonized context. Further, this enables the generation and integration of an interactive PRS dashboard.
- PRS models applicable to all individuals, regardless of ancestral background, are developed.
- Figures 11A-11C depict the proportion of individuals per ancestry for a) UKBB and b) the Galatea Bio collection; c) ancestry deconvolution for the Galatea Bio collection generated using GB proprietary software.
- Genotype-phenotype data sets
- Available data sets include:
- UK Biobank (UKBB), a large-scale biomedical database and research resource containing in-depth genetic and health information from half a million UK participants.
- Global Biobank Meta-Analysis Initiative (GBMI): the Biobank of the Americas, as part of the GBMI consortium, has access to GWAS summary statistics from the other consortium members.
- a biobank is used with DNA samples from more than 500,000 unique individuals, including linked genotype-phenotype data from more than 50 thousand non-European individuals.
- Ancestry-aware PRS methods
- Methods and systems are developed for accurate phenotype prediction by integrating PRS with individual ancestral background information. Multi-ancestry input data are split into two independent tranches, consisting of discovery (80%) and testing (20%) data sets.
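- One way to produce the 80%/20% discovery and testing tranches is sketched below. Stratifying the split by ancestry label, so that each group appears in both tranches in similar proportion, is an assumption made here and is not stated in the text; the function name and arguments are likewise hypothetical.

```python
import numpy as np


def split_discovery_testing(sample_ids, ancestry, test_frac=0.2, seed=0):
    """Split samples into disjoint discovery and testing sets.

    The split is drawn separately inside each ancestry group (an assumption),
    so roughly `test_frac` of every group lands in the testing set.
    """
    rng = np.random.default_rng(seed)
    sample_ids = np.asarray(sample_ids)
    ancestry = np.asarray(ancestry)
    test_mask = np.zeros(len(sample_ids), dtype=bool)
    for group in np.unique(ancestry):
        idx = np.flatnonzero(ancestry == group)
        rng.shuffle(idx)
        n_test = int(round(test_frac * len(idx)))
        test_mask[idx[:n_test]] = True
    return sample_ids[~test_mask], sample_ids[test_mask]
```

Fixing the seed keeps the two tranches reproducible across runs, which matters because the testing set must never be seen during training.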
- PRS models are trained for selected traits/diseases using a supervised learning paradigm, using genetic sequence (SNPs), LAI information for each SNP, and phenotypic labels as inputs. Such models map input genetic sequences into predicted phenotypes. Multiple methods may be used, including linear techniques utilized by other PRS methods (e.g., snpnet or PRS-CSx), as well as machine learning-based non-linear models, including neural networks or gradient boosting trees.
- multiple training strategies are adopted, such as training models with samples coming from one unique population group (single-ancestry training), multiple population groups (multi-ancestry training), or from admixed individuals (admixed training), with population information detected with Neural ADMIXTURE and G-Nomix.
- a second stage is conducted using machine learning models, taking as input the predictions of the collection of PRS models of a sample for a given phenotype (single-phenotype), or for all available phenotypes (multi-phenotype), and all covariates (including principal components, high-resolution local ancestry inference predictions, and global ancestry predictions).
- the combined information of predicted PRS scores and predicted ancestry descriptors provides a low-dimensional genetic characterization of the phenotypic and ancestral information for each individual. This low-dimensional descriptor can be seen as a compression of a SNP sequence into a lower-dimensional phenotypic and ancestral-aware representation.
- Such machine learning “ensembling” models map the low-dimensional PRS+ancestry representations into predicted phenotypes.
- the machine learning model is able to provide additional robustness across ancestral populations, and boost the predictive performance of admixed individuals.
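- The second-stage "ensembling" step can be illustrated with a small stacking sketch: first-stage PRS predictions are concatenated with ancestry descriptors into a low-dimensional feature vector, and a second model maps that vector to a predicted phenotype. A regularized logistic regression fitted by gradient descent stands in for the second-stage model here purely as an assumption; the text does not specify the model class, and all function names and hyperparameters are hypothetical.

```python
import numpy as np


def build_features(prs_scores, global_ancestry, pcs):
    """Concatenate PRS predictions with ancestry descriptors per individual."""
    return np.hstack([prs_scores, global_ancestry, pcs])


def fit_logistic(Z, y, lr=0.1, n_iter=2000, l2=1e-3):
    """L2-regularized logistic regression fitted by batch gradient descent."""
    n, d = Z.shape
    A = np.hstack([np.ones((n, 1)), Z])  # prepend an intercept column
    w = np.zeros(d + 1)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ w))
        # Gradient of the mean log-loss; the intercept is not penalized.
        grad = A.T @ (p - y) / n + l2 * np.r_[0.0, w[1:]]
        w -= lr * grad
    return w


def predict_proba(w, Z):
    """Predicted probability of the phenotype for each individual."""
    A = np.hstack([np.ones((len(Z), 1)), Z])
    return 1.0 / (1.0 + np.exp(-A @ w))
```

In practice the second-stage model would be fitted on discovery data only and evaluated on the held-out testing tranche, separately for each population group.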
- performance is evaluated across multiple population groups, failure cases are analyzed, and explainable machine learning techniques are adopted to obtain better insights into what components are important in order to obtain accurate predictions.
- Integration and visualization
- A PRS dashboard is constructed for non-European individuals, applying the same methods previously described.
- PRS models integrating LAI can increase the performance across 1) different ancestries, and 2) admixed individuals.
- PRS models can be developed and trained to meet the following criteria: 1) increased AUC of LAI adjusted PRSs with respect to non-LAI adjusted PRS; and 2) increased percentage of cases identified in the tail of the PRS distribution.
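- The second criterion, the percentage of cases identified in the tail of the PRS distribution, can be computed as in the sketch below. The 5% tail is a hypothetical default chosen for illustration, since the text does not fix a percentile, and the function name is an assumption.

```python
import numpy as np


def cases_in_tail(prs, is_case, top_frac=0.05):
    """Fraction of all cases whose PRS falls in the top `top_frac` of scores."""
    prs = np.asarray(prs, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    cutoff = np.quantile(prs, 1.0 - top_frac)
    in_tail = prs >= cutoff
    return float(is_case[in_tail].sum() / is_case.sum())
```

Comparing this value between a LAI-adjusted PRS and a non-adjusted PRS on the same testing set gives a direct reading of the criterion, alongside the AUC comparison.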
- the sequencing and improved ancestry reference panel creation may be expected to proceed smoothly.
Abstract
In one aspect, a computer-implemented method for generating a trained ancestry-adjusted polygenic risk score (PRS) model comprises (a) processing, using a genetic principal component analysis (PCA) model, genotype data and phenotype data corresponding to one or more populations of individuals to generate genetic principal components (PCs) corresponding to the genotype and phenotype data; (b) processing, using a genome-wide association study (GWAS) model, the genotype data and the phenotype data, thereby generating GWAS results, wherein the GWAS model comprises covariates comprising at least one of the genetic PCs; (c) obtaining global ancestry estimates corresponding to individuals of the one or more populations of individuals; and (d) training the PRS model using at least the GWAS results and the global ancestry estimates as covariates, thereby generating the trained ancestry-adjusted PRS model.
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363449703P | 2023-03-03 | 2023-03-03 | |
| US63/449,703 | 2023-03-03 | ||
| US202363536315P | 2023-09-01 | 2023-09-01 | |
| US63/536,315 | 2023-09-01 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024186669A1 (fr) | 2024-09-12 |
Family
ID=92675500
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2024/018190 Pending WO2024186669A1 (fr) | 2023-03-03 | 2024-03-01 | Ancestry-adjusted polygenic risk score (PRS) models and model pipeline |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2024186669A1 (fr) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119296791A (zh) * | 2024-12-10 | 2025-01-10 | 神州医疗科技股份有限公司 | Disease prediction method and system fusing image recognition, large models and PRS |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20160371431A1 (en) * | 2015-06-22 | 2016-12-22 | Counsyl, Inc. | Methods of predicting pathogenicity of genetic sequence variants |
| US20200202038A1 (en) * | 2017-05-12 | 2020-06-25 | Massachusetts Institute Of Technology | Systems and methods for crowdsourcing, analyzing, and/or matching personal data |
| US20210358565A1 (en) * | 2017-01-24 | 2021-11-18 | Sequenom, Inc. | Methods and processes for assessment of genetic variations |
| US20220367063A1 (en) * | 2019-09-30 | 2022-11-17 | Myome, Inc. | Polygenic risk score for in vitro fertilization |
-
2024
- 2024-03-01 WO PCT/US2024/018190 patent/WO2024186669A1/fr active Pending
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| Vadapalli et al. | Artificial intelligence and machine learning approaches using gene expression and variant data for personalized medicine | |
| Bakhtiari et al. | Variable number tandem repeats mediate the expression of proximal genes | |
| Suwinski et al. | Advancing personalized medicine through the application of whole exome sequencing and big data analytics | |
| Shen et al. | SHEsisPlus, a toolset for genetic studies on polyploid species | |
| Bernau et al. | Cross-study validation for the assessment of prediction algorithms | |
| WO2023224709A1 (fr) | Systems and methods for detecting cellular pathway dysregulation in cancer specimens | |
| Dharanipragada et al. | iCopyDAV: Integrated platform for copy number variations—Detection, annotation and visualization | |
| Tárraga et al. | GEPAS, a web-based tool for microarray data analysis and interpretation | |
| Chen et al. | The hitchhikers’ guide to RNA sequencing and functional analysis | |
| Zhao et al. | Correction for population stratification in random forest analysis | |
| Wang et al. | Imputing gene expression in uncollected tissues within and beyond GTEx | |
| Margoliash et al. | Polymorphic short tandem repeats make widespread contributions to blood and serum traits | |
| Cazares et al. | maxATAC: Genome-scale transcription-factor binding prediction from ATAC-seq with deep neural networks | |
| WO2014113522A1 (fr) | Pharmacogenomic classification methods | |
| Pajuste et al. | FastGT: an alignment-free method for calling common SNVs directly from raw sequencing reads | |
| Kim et al. | MetaKTSP: a meta-analytic top scoring pair method for robust cross-study validation of omics prediction analysis | |
| Brown et al. | Expression reflects population structure | |
| Liang | Bioinformatics for biomedical science and clinical applications | |
| Readhead et al. | Translational bioinformatics approaches to drug development | |
| Brown et al. | Enhanced methods for local ancestry assignment in sequenced admixed individuals | |
| US20230253070A1 (en) | Systems and Methods for Detecting Cellular Pathway Dysregulation in Cancer Specimens | |
| Salleh et al. | Systematic pharmacogenomics analysis of a Malay whole genome: proof of concept for personalized medicine | |
| Ding et al. | xQTLbiolinks: a comprehensive and scalable tool for integrative analysis of molecular QTLs | |
| Alsaedi et al. | AI-powered precision medicine: utilizing genetic risk factor optimization to revolutionize healthcare | |
| WO2024186669A1 (fr) | Ancestry-adjusted polygenic risk score (PRS) models and model pipeline |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 24767648; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | WIPO information: entry into national phase | Ref document number: 2024767648; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2024767648; Country of ref document: EP; Effective date: 20251006 |