CN113066586A - Method for constructing disease classification model based on multi-gene risk scoring - Google Patents

Publication number
CN113066586A
Authority
CN
China
Prior art keywords
site
disease
sites
training set
model
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110355345.9A
Other languages
Chinese (zh)
Inventor
马玉昆
孙琼琳
温颜华
张晓伟
颜红
李伟华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Fruit Shell Biotechnology Co ltd
Original Assignee
Beijing Fruit Shell Biotechnology Co ltd
Application filed by Beijing Fruit Shell Biotechnology Co ltd
Priority to CN202110355345.9A
Publication of CN113066586A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Biophysics (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method for constructing a disease classification model based on multi-gene (polygenic) risk scoring. The method comprises the following steps: acquiring a GWAS statistical data file of the target disease in the population, and performing quality control on the sites; acquiring the genome-wide site genotypes of the training set and the test set and the disease states of the samples, performing quality control on the samples and the sites based on the genome-wide site typing data of the training set and the test set, and using the samples and sites that pass quality control for subsequent analysis; screening a site set or adjusting site effect values according to different strategies, and calculating the PRS of each sample using five different methods, each with several parameter settings; and constructing a disease classification model based on the PRS and the disease state of the training set samples, verifying the effectiveness of the disease classification model in the test set samples, and selecting the model that performs best in the test set as the disease predictor. The method provided by the invention can help to discover and prevent target diseases at an early clinical stage, and has important application value.

Description

Method for constructing disease classification model based on multi-gene risk scoring
Technical Field
The invention belongs to the field of bioinformatics, and particularly relates to a method for constructing a disease classification model based on multi-gene risk scoring.
Background
Complex diseases arise from the combined action of numerous factors. Over the past two decades, there has been increasing interest in studying the impact of genetic risk factors on human traits. Genome-wide association studies (GWAS) can determine associations between single nucleotide polymorphisms (SNPs) and phenotypic traits. The GWAS approach is widely applied, and many common genetic variations associated with complex diseases have been identified. Most genetic variations contribute little to disease risk individually, with effect values (odds ratios, OR) usually between 1.1 and 1.5, which is not sufficient to predict disease status directly. Their cumulative effect on disease risk has greater clinical significance.
Polygenic (multi-gene) risk scoring (PRS) is a statistical method that assesses the genetic risk of a disease or trait based on the genotype profile of an individual, i.e., the cumulative effect of multiple risk sites. In the classical PRS method, the PRS is calculated as the sum of all risk alleles of an individual, weighted by the effect values of those alleles. Research has found that calculating the PRS from a large number of SNPs predicts disease state better than using only the SNPs that reach GWAS significance. However, because not all sites influence the studied trait, finding the PRS that best predicts the risk of a complex disease is an important problem still to be solved.
Disclosure of Invention
The invention aims to discover target diseases at an early stage; to achieve this early detection, a disease classification model needs to be built.
The invention first claims a method for constructing a disease classification model based on multi-gene risk scoring, which comprises the following steps:
(1) obtaining GWAS statistical data of the target disease; the GWAS statistical data comprise, for each site, the rs number, the major allele, the minor allele, the effect value, and the significance P value of the association with the disease; site quality control is performed on the GWAS statistical data;
(2) obtaining typing information (i.e., site genotypes) for the genome-wide sites of the training set and the test set based on SNP chip or sequencing (e.g., high-throughput sequencing) data; the training set and the test set both contain disease and control samples;
(3) imputing the sample site information obtained from the SNP chip or sequencing data, using the site information of the 1000 Genomes Project as reference, to obtain the genotype information of the analysis sites; performing sample quality control and site quality control on the training set and the test set respectively;
(4) calculating the PRS of the training set samples with different methods and different parameters for each method, based on the sites of the training set samples after quality control and the GWAS statistical data after quality control;
(5) based on the training set data, constructing a logistic model with the PRS as the independent variable and the disease trait as the dependent variable, and evaluating the classification performance of the model in the training set by the area under the ROC curve (AUC) obtained from five-fold cross-validation; a separate logistic model is constructed for the PRS calculated with each parameter setting of each method;
(6) applying the PRS calculation method of the training set to the test set data, calculating the PRS of the test set samples, and obtaining the classification performance (AUC) of the model in the test set; the model with the best training set evaluation performance and test set classification performance is selected as the disease classification model.
In the step (1), the GWAS statistical data of the target disease may be downloaded from a database, or may be obtained by analyzing a large amount of data.
In the step (1), the site effect value may be an OR or Beta value.
In the step (1), the site quality control on the GWAS statistical data may comprise removing duplicate sites, removing ambiguous sites, and retaining sites with a minor allele frequency (MAF) greater than 0.01 and an imputation INFO value greater than 0.5 or 0.8.
In the step (1), an ambiguous site is one whose reference base and variant base are complementary (A/T or C/G), so that the strand of the alleles cannot be determined.
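The site quality control of step (1) can be sketched as a simple filter. The following is a minimal illustration, not the patent's implementation: the record layout `(rsid, a1, a2, maf, info)` and the function names are assumptions, and "removing duplicate sites" is interpreted here as dropping every copy of a repeated rs number.

```python
# Hypothetical record layout (an assumption): (rsid, allele1, allele2, maf, info)
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_ambiguous(a1, a2):
    """True for strand-ambiguous allele pairs (A/T or C/G)."""
    return COMPLEMENT.get(a1.upper()) == a2.upper()


def qc_gwas_sites(records, maf_min=0.01, info_min=0.8):
    """Keep sites that pass the duplicate, ambiguity, MAF and INFO filters."""
    counts = {}
    for rec in records:
        counts[rec[0]] = counts.get(rec[0], 0) + 1
    kept = []
    for rsid, a1, a2, maf, info in records:
        if counts[rsid] > 1:       # assumption: drop every copy of a duplicated rs number
            continue
        if is_ambiguous(a1, a2):   # A/T or C/G
            continue
        if maf <= maf_min or info <= info_min:
            continue
        kept.append((rsid, a1, a2, maf, info))
    return kept


demo = [
    ("rs1", "A", "G", 0.20, 0.95),   # passes every filter
    ("rs2", "A", "T", 0.30, 0.99),   # ambiguous pair
    ("rs3", "C", "T", 0.005, 0.99),  # MAF too low
    ("rs4", "G", "T", 0.10, 0.40),   # INFO too low
    ("rs5", "C", "A", 0.25, 0.90),   # duplicated rs number
    ("rs5", "C", "A", 0.25, 0.90),
]
passed = [rec[0] for rec in qc_gwas_sites(demo)]
print(passed)
```

The `info_min` default of 0.8 is one of the two thresholds named in the text; passing `info_min=0.5` gives the looser variant.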
In the step (2), the genome-wide sites and their genotype information need to be obtained from the SNP chip or sequencing data by bioinformatics analysis. The disease type in the test set is the same as in the training set, and the test set and the training set must not share any disease or normal samples.
In the step (3), the genotype information of the analysis sites may be obtained by performing, in sequence, phasing, imputation, and retention of imputed sites, based on the genome-wide site genotypes and the data of the 1000 Genomes Project. The imputed sites and the genotyped genome-wide sites are then combined to obtain the analysis sites.
In the step (3), imputation is based on the genome-wide site genotypes and the data of the 1000 Genomes Project: SHAPEIT software is used for phasing, IMPUTE2 software is used for imputation, and imputed sites with an imputation INFO value greater than 0.5 or 0.8 are retained.
In the step (3), performing sample quality control and site quality control on the training set and the test set may include at least one of: removing duplicate sites, removing sites whose alleles differ from those in the GWAS statistical data, removing sites with a MAF value less than 0.01, removing sites with a sample missing rate greater than 0.01, removing sites that fail the Hardy-Weinberg equilibrium test (threshold 0.000001), removing closely related samples, and removing samples duplicated between the training set and the test set.
In the step (3), the quality control of the analysis sites may include at least one of: removing duplicate sites, removing sites whose alleles differ from those in the GWAS statistical data, removing sites with a MAF value less than 0.01, removing sites with a sample missing rate greater than 0.01, and removing sites that fail the Hardy-Weinberg equilibrium test (threshold 0.000001).
In the step (3), the quality control of the samples in the training set and the test set may include at least one of: removing closely related samples, and removing samples duplicated between the training set and the test set.
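Two of the site-level filters in step (3), the MAF filter and the sample missing rate filter, can be illustrated as below. This is a sketch under stated assumptions: genotypes are coded as 0/1/2 allele counts with `None` for a missing call, the function names are hypothetical, and the Hardy-Weinberg and relatedness checks are left out.

```python
def site_metrics(genotypes):
    """Return (missing_rate, maf) for one site.

    genotypes: list of 0/1/2 allele counts, with None for a missing call
    (this coding is an assumption made for the sketch).
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return missing_rate, 0.0
    freq = sum(called) / (2 * len(called))   # frequency of the counted allele
    return missing_rate, min(freq, 1 - freq)


def passes_site_qc(genotypes, maf_min=0.01, missing_max=0.01):
    """Drop sites with MAF below maf_min or missing rate above missing_max."""
    missing_rate, maf = site_metrics(genotypes)
    return maf >= maf_min and missing_rate <= missing_max


common_site = [0, 1, 1, 2] * 50      # well-typed, common variant
rare_site = [0] * 99 + [1]           # counted-allele frequency 0.005
sparse_site = [0, 1, None, 2]        # 25% of calls missing
print(passes_site_qc(common_site), passes_site_qc(rare_site), passes_site_qc(sparse_site))
```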
In step (4), the different strategies are mainly two shrinkage strategies. Because the estimated effect value of a site is uncertain and not all sites influence the studied trait, using unadjusted effect estimates for all sites may yield a PRS with a higher standard error and poorer predictive performance. Two broad shrinkage strategies address this problem: (a) shrinking the effect estimates of all sites with standard or custom statistical techniques, and (b) using P values or other screening thresholds as criteria for including sites.
In the step (4), the different methods adopted may be pruning and thresholding, beta shrinkage, lassosum, LDpred and a deep neural network.
The pruning and thresholding method screens the site set based on the P value of the GWAS statistical data and the linkage disequilibrium r² of the training set. The P value threshold can be set to 1, 0.5, 0.05, 0.0005, 0.000005 or 0.00000005, and the r² threshold can be set to 0.2, 0.4, 0.6 or 0.8; the two screening criteria are combined freely to screen the SNP set. The PRS of each sample is calculated from the effect values of the screened site set and the genotypes of the sites using the --score option of the PLINK software. The formula PLINK uses to calculate the PRS is as follows:
$$\mathrm{PRS}_j = \frac{\sum_{i=1}^{N} S_i \times G_{ij}}{P \times M_j}$$
where $S_i$ is the effect value of the $i$-th SNP, $G_{ij}$ is the number of effect alleles of SNP $i$ in sample $j$, $P$ is the ploidy of the sample (typically 2 in humans), $N$ is the number of SNPs included in the calculation of the PRS, and $M_j$ is the number of non-missing SNPs in sample $j$. If SNP $i$ is missing in the sample, $G_{ij}$ is replaced by the population minor allele frequency (MAF) multiplied by the ploidy.
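The scoring rule can be made concrete with a small sketch. It assumes the formula has the form PRS_j = Σ_i (S_i × G_ij) / (P × M_j), which follows from the symbol definitions in the text, and it is not PLINK's actual code; the function name and argument layout are hypothetical.

```python
def plink_style_prs(effects, genotypes, mafs, ploidy=2):
    """PRS_j = sum_i(S_i * G_ij) / (P * M_j) for a single sample j.

    effects[i]   : S_i, effect value of SNP i
    genotypes[i] : G_ij, effect-allele count, or None when the call is missing
    mafs[i]      : population MAF of SNP i, used to fill in a missing G_ij
    """
    total = 0.0
    non_missing = 0
    for s, g, maf in zip(effects, genotypes, mafs):
        if g is None:
            g = maf * ploidy     # missing genotype replaced by MAF * ploidy
        else:
            non_missing += 1     # contributes to M_j
        total += s * g
    if non_missing == 0:
        raise ValueError("sample has no non-missing SNPs")
    return total / (ploidy * non_missing)


effects = [0.1, 0.2, -0.05]
mafs = [0.3, 0.2, 0.4]
complete = plink_style_prs(effects, [2, 1, 0], mafs)
with_gap = plink_style_prs(effects, [2, None, 0], mafs)
print(complete, with_gap)
```

In the second call the missing genotype contributes 0.2 × (0.2 × 2) to the numerator, while the denominator counts only the two non-missing SNPs.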
The beta shrinkage method first screens the site set based on the P value of the GWAS statistical data, where the P value threshold can be set to 1, 0.5, 0.05, 0.0005, 0.000005 or 0.00000005. The effect values of the screened site set are then adjusted over multiple iterations, based on a high-dimensional Bayesian linear model and linkage disequilibrium data of the 1000 Genomes Project, to obtain the optimal effect values, so that the classification model is optimal. The --score option of the PLINK software is used to calculate the PRS of the samples from the optimal effect values of the screened site set and the genotypes of the sites.
The LDpred method recalculates the effect values of the sites with a Bayesian method under a preset parameter ρ (in LDpred, the assumed fraction of causal variants), where ρ can be set to 1, 0.3, 0.1, 0.03, 0.01, 0.003 or 0.001; the PRS of the samples is then calculated from the recalculated effect values using the LDpred software's own algorithm.
The lassosum method constructs a penalty function based on a linear regression model to shrink the effect values of the sites, so that the effect values of some sites become 0 and an optimal site set is obtained; the PRS of the sample is calculated in combination with linkage disequilibrium data of the 1000 Genomes Project.
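How a penalty can drive some effect values to exactly 0 is easiest to see with soft thresholding, the closed-form solution of an L1-penalised regression when the SNPs are treated as uncorrelated. This is a deliberately simplified illustration and not the lassosum algorithm itself, which also uses linkage disequilibrium reference data; the penalty strength `lam` and the function names are hypothetical.

```python
def soft_threshold(beta, lam):
    """Shrink beta toward 0 by lam; any |beta| <= lam becomes exactly 0."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0


def shrink_effects(effects, lam):
    """Apply soft thresholding to a dict of rsid -> effect value."""
    return {rsid: soft_threshold(b, lam) for rsid, b in effects.items()}


raw = {"rs1": 0.30, "rs2": -0.02, "rs3": -0.15}
shrunk = shrink_effects(raw, lam=0.05)
selected = [rsid for rsid, b in shrunk.items() if b != 0.0]
print(shrunk, selected)
```

Sites whose shrunken effect is 0 drop out of the score, which is how the penalty doubles as site selection.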
The deep neural network method first screens the site set based on the P value of the GWAS statistical data, where the P value threshold can be set to 1, 0.5, 0.05, 0.0005, 0.000005 or 0.00000005. The genotypes of the screened sites are coded as 0/1/2, where 0 and 2 denote homozygous genotypes carrying 2 minor alleles and 2 major alleles, respectively, and 1 denotes the heterozygous genotype. The coded values of the screened site set are used as the input layer of the neural network, a leaky Rectified Linear Unit (Leaky ReLU) activation function is applied to all hidden layers, and a sigmoid activation function is used in the output layer. The output of the sigmoid function lies in the range 0-1, and this output value is used as the PRS of the sample.
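The network shape described above (0/1/2 inputs, leaky ReLU hidden layers, a sigmoid output used as the PRS) can be sketched as a forward pass. Everything beyond that shape is an assumption: the layer sizes, the leaky slope of 0.01, and the random untrained weights are placeholders, and training by backpropagation is omitted.

```python
import math
import random


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def sigmoid(x):
    # Numerically stable form that avoids overflow for large negative x.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W @ inputs + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random placeholder weights; a real model would learn these."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dnn_prs(genotype_codes, layers):
    """Leaky ReLU on every hidden layer, sigmoid on the single output unit."""
    h = [float(g) for g in genotype_codes]
    for weights, biases in layers[:-1]:
        h = dense(h, weights, biases, leaky_relu)
    out_w, out_b = layers[-1]
    return dense(h, out_w, out_b, sigmoid)[0]


rng = random.Random(0)
layers = [init_layer(6, 4, rng), init_layer(4, 3, rng), init_layer(3, 1, rng)]
score = dnn_prs([0, 1, 2, 2, 0, 1], layers)   # six screened sites, coded 0/1/2
print(score)
```

Because the last activation is a sigmoid, the score always falls strictly between 0 and 1 for inputs of moderate size, matching the range stated in the text.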
In the step (5), five-fold cross-validation may consist of randomly dividing the disease samples and the normal samples of the training set into 5 parts, constructing the logistic model with four parts as the training data, and testing the performance of the model with the remaining part as the validation data; the five-fold cross-validation is repeated 100 times, and the median or mean of the AUC is taken to evaluate the classification performance of the model in the training set.
In step (5), the disease state is coded as 0/1, where 0 denotes a normal control sample and 1 denotes a disease sample.
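The evaluation loop of step (5) can be sketched in plain Python. This is an illustrative version under several assumptions: the logistic model is fitted by simple gradient descent, the AUC is computed by pairwise comparison, the data are synthetic, and all function names are hypothetical. The demo uses 5 repeats to stay fast; the text describes 100.

```python
import math
import random
from statistics import median


def auc(scores, labels):
    """Chance that a random disease sample outscores a random control (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_logistic(x, y, lr=0.1, epochs=200):
    """Single-predictor logistic regression fitted by batch gradient descent."""
    w, b, n = 0.0, 0.0, len(x)
    for _ in range(epochs):
        gw = gb = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(w * xi + b)))
            gw += (p - yi) * xi
            gb += p - yi
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


def stratified_folds(labels, k, rng):
    """Split disease and control indices separately into k parts."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def repeated_cv_auc(prs, labels, k=5, repeats=100, seed=0):
    """Median held-out AUC over `repeats` rounds of k-fold cross-validation."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(repeats):
        for held_out in stratified_folds(labels, k, rng):
            held = set(held_out)
            train = [i for i in range(len(prs)) if i not in held]
            w, b = fit_logistic([prs[i] for i in train], [labels[i] for i in train])
            preds = [w * prs[i] + b for i in held_out]
            aucs.append(auc(preds, [labels[i] for i in held_out]))
    return median(aucs)


# Synthetic PRS values (not real data): controls ~ N(0, 1), disease ~ N(2, 1).
data_rng = random.Random(1)
labels = [0] * 50 + [1] * 50
prs = ([data_rng.gauss(0.0, 1.0) for _ in range(50)]
       + [data_rng.gauss(2.0, 1.0) for _ in range(50)])
cv_auc = repeated_cv_auc(prs, labels, repeats=5)
print(cv_auc)
```

Splitting disease and control indices separately keeps both classes in every fold, which the AUC calculation needs.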
In the step (6), the model with the optimal classification performance is the one whose training set AUC is as high as possible and whose test set AUC is as close as possible to the training set AUC. The AUC values of the disease classification model in the training set and the test set are typically greater than 0.6, and the PRS differs significantly between disease samples and normal samples.
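One way to turn the selection rule of step (6) into code is shown below. The text gives no numeric tolerance for "as close as possible", so the `max_gap` parameter is an assumption, as are the function name and the candidate AUC values, which are made up purely for illustration.

```python
def select_model(candidates, min_auc=0.6, max_gap=0.02):
    """Pick the candidate with the highest training AUC among those whose
    training and test AUC both exceed min_auc and differ by at most max_gap.

    candidates: dict mapping a model name to (train_auc, test_auc).
    max_gap is a hypothetical tolerance, not a value from the text.
    """
    eligible = {
        name: (train_auc, test_auc)
        for name, (train_auc, test_auc) in candidates.items()
        if train_auc > min_auc and test_auc > min_auc
        and abs(train_auc - test_auc) <= max_gap
    }
    if not eligible:
        return None
    return max(eligible, key=lambda name: eligible[name][0])


# Made-up AUC values for three hypothetical parameter settings.
candidates = {
    "model_a": (0.70, 0.61),   # high training AUC but a large train/test gap
    "model_b": (0.63, 0.62),   # slightly lower, but consistent across sets
    "model_c": (0.58, 0.58),   # consistent but below the 0.6 floor
}
best = select_model(candidates)
print(best)
```

Filtering on the gap first and ranking on training AUC second favours models that generalise over models that merely fit the training set.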
A disease classification model constructed by any of the above methods also falls within the protection scope of the invention.
The invention also claims the use of any of the above disease classification models, which may be at least one of A1)-A4):
A1) assessing the risk of the target disease of the subject;
A2) preparing a product for assessing the risk of a target disease in a subject;
A3) preventing a target disease;
A4) preparing a product for preventing a target disease.
The invention discloses a method for constructing a disease classification model based on multi-gene risk scoring, which can help to discover and prevent target diseases in early clinical stage and has important application value.
Drawings
FIG. 1 is a technical roadmap of the method for constructing a disease classification model by multi-gene risk scoring developed in example 1.
FIG. 2 is a PRS profile of the training set and the test set in example 2.
Detailed Description
The present invention is described in further detail below with reference to specific embodiments, which are given for the purpose of illustration only and are not intended to limit the scope of the invention. The examples provided below serve as a guide for further modifications by a person skilled in the art and do not constitute a limitation of the invention in any way.
The experimental procedures in the following examples, unless otherwise indicated, are conventional and are carried out according to the techniques or conditions described in the literature in the field or according to the instructions of the products. Materials, reagents and the like used in the following examples are commercially available unless otherwise specified.
Example 1. Development of a method for constructing a disease classification model based on multi-gene risk scoring
Based on GWAS statistical data and the genome-wide site typing data of a large number of training set and test set samples, the invention provides a method for constructing a disease classification model based on multi-gene risk scoring, which specifically comprises the following steps:
1. Genome-wide association study (GWAS) statistical data of the target disease are downloaded from a public database or obtained by analyzing a large amount of data. The GWAS statistical data comprise at least, for each site, the rs number, the major allele, the minor allele, the effect value, and the significance P value of the association with the disease; site quality control is performed on the GWAS statistical data.
In step (1), the site effect value is an OR or Beta value. Site quality control on the GWAS statistical data comprises removing duplicate sites, removing ambiguous sites, and retaining sites with a minor allele frequency (MAF) greater than 0.01 and an imputation INFO value greater than 0.5 or 0.8.
An ambiguous site is one whose reference and variant bases are complementary (A/T or C/G), so that the strand of the alleles cannot be determined.
2. Typing information (i.e., site genotypes) for the genome-wide sites of the training set and the test set is obtained based on SNP chip or sequencing (e.g., high-throughput sequencing) data. Both the training set and the test set contain disease and control samples.
In step (2), the disease type in the test set is the same as in the training set, but the test set and the training set must not share any disease or normal samples.
The genome-wide sites and their genotype information need to be obtained from the chip and sequencing data by bioinformatics analysis.
3. The sample site information obtained from the SNP chip or sequencing data is imputed, using the downloaded site information of the 1000 Genomes Project as reference, to obtain the genotype information of the analysis sites; sample quality control and site quality control are performed on the training set and the test set respectively.
In step (3), imputation is based on the genome-wide site genotypes and the data of the 1000 Genomes Project: SHAPEIT software is used for phasing, IMPUTE2 software is used for imputation, and imputed sites with an imputation INFO value greater than 0.5 or 0.8 are retained; the imputed sites and the genotyped genome-wide sites are combined to obtain the analysis sites.
Quality control of the analysis sites comprises removing duplicate sites, removing sites whose alleles differ from those in the GWAS statistical data, removing sites with a MAF value less than 0.01, removing sites with a sample missing rate greater than 0.01, and removing sites that fail the Hardy-Weinberg equilibrium test (threshold 0.000001).
Quality control of the training set and test set samples comprises removing closely related samples and removing samples duplicated between the training set and the test set.
4. The PRS of the training set samples is calculated with different methods and different parameters for each method, based on the sites of the training set samples after quality control and the GWAS statistical data after quality control.
In step (4), the different strategies are mainly two shrinkage strategies. Because the estimated effect value of a site is uncertain and not all sites influence the studied trait, using unadjusted effect estimates for all sites may yield a PRS with a higher standard error and poorer predictive performance. Two broad shrinkage strategies address this problem: (a) shrinking the effect estimates of all sites with standard or custom statistical techniques, and (b) using P values or other screening thresholds as criteria for including sites. Five methods related to these two strategies are used to calculate the PRS: pruning and thresholding, beta shrinkage, lassosum, LDpred and a deep neural network.
The pruning and thresholding method screens the site set based on the P value of the GWAS statistical data and the linkage disequilibrium r² of the training set. The P value threshold can be set to 1, 0.5, 0.05, 0.0005, 0.000005 or 0.00000005, and the r² threshold can be set to 0.2, 0.4, 0.6 or 0.8; the two screening criteria are combined freely to screen the SNP set. The PRS of each sample is calculated from the effect values of the screened site set and the genotypes of the sites using the --score option of the PLINK software. The formula PLINK uses to calculate the PRS is as follows:
$$\mathrm{PRS}_j = \frac{\sum_{i=1}^{N} S_i \times G_{ij}}{P \times M_j}$$
where $S_i$ is the effect value of the $i$-th SNP, $G_{ij}$ is the number of effect alleles of SNP $i$ in sample $j$, $P$ is the ploidy of the sample (typically 2 in humans), $N$ is the number of SNPs included in the calculation of the PRS, and $M_j$ is the number of non-missing SNPs in sample $j$. If SNP $i$ is missing in the sample, $G_{ij}$ is replaced by the population minor allele frequency (MAF) multiplied by the ploidy.
The beta shrinkage method first screens the site set based on the P value of the GWAS statistical data, where the P value threshold can be set to 1, 0.5, 0.05, 0.0005, 0.000005 or 0.00000005. The effect values of the screened site set are then adjusted over multiple iterations, based on a high-dimensional Bayesian linear model and linkage disequilibrium data of the 1000 Genomes Project, to obtain the optimal effect values, so that the classification model is optimal. The --score option of the PLINK software is used to calculate the PRS of the samples from the optimal effect values of the screened site set and the genotypes of the sites.
The LDpred method recalculates the effect values of the sites with a Bayesian method under a preset parameter ρ (in LDpred, the assumed fraction of causal variants), where ρ can be set to 1, 0.3, 0.1, 0.03, 0.01, 0.003 or 0.001; the PRS of the sample is then calculated from the recalculated effect values using the LDpred software's own algorithm.
The lassosum method constructs a penalty function based on a linear regression model to shrink the effect values of the sites, so that the effect values of some sites become 0 and an optimal site set is obtained; the PRS of the sample is calculated in combination with linkage disequilibrium data of the 1000 Genomes Project.
The deep neural network method first screens the site set based on the P value of the GWAS statistical data, where the P value threshold can be set to 1, 0.5, 0.05, 0.0005, 0.000005 or 0.00000005. The genotypes of the screened sites are coded as 0/1/2, where 0 and 2 denote homozygous genotypes carrying 2 minor alleles and 2 major alleles, respectively, and 1 denotes the heterozygous genotype. The coded values of the screened site set are used as the input layer of the neural network, a leaky Rectified Linear Unit (Leaky ReLU) activation function is applied to all hidden layers, and a sigmoid activation function is used in the output layer. The output of the sigmoid function lies in the range 0-1, and this output value is used as the PRS of the sample.
5. Based on the training set data, a logistic model is constructed with the PRS as the independent variable and the disease trait as the dependent variable, and the classification performance of the model in the training set is evaluated by the area under the ROC curve (AUC) obtained from five-fold cross-validation.
A separate logistic model is constructed for the PRS calculated with each parameter setting of each method.
In step (5), the disease state is coded as 0/1, where 0 denotes a normal control sample and 1 denotes a disease sample. In five-fold cross-validation, the disease samples and the normal samples of the training set are randomly divided into 5 parts; the logistic model is constructed with four parts as the training data, and the performance of the model is checked with the remaining part as the validation data. The five-fold cross-validation is repeated 100 times, and the median or mean of the AUC is taken to evaluate the classification performance of the model in the training set.
6. The PRS calculation method of the training set is applied to the test set data, the PRS of the test set samples is calculated, and the classification performance (AUC) of the model in the test set is obtained; the model with the best training set evaluation performance and test set classification performance is selected as the optimal disease classification model.
In step (6), optimal performance means that the training set AUC is as high as possible and the test set AUC is as close as possible to the training set AUC. The AUC values of the optimal disease classification model in the training set and the test set are typically greater than 0.6, and the PRS differs significantly between disease samples and normal samples.
Example 2. Constructing a coronary heart disease classification model with the method established in Example 1
1. GWAS statistical data for coronary heart disease, covering 185,000 case and control samples, were downloaded from the GWAS Catalog database, totaling 9.45 million sites.
2. Genotype data for 738,000 sites were obtained from Illumina ASA chip experiments and bioinformatics analysis for 1800 training set samples (including 900 normal and 900 controls) and 1500 test set samples (including 760 normal and 740 controls).
3. Quality control was performed on the training set and test set samples. After quality control, based on the haplotype composition of the 1000 Genomes Project data, the haplotype of each training set and test set sample was inferred from the non-missing sites surrounding each missing site, and the missing sites were then imputed from the genotypes of that haplotype, with the INFO value controlled to be greater than 0.5, yielding genotype data for 2.12 million sites in total.
Quality control was then performed separately on the imputed genotype data of the training set samples and the test set samples, removing sites not contained in the GWAS statistical data and removing closely related samples.
Quality control was performed on the downloaded GWAS statistical data, removing sites with a MAF less than 0.01, sites with an imputation INFO value less than 0.8, and ambiguous SNP sites, and keeping only sites contained in both the training set and the test set; 1.3 million sites finally remained.
4. The 24 parameter settings of the pruning and thresholding method, the 6 settings of the beta shrinkage method, the 7 settings each of the lassosum and LDpred methods, and the 6 settings of the deep neural network method were each applied to the training set samples, and the PRS of every training set sample was calculated. The PRS algorithms of the training set were then applied to the test set data to calculate the PRS of every test set sample.
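The parameter counts in this step follow from the threshold lists given in Example 1, as the short enumeration below shows. The dictionary keys are hypothetical labels, and lassosum is left out because the text does not list its seven settings.

```python
from itertools import product

# Threshold lists as given in the description.
P_THRESHOLDS = [1, 0.5, 0.05, 0.0005, 0.000005, 0.00000005]
R2_THRESHOLDS = [0.2, 0.4, 0.6, 0.8]
RHO_VALUES = [1, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001]

settings = {
    # every P value threshold paired with every r^2 threshold
    "pruning_and_thresholding": list(product(P_THRESHOLDS, R2_THRESHOLDS)),
    "beta_shrinkage": P_THRESHOLDS,
    "ldpred": RHO_VALUES,
    "deep_neural_network": P_THRESHOLDS,
}
counts = {method: len(values) for method, values in settings.items()}
print(counts)
```

The pair `(0.0005, 0.2)` is one of the 24 pruning and thresholding combinations in this grid.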
The PRS profiles of the training and test sets are shown in fig. 2.
5. And (3) constructing a logistic model by taking the PRS of the samples of the training set as independent variables and the disease states of the samples as dependent variables, and calculating the AUC value of the training set model by 100 times of quintupling cross validation. AUC values were calculated for all test sets simultaneously.
6. Among all methods, the best performer was the pruning and thresholding algorithm with a p-value threshold of 0.0005 and an r² value of 0.2. This model includes 455 SNP sites in total, with a training set AUC of 0.6211 and a test set AUC of 0.6205.
It can be seen that the coronary heart disease classification model built with the pruning and thresholding algorithm at a p-value threshold of 0.0005 and an r² value of 0.2 can be applied to calculate the risk of coronary heart disease in the general population, and has important application value for early clinical detection and early prevention.
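Choosing among the candidate models amounts to comparing their AUC values. The text does not say how training and test AUC are combined, so the sketch below assumes one simple rule: take the highest cross-validated training AUC and break ties by test AUC. The function name, the record layout and the AUC values in the example are placeholders for illustration, not results from this document.

```python
def select_best(results):
    """Return the candidate with the highest training AUC.

    Ties are broken by test AUC. This ranking rule is an assumption of
    the sketch; the source does not state how the two values are combined.
    """
    return max(results, key=lambda r: (r["train_auc"], r["test_auc"]))


# Placeholder candidates with made-up AUC values, for illustration only.
candidates = [
    {"method": "method_a", "params": {"p": 0.05, "r2": 0.5},
     "train_auc": 0.58, "test_auc": 0.57},
    {"method": "method_b", "params": {"p": 0.001, "r2": 0.2},
     "train_auc": 0.61, "test_auc": 0.60},
    {"method": "method_c", "params": {"shrink": 0.9},
     "train_auc": 0.61, "test_auc": 0.59},
]
best = select_best(candidates)
```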
The present invention has been described in detail above. It will be apparent to those skilled in the art that the invention can be practiced over a wide range of equivalent parameters, concentrations, and conditions without departing from the spirit and scope of the invention and without undue experimentation. While the invention has been described with reference to specific embodiments, it will be appreciated that the invention can be further modified. In general, this application is intended to cover any variations, uses, or adaptations of the invention that follow its general principles, including departures from the present disclosure that come within known or customary practice in the art to which the invention pertains. Some of the essential features may be applied within the scope of the appended claims.

Claims (10)

1. A method for constructing a disease classification model based on polygenic risk scores (PRS), comprising the following steps:
(1) obtaining GWAS summary statistics for the target disease; the GWAS summary statistics comprise, for each site, the rs number, the major allele, the minor allele, the effect value, and the significance P value of association with the disease; site quality control is performed on the GWAS summary statistics;
(2) acquiring genome-wide genotype information for a training set and a test set based on SNP chip or sequencing data; the training set and the test set both contain disease and control samples;
(3) imputing sites, based on the site information of the 1000 Genomes Project, for the sample site information obtained from the SNP chip or sequencing data, to obtain genotype information for the analysis sites; performing sample quality control and site quality control on the training set and the test set respectively;
(4) calculating the PRS of the training set samples using different parameters of different methods, based on the post-quality-control sites of the training set samples and the post-quality-control GWAS summary statistics;
(5) based on the training set data, constructing a logistic model with PRS as the independent variable and disease status as the dependent variable, and evaluating the classification performance of the model in the training set by the area under the ROC curve (AUC) obtained from five-fold cross-validation; a separate logistic model is constructed for the PRS calculated with each parameter setting of each method;
(6) applying the PRS calculation methods of the training set to the test set data, calculating the PRS of the test set samples, and obtaining the classification performance (AUC) of the model on the test set; selecting the model with the best training set performance and test set classification performance, this model being the disease classification model.
2. The method of claim 1, wherein: in step (1), the GWAS summary statistics of the target disease are downloaded from a database or obtained by analyzing a large amount of data.
3. The method of claim 1, wherein: in step (1),
the site effect value is an OR or Beta value;
site quality control of the GWAS summary statistics comprises removing duplicate sites, removing ambiguous sites, and retaining sites with minor allele frequency (MAF) greater than 0.01 and imputation INFO value greater than 0.5 or 0.8;
the ambiguous site refers to the reference base and the variant base being both purine or pyrimidine.
4. The method of claim 1, wherein: in step (2), the sites and their genotype information across the whole genome are obtained from the SNP chip or sequencing data by bioinformatics analysis; the disease type in the test set is the same as in the training set, and the test set and the training set must not share any disease or normal samples.
5. The method of claim 1, wherein: in step (3), obtaining the genotype information of the analysis sites comprises sequentially performing phasing, imputation, and retention of imputed sites, based on the genome-wide site genotypes and the 1000 Genomes Project data; the imputed sites are combined with the genome-wide sites to obtain the analysis sites.
6. The method of claim 1, wherein: in step (3), performing sample quality control and site quality control on the training set and the test set respectively comprises at least one of: removing duplicate sites, removing sites whose alleles differ from those in the GWAS summary statistics, removing sites with MAF less than 0.01, removing sites with a sample missing rate greater than 0.01, removing sites with a Hardy-Weinberg value greater than 0.000001, removing closely related samples, and removing duplicate samples in the training set and the test set.
7. The method of claim 1, wherein: in step (4), the different methods are pruning and thresholding, beta shrinkage, lassosum, ldpred, and a deep neural network.
8. The method of claim 1, wherein: in step (5), five-fold cross-validation consists of randomly dividing the disease samples and the normal samples of the training set into 5 parts, constructing a logistic model with four parts as the training data, and testing the performance of the model with the remaining part as the test data; the five-fold cross-validation is repeated 100 times, and the median or mean AUC is used to evaluate the classification performance of the model in the training set.
9. A disease classification model constructed using the method of any one of claims 1 to 8.
10. Use of the disease classification model of claim 9 for at least one of A1) to A4):
A1) assessing the risk of the target disease of the subject;
A2) preparing a product for assessing the risk of a target disease in a subject;
A3) preventing a target disease;
A4) preparing a product for preventing a target disease.
CN202110355345.9A 2021-04-01 2021-04-01 Method for constructing disease classification model based on multi-gene risk scoring Withdrawn CN113066586A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110355345.9A CN113066586A (en) 2021-04-01 2021-04-01 Method for constructing disease classification model based on multi-gene risk scoring

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110355345.9A CN113066586A (en) 2021-04-01 2021-04-01 Method for constructing disease classification model based on multi-gene risk scoring

Publications (1)

Publication Number Publication Date
CN113066586A true CN113066586A (en) 2021-07-02

Family

ID=76565341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110355345.9A Withdrawn CN113066586A (en) 2021-04-01 2021-04-01 Method for constructing disease classification model based on multi-gene risk scoring

Country Status (1)

Country Link
CN (1) CN113066586A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190017119A1 (en) * 2017-07-12 2019-01-17 The General Hospital Corporation Genetic Risk Predictor
US20190330698A1 (en) * 2017-07-12 2019-10-31 The General Hospital Corporation Diabetes polygenic risk score
US20190345566A1 (en) * 2017-07-12 2019-11-14 The General Hospital Corporation Cancer polygenic risk score
CN111128298A (en) * 2019-12-24 2020-05-08 大连海事大学 A method and system for obtaining a polygenic risk score based on a deep learning model
CN111161799A (en) * 2019-12-24 2020-05-15 大连海事大学 A method and system for obtaining polygenic risk score based on multi-omics data
US20210065846A1 (en) * 2019-08-27 2021-03-04 The Board Of Trustees Of The Leland Stanford Junior University Assessment of Polygenic Trait Risk via Trait Components and Applications Thereof
CN112562858A (en) * 2020-12-31 2021-03-26 申友基因组研究院(南京)有限公司 Comprehensive scoring system related to lung cancer risk
CN112553327A (en) * 2020-12-30 2021-03-26 中日友好医院(中日友好临床医学研究所) Construction method of pulmonary thromboembolism risk prediction model based on single nucleotide polymorphism, SNP site combination and application

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
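彭佳丽; 刘春容; 李旭; 易芳; 李佳圆: "Exploring risk factors for breast cancer among women in western China using XGBoost and random forest", Modern Preventive Medicine (现代预防医学), no. 01, pages 1 - 4 *
马红霞: "Prospects for the application of polygenic risk scores in predicting malignant tumor risk", Journal of Nanjing Medical University (Natural Sciences) (南京医科大学学报(自然科学版)), vol. 40, no. 4, pages 467 - 469 *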

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113593635A (en) * 2021-08-06 2021-11-02 上海市农业科学院 Corn phenotype prediction method and system
CN113593630A (en) * 2021-08-23 2021-11-02 北京果壳生物科技有限公司 Family coronary heart disease risk assessment and risk factor identification system
CN113628749A (en) * 2021-08-23 2021-11-09 北京果壳生物科技有限公司 Method and system for predicting disease risk based on machine learning model
CN113838525A (en) * 2021-09-29 2021-12-24 中山大学 A method and system for predicting disease-causing gene pairs
CN113838525B (en) * 2021-09-29 2023-09-29 中山大学 A method and system for predicting disease-causing gene pairs
CN114841280A (en) * 2022-05-20 2022-08-02 北京安智因生物技术有限公司 Prediction classification method, system, medium, equipment and terminal for complex diseases
CN116052757A (en) * 2022-12-27 2023-05-02 广州市金域转化医学研究院有限公司 Adenovirus susceptibility risk assessment model and biomarkers
CN116913376A (en) * 2023-07-17 2023-10-20 首都医科大学附属北京朝阳医院 Method and system for determining genetic susceptibility sites for chronic thromboembolic pulmonary hypertension
CN119296791A (en) * 2024-12-10 2025-01-10 神州医疗科技股份有限公司 Disease prediction method and system integrating image recognition, large model and PRS

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210702