
AU2023261122A1 - Construction method for model for analyzing variation detection result - Google Patents

Info

Publication number
AU2023261122A1
Authority
AU
Australia
Prior art keywords
variation
data set
positive
value
sequencing data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
AU2023261122A
Inventor
Zhiyu Peng
Jun Sun
Fei Tang
Zhonghua Wang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bgi Tianjin
Tianjin Medical Laboratory Bgi
Original Assignee
Bgi Tianjin
Tianjin Medical Laboratory Bgi
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bgi Tianjin, Tianjin Medical Laboratory Bgi filed Critical Bgi Tianjin
Publication of AU2023261122A1 publication Critical patent/AU2023261122A1/en
Pending legal-status Critical Current

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844 Nucleic acid amplification reactions
    • C12Q1/6858 Allele-specific amplification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Software Systems (AREA)
  • Organic Chemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Wood Science & Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Zoology (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Biochemistry (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention provides a construction method for a model for analyzing a variation detection result. The method comprises: obtaining a positive sequencing data set specified for a positive variation site and a negative sequencing data set specified for a negative variation site; respectively extracting features of the variation sites from the positive sequencing data set and the negative sequencing data set; and constructing a model by using the feature result obtained in the previous step, wherein the features comprise at least one of the following: an AD0 value, an AD1 value, an AF0 value, an AF1 value, a GT value, a DP value, a GQ value, an MQ value, and a QUAL value.

Description

CONSTRUCTION METHOD FOR MODEL FOR ANALYZING VARIATION DETECTION RESULT
FIELD
[0001] The present application relates to the field of biology. In particular, the present
application relates to a method for constructing a model for analyzing variation detection result.
BACKGROUND
[0002] Clinical next generation sequencing (cNGS) is widely used to determine the molecular diagnosis of patients with genetic diseases. However, known NGS procedures suffer from random and systematic errors in the steps of sequencing, alignment, and variant calling. As the reported variations can affect patient care and treatment, the American College of Medical Genetics and Genomics (ACMG) and the College of American Pathologists (CAP) recommend orthogonal validation of the reported variations to reduce the risk of false positive results. Currently, Sanger sequencing is the main technique for molecular diagnosis of genetic diseases. However, as evidenced by the growth of public databases such as ClinVar and OMIM, the total number of candidate variations for clinical reporting is steadily increasing, which multiplies the cost and turnover time of testing. Thus, it is impractical to orthogonally validate every candidate variation. Therefore, it is becoming increasingly urgent to reduce the need for orthogonal testing by identifying false positive variations in cNGS data with machine learning models trained on a large amount of known data.
[0003] At present, studies of false positive variations face the following problems: orthogonal experiments such as Sanger sequencing may substantially increase the cost and turnover time; most of the features used in existing models are Boolean flag values, which may lose information compared with the original quantitative indicators; the training sets of existing models contain relatively few false positive variant calls, which may widen the confidence intervals of some false positive capture rates (especially for SNV); and, owing to cost, existing models are built on insufficient clinical data, so that they are either deliberately complex to cover multiple scenarios but with insufficient confidence, or sufficiently confident but at high risk of overfitting and unsuited to multiple scenarios.
[0004] Therefore, there is an urgent need to improve the current methods for predicting false positive variations.
SUMMARY
[0005] The present disclosure aims to solve, at least to some extent, at least one of the technical problems existing in the prior art.
[0006] In an aspect, the present disclosure provides a method for constructing a model for analyzing variation detection result. According to an embodiment of the present disclosure, the
method includes: acquiring a positive sequencing data set of identified positive variation sites
and a negative sequencing data set of identified negative variation sites; extracting features of variation sites from the positive sequencing data set and the negative sequencing data set; and
constructing a model based on features and results acquired in the previous step. The features include at least one of: AD0 value representing a depth of the first allele in the variation site
genotype; AD1 value representing a depth of the second allele in the variation site genotype;
AF0 value representing a frequency of the first allele in the variation site genotype; AF1 value representing a frequency of the second allele in the variation site genotype; GT value
representing a single numerical value (specifically, 0, 1, 2, or 3); DP value representing a
sequencing depth value; GQ value representing a quality value of variation site genotype; MQ value representing a quality of mapping of variation sites; and QUAL value representing a
quality value of probability of the variation sites.
[0007] Several dozens of result parameters can be generated by the variation detection and analysis software. The Applicant compared and analyzed these result parameters, and screened
out a group of result parameters. By using these result parameters as features a machine learning model is constructed for the data sets of the identified positive and negative variation sites. With
the obtained model, it can be accurately predicted whether the positive variation data are false positive, and the variation site genotype can be learned, thereby quickly and precisely locating the possible variations and reducing the cost and turnover time of orthogonal experiment.
[0008] In another aspect, the present disclosure provides a method for analyzing variation detection result. According to an embodiment of the present disclosure, the method includes: analyzing a candidate positive variation data set with a machine learning model obtained using the above-mentioned method for constructing the model for analyzing variation detection result, to predict whether positive variation data in the candidate positive variation data set are false positive and/or to predict the variation site genotype. In this way, using the method according to the present disclosure, it can be accurately predicted whether the positive variation data are false positives, while the variation genotype can be determined, thereby quickly and precisely locating the possible variations and reducing the cost and turnover time of orthogonal experiment.
[0009] In yet another aspect, the present disclosure provides an apparatus for constructing a model for analyzing variation detection result. According to an embodiment of the present
disclosure, the apparatus includes: an acquisition module configured to acquire a positive sequencing data set of identified positive variation sites and a negative sequencing data set of
identified negative variation sites; an extraction module configured to extract features of
variation sites from the positive sequencing data set and the negative sequencing data set; and a construction module configured to construct a model based on features and results acquired
by the extraction module. The features include at least one of: AD0 value representing a depth of the first allele in the variation site genotype; AD1 value representing a depth of the second allele in the variation site genotype; AF0 value representing a frequency of the first allele in the
variation site genotype; AF1 value representing a frequency of the second allele in the variation site genotype; GT value representing a single numerical value; DP value representing
a sequencing depth value; GQ value representing a quality value of variation site genotype; MQ
value representing a quality of mapping of variation sites; and QUAL value representing a quality value of probability of the variation sites. Thus, by using the model obtained by the
apparatus according to the present disclosure, it can be accurately predicted whether the positive
variation data are false positive, while the variation genotype can be determined, thereby quickly and precisely locating the possible variations and reducing the cost and turnover time of orthogonal experiment.
[0010] In yet another aspect, the present disclosure provides an executable storage medium. According to an embodiment of the present disclosure, the storage medium has computer
program instructions stored thereon. The computer program instructions, when run on a
processor, cause a processor to execute the method for analyzing variation detection result as described above. Thus, by executing the storage medium according to the present disclosure, it can be accurately predicted whether the positive variation data are false positive, while the
variation genotype can be determined, thereby quickly and precisely locating the possible
variations and reducing the cost and turnover time of orthogonal experiment.
[0011] In yet another aspect, the present disclosure provides an electronic device.
According to an embodiment of the present disclosure, the electronic device includes the above-mentioned executable storage medium and a processor configured to execute a computer program to implement the above-mentioned method for analyzing variation detection result.
Thus, by implementing the electronic device of the present disclosure, it can be accurately
predicted whether the positive variation data are false positive, while the variation genotype can be determined, thereby quickly and precisely locating the possible variations and reducing
the cost and turnover time of orthogonal experiment.
[0012] Additional aspects and advantages of the present disclosure will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be
learned by practice of the present disclosure.
DETAILED DESCRIPTION
[0013] Hereinafter, embodiments of the present disclosure are described in detail. The embodiments described below are illustrative only and are not to be construed as limitations of the present disclosure. The specific techniques or conditions, when not specified in the
examples, are performed according to techniques or conditions described in the literature in the
related art or according to the product instructions. The used reagents or instruments without indicating the manufacturer are conventional products that are commercially available.
Method for constructing a model for analyzing variation detection result
[0014] In an aspect, the present disclosure provides a method for constructing a model for analyzing variation detection result. According to an embodiment of the present disclosure, the method includes: acquiring a positive sequencing data set of identified positive variation sites
and a negative sequencing data set of identified negative variation sites; extracting features of
variation sites from the positive sequencing data set and the negative sequencing data set; and constructing a model based on features and results acquired in the previous step. The features include at least one of AD0 value, AD1 value, AF0 value, AF1 value, GT value, DP value, GQ
value, MQ value, and QUAL value.
[0015] Through a large number of experiments, the Applicant screened out the nine result parameters mentioned above, all of which are result parameters of the GATK software. Their
specific meanings are given in the following table. These result parameters are used as
features to perform machine learning on the positive sequencing data set of the identified positive variation sites and the negative sequencing data set of the identified negative variation
sites to obtain a prediction model. Thus, by using the obtained model, it can be accurately
predicted whether the positive variation data are false positives, thereby quickly and precisely locating the possible variations and reducing the cost and turnover time of orthogonal
experiment.
[0016] [Table 1] Meanings of features
Features Meanings
AD0 Allele depth of the first allele in the variation site genotype (e.g., if GT = 0/1, it represents a depth of reference allele)
AD1 Allele depth of the second allele in the variation site genotype (e.g., if GT = 0/1, it represents a depth of the first candidate allele)
AF0 Allele frequency of the first allele in the variation site genotype (e.g., GT = 0/1 and AD = (10, 30), then AF0 value is 0.25)
AF1 Allele frequency of the second allele in the variation site genotype (e.g., GT = 0/1 and AD = (10, 30), then AF1 value is 0.75)
GT Genotype field (GT) is converted to a single numerical value, where GT value can be 0, 1, 2, or 3 (e.g., GT = 0/1, GT value is 0)
DP Sequencing depth value
GQ Quality value of variation site genotype
MQ Quality of mapping of variation sites
QUAL Quality value of the probability of variation sites
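As a rough illustration of how the features in Table 1 could be derived from a single VCF record, the following Python sketch parses one data line and computes AD0, AD1, AF0, and AF1. The function name extract_features, the example record, and the assumption of a biallelic, single-sample record carrying GT:AD:DP:GQ in the sample column and MQ in the INFO column are illustrative and are not taken from the disclosure. The numerical encoding of GT is not described in enough detail here to reproduce, so the raw GT string is kept unchanged.

```python
from typing import Dict, Union

FeatureValue = Union[float, int, str]


def extract_features(vcf_line: str) -> Dict[str, FeatureValue]:
    """Extract Table 1 style features from one single-sample VCF data line.

    Assumption (hypothetical layout): biallelic site, FORMAT contains
    GT, AD, DP and GQ, and INFO contains MQ.
    """
    cols = vcf_line.rstrip("\n").split("\t")
    qual = float(cols[5])  # column 6 of a VCF data line is QUAL
    # INFO is a ';'-separated list of KEY=VALUE pairs (flags have no '=').
    info = dict(kv.split("=", 1) for kv in cols[7].split(";") if "=" in kv)
    # FORMAT names the ':'-separated fields of the sample column.
    sample = dict(zip(cols[8].split(":"), cols[9].split(":")))
    # Only the first two allele depths are used (biallelic assumption).
    ad0, ad1 = (int(x) for x in sample["AD"].split(",")[:2])
    total = ad0 + ad1
    return {
        "AD0": ad0,
        "AD1": ad1,
        "AF0": ad0 / total if total else 0.0,
        "AF1": ad1 / total if total else 0.0,
        "GT": sample["GT"],  # raw string; numerical encoding not reproduced
        "DP": int(sample["DP"]),
        "GQ": int(sample["GQ"]),
        "MQ": float(info["MQ"]),
        "QUAL": qual,
    }


# Made-up example record, chosen to match the AD = (10, 30) case in Table 1.
example = (
    "chr1\t12345\t.\tA\tG\t812.77\tPASS\tMQ=60.00;DP=40\t"
    "GT:AD:DP:GQ\t0/1:10,30:40:99"
)
features = extract_features(example)
print(features["AF0"], features["AF1"])  # 0.25 0.75
```

For the record shown, AD = (10, 30) reproduces the Table 1 example: AF0 is 0.25 and AF1 is 0.75.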
[0017] According to an embodiment of the present disclosure, the positive sequencing data set of the identified positive variation sites and the negative sequencing data set of the identified
negative variation sites are acquired by: acquiring a sequencing data set; aligning, by using GATK software, the sequencing data set with reference data to acquire a candidate positive
variation data set; and performing analysis processing on the candidate positive variation data
set to acquire the positive sequencing data set of the identified positive variation sites and the negative sequencing data set of the identified negative variation sites.
[0018] Preferably, the clinical gene sequencing data are first acquired, and the sequencing data are compared with the reference data (including operations such as alignment, variation detection, annotation, and filtration); the variations are identified with GATK to acquire
candidate positive variation data; and a VCF file is output. By further performing the analysis processing on the candidate positive variation data, it can be clearly determined whether the data are
true positive or false positive. The data are divided into the positive sequencing data set for the
positive variation sites and the negative sequencing data set for the negative variation sites.
[0019] According to an embodiment of the present disclosure, the reference data is selected
from human genome hg19.
[0020] According to an embodiment of the present disclosure, the analysis processing includes: performing standard clinical interpretation on the candidate positive variation data set
to acquire a data set of potentially pathogenic variations; and performing orthogonal test
analysis on the data set of potentially pathogenic variations to acquire the positive sequencing data set of the identified positive variation sites and the negative sequencing data set of the
identified negative variation sites. The positive sequencing data set includes an SNV variation
type data set and an INDEL variation type data set. Each of the SNV variation type data set and the INDEL variation type data set includes a homozygous genotype data set and a heterozygous
genotype data set.
[0021] The term "standard clinical interpretation" refers to an interpretation of the pathogenicity of clinical variants with reference to the ACMG guideline, edition 2015.
[0022] Based on standard clinical interpretation of the candidate positive variation data acquired by GATK identification analysis, the potentially pathogenic variation data can be acquired, and then the accuracy of variation can be verified through orthogonal test on these data, thereby acquiring the positive sequencing data set of the identified positive variation sites and the negative sequencing data set of the identified negative variation sites. The positive sequencing data set can be divided into SNV variation type and INDEL variation type. The genotype of variation, i.e., Hom or Het, can be accurately known based on these two variation types.
[0023] It should be noted that the method for orthogonal test analysis according to the present disclosure is not strictly limited, as long as it can determine whether the potentially pathogenic variation data are true positive or false positive. It can be carried out using conventional techniques in the art; see, for example, Sanger F. DNA sequencing with chain-terminating inhibitors, 1977[J]. Biotechnology (Reading, Mass.), 24:104-108.
[0024] According to an embodiment of the present disclosure, the model is selected from a random forest classification model and has a threshold of 0.95 ± 0.05. The setting of the threshold ensures a sufficient accuracy rate and reduces accidental errors. With the adjustable threshold setting, the accuracy rate and the orthogonal test rate can be balanced on the premise of ensuring sufficient accuracy.
[0025] According to an embodiment of the present disclosure, the positive sequencing data set of the identified positive variation sites and the negative sequencing data set of the identified negative variation sites are each divided into a training set and a test set (3:1), a random forest classification model is selected, and the model with the highest accuracy rate is selected through 5-fold cross validation.
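The data partitioning described in this paragraph can be sketched in plain Python as follows. This is only a minimal sketch under stated assumptions: the helper names stratified_split and k_folds, the random seeds, and the toy label counts are hypothetical, and the random forest itself is omitted. In practice a classifier from a machine learning library would be fitted on each fit_idx subset and scored on the matching validation subset, and the best-scoring configuration would be kept.

```python
import random
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def stratified_split(labels: Sequence[str], test_fraction: float = 0.25,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_idx, test_idx), splitting each label group about 3:1."""
    rng = random.Random(seed)
    by_label: Dict[str, List[int]] = defaultdict(list)
    for i, lab in enumerate(labels):
        by_label[lab].append(i)
    train: List[int] = []
    test: List[int] = []
    for idx in by_label.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def k_folds(indices: Sequence[int], k: int = 5,
            seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (fit_idx, validation_idx) pairs for k-fold cross validation."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k near-equal, disjoint folds
    for i in range(k):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, folds[i]


# Toy labels only; these counts are not the data of the disclosure.
labels = ["Het"] * 80 + ["Hom"] * 8 + ["N"] * 40
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 96 32
folds = list(k_folds(train_idx))
```

Splitting within each label group keeps the rarer Hom class represented in both the training set and the test set, which a purely random 3:1 split would not guarantee.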
[0026] According to an embodiment of the present disclosure, the method for constructing a model for analyzing variation detection result includes the following steps.
[0027] 1. First, clinical genomic data are acquired and aligned with the human reference genome (hg19), and GATK is used to identify the variations and output a VCF file.
[0028] 2. The potentially pathogenic variations are acquired based on standard clinical interpretation, the accuracy of these variations is then verified through the orthogonal experiment, and the accurate genotypes, namely homozygous (Hom), heterozygous (Het), and no variation (N), are provided.
[0029] 3. The VCF file is then converted to machine learning labels, from which a total of 9 features are acquired (see Table 1).
[0030] 4. Depending on the variation type (SNV or INDEL), two different machine learning classification models are constructed from the features extracted from the VCF file, and the optimal parameters are acquired by grid search.
[0031] 5. The data are divided into a training set and a test set (3:1) as described above, a random forest classification model is selected, and the model with the highest accuracy rate is selected through 5-fold cross validation.
Method for analyzing variation detection result
[0032] In another aspect, the present disclosure provides a method for analyzing variation detection result. According to an embodiment of the present disclosure, the method includes: acquiring a candidate positive variation data set; and predicting, by analyzing the candidate
positive variation data set with a machine learning model, whether positive variation data in the
candidate positive variation data set are false positive and/or variation site genotype, the machine learning model being acquired by the above-mentioned method for constructing the
model for analyzing variation detection result. In this way, by using the model obtained by the
method according to the present disclosure, it can be accurately predicted whether the candidate positive variation data are false positives, while the variation genotype can be determined,
thereby quickly and precisely locating the possible variations and reducing the cost and turnover time of orthogonal experiment.
[0033] According to an embodiment of the present disclosure, the candidate positive variation data set is acquired by: acquiring a sequencing data set; and aligning, by using GATK software, the sequencing data set with reference data to acquire the candidate positive variation
data set.
[0034] According to an embodiment of the present disclosure, the model is selected from a random forest classification model, and when a confidence of the candidate positive variation data is lower than a threshold of the model, the candidate positive variation data is subjected to an orthogonal test analysis to determine whether the positive variation data in the candidate positive variation data set are false positive. The data lower than the threshold are referred to as gray zone data, for which the accuracy rate of predicting false positives using the model is low. Therefore, this part of the data must be subjected to orthogonal experimental verification to accurately determine whether it is false positive.
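The gray zone rule described above can be sketched as a small routing function. This is a hypothetical illustration: the function name triage_call and the example probabilities are made up, and the confidence is assumed to be the largest of the class probabilities for Het, Hom, and N returned by the classifier.

```python
from typing import Dict, Tuple


def triage_call(class_probs: Dict[str, float],
                threshold: float = 0.95) -> Tuple[str, float, bool]:
    """Return (predicted_label, confidence, needs_orthogonal_test).

    Assumption: confidence is the maximum class probability. Calls whose
    confidence is below the threshold fall into the gray zone and are
    flagged for orthogonal verification instead of being trusted.
    """
    label = max(class_probs, key=class_probs.get)
    confidence = class_probs[label]
    return label, confidence, confidence < threshold


# Made-up probabilities for two candidate variations.
print(triage_call({"Het": 0.97, "Hom": 0.01, "N": 0.02}))  # ('Het', 0.97, False)
print(triage_call({"Het": 0.60, "Hom": 0.05, "N": 0.35}))  # ('Het', 0.6, True)
```

Only the second call would be sent for orthogonal testing; the first is accepted as a confident Het prediction.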
[0035] Those skilled in the art can appreciate that the features and advantages described above with respect to the method for constructing a model for analyzing variation detection result are equally applicable to the method for analyzing variation detection result, and are not described in detail herein.
Apparatus for constructing a model for analyzing variation detection result
[0036] In yet another aspect, the present disclosure provides an apparatus for constructing a model for analyzing variation detection result. According to an embodiment of the present
disclosure, the apparatus includes: an acquisition module configured to acquire a positive
sequencing data set of identified positive variation sites and a negative sequencing data set of identified negative variation sites; an extraction module configured to extract features of
variation sites from the positive sequencing data set and the negative sequencing data set; and
a construction module configured to construct a model based on features and results acquired by the extraction module. The features include at least one of: AD0 value representing a depth
of the first allele in the variation site genotype; AD1 value representing a depth of the second
allele in the variation site genotype; AF0 value representing a frequency of the first allele in the variation site genotype; AF1 value representing a frequency of the second allele in the
variation site genotype; GT value representing a single numerical value; DP value representing a sequencing depth value; GQ value representing a quality value of variation site genotype; MQ value representing a quality of mapping of variation sites; and QUAL value representing a
quality value of probability of the variation sites. Thus, with the model obtained by the apparatus according to the present disclosure, it can be accurately predicted whether positive
variation data are false positives, while the variation genotype can be determined, thereby
quickly and precisely locating the possible variations and reducing the cost and turnover time of orthogonal experiment.
[0037] According to an embodiment of the present disclosure, the acquisition module includes: a sequencing data set acquisition module configured to acquire a sequencing data set; an aligning module configured to align, by using GATK software, the sequencing data set with reference data to acquire a candidate positive variation data set; and an analysis processing module configured to perform analysis processing on the candidate positive variation data set to acquire the positive sequencing data set of the identified positive variation sites and the negative sequencing data set of the identified negative variation sites. The acquisition module can accurately determine the positive variation site data and negative variation site data in the sequencing data set, and can also determine the genotype of the positive variation site.
[0038] According to an embodiment of the present disclosure, the analysis processing module includes a standard clinical interpretation module configured to perform standard
clinical interpretation on the positive variation data to acquire data of potentially pathogenic variations; and an orthogonal test analysis module configured to perform orthogonal test
analysis on the data set of potentially pathogenic variations to acquire the positive sequencing
data set of the identified positive variation sites and the negative sequencing data set of the identified negative variation sites.
Executable storage medium
[0039] In yet another aspect, the present disclosure provides an executable storage medium. According to an embodiment of the present disclosure, the storage medium has computer
program instructions stored thereon. The computer program instructions, when executed on a
processor, cause the processor to implement the method for analyzing variation detection result as described above. Thus, with the storage medium of the present disclosure, it can be accurately
predicted whether positive variation data are false positives, while the variation genotype can be determined, thereby quickly and precisely locating the possible variations and reducing the cost and turnover time of orthogonal experiment.
[0040] Those skilled in the art can appreciate that the features and advantages described above in relation to the method for analyzing variation detection result are equally applicable
to the executable storage medium, and are not described further herein.
Electronic device
[0041] In yet another aspect, the present disclosure provides an electronic device. According to an embodiment of the present disclosure, the electronic device includes the
executable storage medium described above, and a processor configured to execute the computer program to implement the method for analyzing variation detection result described above. Thus, by implementing the electronic device of the present disclosure, it can be accurately predicted whether positive variation data are false positives, while the variation genotype can be determined, thereby quickly and precisely locating the possible variations and reducing the cost and turnover time of orthogonal experiment.
[0042] Those skilled in the art can appreciate that the features and advantages described above with respect to the method for analyzing variation detection result and the executable storage medium are equally applicable to the electronic device, and are not described in detail herein.
[0043] The embodiments of the present disclosure will be explained with reference to the following examples. It will be understood by those skilled in the art that the following examples
are merely illustrative of the present disclosure and are not to be construed as limiting the scope
of the present disclosure. Where specific techniques or conditions are not specified in the examples, they are performed according to techniques or conditions described in the literature
in the art or according to the product description. The reagents or instruments used without
indicating the manufacturer are conventional and commercially available products.
Example 1
[0044] 1. WES data of 5190 clinical patients were acquired, and GATK software was used to perform alignment, variation detection, annotation, and filtration on the data and human genome hg19 to acquire a VCF file.
[0045] 2. The VCF file was subjected to the standard clinical interpretation process, to acquire 7375 potentially pathogenic variants by analysis.
[0046] 3. The above-mentioned 7375 variants were verified by orthogonal experiments (for details, refer to Sanger F. DNA sequencing with chain-terminating inhibitors, 1977[J]. Biotechnology (Reading, Mass.), 24:104-108), and it was determined that these variations included 5241 of variation type SNV and 2134 of variation type INDEL. There were 3226 Het genotypes, 63 Hom genotypes, and 1952 negative variants in SNV; and there were 1606 Het genotypes, 138 Hom genotypes, and 390 negative variants in INDEL.
[0047] 4. The data from the previous step were divided into a training set and a test set (3:1). A random forest classification model was established on the training set for each variation type, with all the features in the training set taken as candidate features. Principal component analysis was then performed, and finally the 9 features listed in Table 2 were determined.
[0048] [Table 2] Feature importance in the random forest classification models established for the variation types SNV and INDEL
Features  SNV_MODEL  INDEL_MODEL
AD0       0.0305     0.0389
AD1       0.0365     0.0606
AF0       0.3350     0.3135
AF1       0.2352     0.3027
GT        0.0078     0.0289
DP        0.0300     0.0174
GQ        0.0787     0.0691
MQ        0.0139     0.0141
QUAL      0.2324     0.1548
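As an illustration of where the nine features of Table 2 could come from, the sketch below reads them out of one single-sample VCF data line (the two allele depths and frequencies are written here as AD0, AD1, AF0, and AF1). Several points are assumptions not stated in the source: the two depths are taken as the first two entries of the FORMAT field AD, each frequency is computed as that depth divided by their sum, and GT is encoded as the number of non-reference alleles. The function name `extract_features` and the example line are hypothetical.

```python
def extract_features(vcf_line):
    """Build the nine-feature record from one single-sample VCF data line."""
    cols = vcf_line.rstrip("\n").split("\t")
    qual = float(cols[5])  # QUAL column
    info = dict(kv.split("=", 1) for kv in cols[7].split(";") if "=" in kv)
    sample = dict(zip(cols[8].split(":"), cols[9].split(":")))

    # Assumption: AD0 and AD1 are the first two comma-separated AD entries.
    ad0, ad1 = (int(x) for x in sample["AD"].split(",")[:2])
    total = ad0 + ad1
    # Assumption: allele frequency = allele depth / summed depth of both alleles.
    af0 = ad0 / total if total else 0.0
    af1 = ad1 / total if total else 0.0
    # Assumption: GT is encoded as the count of non-reference alleles.
    alleles = sample["GT"].replace("|", "/").split("/")
    gt = sum(1 for a in alleles if a not in ("0", "."))

    return {
        "AD0": ad0, "AD1": ad1, "AF0": af0, "AF1": af1, "GT": gt,
        "DP": int(sample["DP"]), "GQ": int(sample["GQ"]),
        "MQ": float(info["MQ"]), "QUAL": qual,
    }


# Hypothetical heterozygous SNV record, for illustration only.
line = "chr1\t12345\t.\tA\tG\t812.77\tPASS\tDP=40;MQ=60.00\tGT:AD:DP:GQ\t0/1:18,22:40:99"
features = extract_features(line)
print(features)
```

Records produced this way, paired with the orthogonally verified label (Het, Hom, or negative), would form the rows on which a classifier is trained.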
[0049] The test set accuracy rates of the SNV and INDEL models were 94.8% and 93.8%, respectively, and the accuracy rates for the different genotypes are shown in Table 3.
[0050] [Table 3] Accuracy rate of different genotypes in the random forest classification models established for the variation types SNV and INDEL
Genotype      SNV_MODEL (%)  INDEL_MODEL (%)
Het           92.9           80.5
Hom           100            92.1
N (negative)  96.3           97.2
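A per-genotype accuracy of the kind reported in Table 3 can be computed along the following lines. The source does not define the calculation, so this sketch assumes that the accuracy for a genotype is the share of test sites with that verified genotype that the model predicts correctly. The function name and the toy labels are hypothetical and are not data from the study.

```python
from collections import Counter


def per_genotype_accuracy(true_labels, predicted_labels):
    """Return {genotype: percentage of sites of that genotype predicted correctly}."""
    totals, correct = Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {g: 100.0 * correct[g] / totals[g] for g in totals}


# Toy labels for illustration only.
truth = ["Het", "Het", "Het", "Het", "Hom", "Hom", "N", "N", "N", "N"]
pred = ["Het", "Het", "Het", "N", "Hom", "Hom", "N", "N", "N", "Het"]
acc = per_genotype_accuracy(truth, pred)
print(acc)
```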
[0051] Considering the accuracy required of clinical data, this method acquires different accuracy rates and orthogonal experiment ratios (see Table 4) by defining different thresholds (confidence of the random forest results) for the test set. The accuracy rate is the number of correct judgments divided by the total number of samples meeting the threshold, and the orthogonal experiment ratio is the number of samples below the threshold divided by the total number of test samples. Provided that a sufficient accuracy rate is satisfied, the threshold giving the smallest possible orthogonal experiment ratio is selected as the target threshold, and the threshold is finally determined to be 0.95, adjustable within a range of ±0.05. The results indicate that the proposed method has a certain tolerance to noisy data, data redundancy, and low-quality data, with excellent robustness.
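The two quantities defined in this paragraph can be computed as in the sketch below. It assumes that the confidence of a call is the largest of the per-class probabilities (Het, Hom, N) returned by the model and that "meeting the threshold" means a confidence greater than or equal to it. The function name and the toy probabilities are hypothetical.

```python
def evaluate_threshold(probabilities, true_labels, threshold):
    """Return (accuracy rate, orthogonal experiment ratio) for one threshold.

    probabilities: list of dicts mapping class label -> model confidence.
    """
    accepted = correct = 0
    for probs, truth in zip(probabilities, true_labels):
        label = max(probs, key=probs.get)  # predicted class = most confident one
        if probs[label] >= threshold:  # call accepted without re-testing
            accepted += 1
            correct += (label == truth)
    referred = len(true_labels) - accepted  # calls sent to orthogonal experiments
    accuracy = correct / accepted if accepted else float("nan")
    return accuracy, referred / len(true_labels)


# Toy model outputs for illustration only.
probs = [
    {"Het": 0.97, "Hom": 0.02, "N": 0.01},
    {"Het": 0.60, "Hom": 0.05, "N": 0.35},
    {"Het": 0.01, "Hom": 0.01, "N": 0.98},
    {"Het": 0.96, "Hom": 0.01, "N": 0.03},
]
truth = ["Het", "N", "N", "N"]
strict = evaluate_threshold(probs, truth, 0.95)
default = evaluate_threshold(probs, truth, 0.0)
print(strict, default)
```

In this toy case, raising the threshold from the default to 0.95 refers one of the four calls to orthogonal verification and raises the accuracy among the accepted calls, which is the trade-off the paragraph describes.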
[0052] [Table 4] Different thresholds and ratios of required orthogonal experiments in the random forest classification models established for the variation types SNV and INDEL
Variation type  Threshold                                            Accuracy rate (%)  Orthogonal experiment ratio (%)
SNV             Default (the maximum of Het, Hom, and N (negative))  94.8               0
SNV             0.9                                                  98.1               14.8
SNV             0.92                                                 98.3               17.2
SNV             0.94                                                 98.6               20.8
SNV             0.96                                                 98.7               26.7
SNV             0.99                                                 99.1               87.1
INDEL           Default (the maximum of Het, Hom, and N (negative))  93.8               0
INDEL           0.9                                                  97.8               21.1
INDEL           0.92                                                 98.0               23.7
INDEL           0.94                                                 98.0               27.8
INDEL           0.96                                                 97.7               34.4
INDEL           0.99                                                 98.1               69.5
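The selection rule stated in paragraph [0051] (meet a sufficient accuracy rate, then minimize the orthogonal experiment ratio) can be applied to the rows of Table 4 as sketched below. The accuracy floors of 98.5% and 98.0% are illustrative assumptions, since the source does not state which floor was used, and the sketch is not claimed to reproduce the reported final threshold of 0.95. The function name `pick_threshold` is hypothetical.

```python
# (threshold, accuracy rate %, orthogonal experiment ratio %) rows from Table 4.
TABLE_4 = {
    "SNV": [(0.90, 98.1, 14.8), (0.92, 98.3, 17.2), (0.94, 98.6, 20.8),
            (0.96, 98.7, 26.7), (0.99, 99.1, 87.1)],
    "INDEL": [(0.90, 97.8, 21.1), (0.92, 98.0, 23.7), (0.94, 98.0, 27.8),
              (0.96, 97.7, 34.4), (0.99, 98.1, 69.5)],
}


def pick_threshold(rows, min_accuracy):
    """Among rows meeting min_accuracy, return the one with the lowest ratio."""
    eligible = [r for r in rows if r[1] >= min_accuracy]
    if not eligible:
        return None  # no threshold reaches the requested accuracy
    return min(eligible, key=lambda r: r[2])


# Assumed accuracy floors, chosen only for illustration.
snv_choice = pick_threshold(TABLE_4["SNV"], min_accuracy=98.5)
indel_choice = pick_threshold(TABLE_4["INDEL"], min_accuracy=98.0)
print(snv_choice, indel_choice)
```

The table also shows why the floor matters: moving the SNV threshold from 0.96 to 0.99 gains 0.4 percentage points of accuracy while the share of calls needing orthogonal experiments rises from 26.7% to 87.1%.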
[0053] In the specification, references to descriptions of the terms "an embodiment", "some embodiments", "examples", "specific examples", or "some examples", etc. mean that a particular feature, structure, material, or characteristic described in connection with the
embodiment or example is included in at least one embodiment or example of the present
disclosure. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features,
structures, materials, or characteristics described may be combined in any suitable manner in
any one or more embodiments or examples. Furthermore, the various embodiments or examples and the features of the various embodiments or examples described in this specification can be combined by those skilled in the art without departing from the scope of the present disclosure.
[0054] While the embodiments of the present disclosure are illustrated and described above, it will be understood that the above-described embodiments are illustrative and not restrictive and that those skilled in the art can make changes, modifications, substitutions, and alterations without departing from the scope of the present disclosure.

Claims (13)

What is claimed is:
  1. A method for constructing a model for analyzing variation detection result, the method comprising:
    acquiring a positive sequencing data set of identified positive variation sites and a negative sequencing data set of identified negative variation sites;
    extracting features of variation sites from the positive sequencing data set and the negative sequencing data set; and
    constructing a model based on features and results acquired in the previous step,
    wherein the features comprise at least one of:
    AD0 value: a depth of a first allele in a variation site genotype;
    AD1 value: a depth of a second allele in the variation site genotype;
    AF0 value: a frequency of the first allele in the variation site genotype;
    AF1 value: a frequency of the second allele in the variation site genotype;
    GT value: a single numerical value;
    DP value: a sequencing depth value;
    GQ value: a quality value of the variation site genotype;
    MQ value: a quality of mapping of the variation sites; and
    QUAL value: a quality value of probability of the variation sites.
  2. The method according to claim 1, wherein the positive sequencing data set of the identified positive variation sites and the negative sequencing data set of the identified negative variation sites are acquired by:
    acquiring a sequencing data set;
    aligning, by using GATK software, the sequencing data set with reference data to acquire a candidate positive variation data set; and
    performing analysis processing on the candidate positive variation data set to acquire the positive sequencing data set of the identified positive variation sites and the negative sequencing data set of the identified negative variation sites.
  3. The method according to claim 2, wherein the reference data is selected from human genome hg19.
  4. The method according to claim 2, wherein the analysis processing comprises:
    performing standard clinical interpretation on the candidate positive variation data set to acquire a data set of potentially pathogenic variations; and
    performing orthogonal test analysis on the data set of potentially pathogenic variations to acquire the positive sequencing data set of the identified positive variation sites and the negative sequencing data set of the identified negative variation sites,
    wherein the positive sequencing data set comprises an SNV variation type data set and an INDEL variation type data set, each of the SNV variation type data set and the INDEL variation type data set comprising a homozygous genotype data set and a heterozygous genotype data set.
  5. The method according to claim 1, wherein the model is selected from a random forest classification model and has a threshold of 0.95 ± 0.05.
  6. A method for analyzing variation detection result, comprising:
    acquiring a candidate positive variation data set; and
    predicting, by analyzing the candidate positive variation data set with a machine learning model, whether positive variation data in the candidate positive variation data set are false positive and/or variation site genotype, the machine learning model being acquired by the method for constructing the model for analyzing variation detection result according to any one of claims 1 to 5.
  7. The method according to claim 6, wherein the candidate positive variation data set is acquired by:
    acquiring a sequencing data set; and
    aligning, by using GATK software, the sequencing data set with reference data to acquire the candidate positive variation data set.
  8. The method according to claim 6, wherein:
    the model is selected from a random forest classification model; and
    when a confidence of the candidate positive variation data is lower than a threshold of the model, the candidate positive variation data is subjected to an orthogonal test analysis, to predict whether the positive variation data in the candidate positive variation data set are false positive.
  9. An apparatus for constructing a model for analyzing variation detection result, comprising:
    an acquisition module configured to acquire a positive sequencing data set of identified positive variation sites and a negative sequencing data set of identified negative variation sites;
    an extraction module configured to extract features of variation sites from the positive sequencing data set and the negative sequencing data set; and
    a construction module configured to construct a model based on features and results acquired by the extraction module,
    wherein the features comprise at least one of:
    AD0 value: a depth of a first allele in a variation site genotype;
    AD1 value: a depth of a second allele in the variation site genotype;
    AF0 value: a frequency of the first allele in the variation site genotype;
    AF1 value: a frequency of the second allele in the variation site genotype;
    GT value: a single numerical value;
    DP value: a sequencing depth value;
    GQ value: a quality value of the variation site genotype;
    MQ value: a quality of mapping of the variation sites; and
    QUAL value: a quality value of probability of the variation sites.
  10. The apparatus according to claim 9, wherein the acquisition module comprises:
    a sequencing data set acquisition module configured to acquire a sequencing data set;
    an aligning module configured to align, by using GATK software, the sequencing data set with reference data to acquire a candidate positive variation data set; and
    an analysis processing module configured to perform analysis processing on the candidate positive variation data set to acquire the positive sequencing data set of the identified positive variation sites and the negative sequencing data set of the identified negative variation sites.
  11. The apparatus according to claim 10, wherein the analysis processing module comprises:
    a standard clinical interpretation module configured to perform standard clinical interpretation on the positive variation data to acquire a data set of potentially pathogenic variations; and
    an orthogonal test analysis module configured to perform orthogonal test analysis on the data set of potentially pathogenic variations to acquire the positive sequencing data set of the identified positive variation sites and the negative sequencing data set of the identified negative variation sites.
  12. An executable storage medium, having computer program instructions stored thereon, wherein the computer program instructions, when executed on a processor, cause the processor to implement the method for analyzing variation detection result according to any one of claims 6 to 8.
  13. An electronic device, comprising:
    the executable storage medium according to claim 12; and
    a processor configured to execute a computer program to implement the method for analyzing variation detection result according to any one of claims 6 to 8.
AU2023261122A 2022-04-25 2023-03-15 Construction method for model for analyzing variation detection result Pending AU2023261122A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN202210443091.0A CN116994647A (en) 2022-04-25 2022-04-25 Method for constructing model for analyzing mutation detection result
CN202210443091.0 2022-04-25
PCT/CN2023/081719 WO2023207396A1 (en) 2022-04-25 2023-03-15 Construction method for model for analyzing variation detection result

Publications (1)

Publication Number Publication Date
AU2023261122A1 true AU2023261122A1 (en) 2024-09-05

Family

ID=88517243

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2023261122A Pending AU2023261122A1 (en) 2022-04-25 2023-03-15 Construction method for model for analyzing variation detection result

Country Status (3)

Country Link
CN (1) CN116994647A (en)
AU (1) AU2023261122A1 (en)
WO (1) WO2023207396A1 (en)

Also Published As

Publication number Publication date
WO2023207396A1 (en) 2023-11-02
CN116994647A (en) 2023-11-03
