
HK1147528A1 - The method of detecting polymorphic sites in genomic target region - Google Patents

Publication number
HK1147528A1
Authority
HK
Hong Kong
Prior art keywords
snp
sequencing
sample
target region
depth
Prior art date
Application number
HK11101668.6A
Other languages
Chinese (zh)
Other versions
HK1147528B (en)
Inventor
李英睿
余昶
罗锐邦
张帆
Original Assignee
深圳华大基因科技服务有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因科技服务有限公司
Publication of HK1147528A1
Publication of HK1147528B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Disclosed are a method and a system for detecting polymorphic sites in a genomic target region. The method comprises a step of acquiring exon sequencing results, a redundancy removal and sorting step, a statistical analysis step I, an SNP site detection step, an SNP site filtering step, a statistical analysis step II, and an SNP annotation step.

Description

Method for detecting polymorphism sites of target region of genome
Technical Field
The invention relates to the technical field of biology, in particular to a method for detecting a polymorphic site in a genome target region.
Background
Following the success of the Human Genome Project and the International HapMap Project, biologists have located a large number of candidate genomic regions associated with human disease through genetic linkage or association analysis; however, identifying the causative genes or mutations in these regions requires re-sequencing them.
If the existing whole genome re-sequencing analysis technology is adopted, the cost is higher; moreover, for the research of parts such as candidate regions and the like or for the specific guidance given to individual medical treatment, the result of the whole genome re-sequencing analysis contains a large amount of redundant information, which is not beneficial to efficiently obtaining more accurate research results.
Concentrating existing gene analysis technology on high-value gene regions improves the efficiency with which useful information is obtained, and is of great significance for scientific research and medical guidance. However, the conventional PCR (Polymerase Chain Reaction)-based method for sequencing candidate regions is time-consuming and labor-intensive and cannot meet researchers' needs; meanwhile, gene-chip-based SNP (single nucleotide polymorphism) typing cannot find rare variants in the genome.
With the advent of new generation high throughput sequencing technologies (such as Solexa sequencing technology) and the reduction of sequencing cost, high throughput, low cost sequencing is possible. Researchers are keenly demanding a technique that can sequence any region of interest on a genome so that various mutations can be identified on that region.
Because mutation of the gene coding region is the main cause of diseases, all coding regions (namely exon regions) of a human genome are extracted and sequenced, so that the genome mutation information of an individual can be well known, and the individual risk of diseases can be further evaluated. Thus, while the cost of sequencing the entire genome is still high today, sequencing all human exons is an important means of decoding the individual genome and enabling individualized medicine.
Therefore, high-throughput sequencing methods based on exon region capture or target region capture have been developed. The basic principle of this technique is to use a set of oligonucleotide probes to capture target sequences on the genome, then to amplify the captured sequences by PCR with universal primers, and finally to perform high-throughput sequencing of the amplified products to identify the base sequences in the DNA sample.
In summary, providing a method and a system for detecting polymorphic sites in a genomic target region, so as to overcome the shortcomings of existing genomic exon analysis such as imperfect detection means, complex data, low accuracy and low analysis speed, has become a technical problem to be solved in the field.
Disclosure of Invention
The invention aims to solve the technical problem of providing a method for detecting the polymorphic sites in a target region of a genome.
One aspect of the present invention provides a method for detecting a polymorphic site in a target region of a genome, the method comprising: a step of obtaining exon sequencing results: sequencing and purifying a human genomic DNA sample to obtain exon region sequencing results, and aligning (comparing) the exon region sequencing results with a reference gene sequence to obtain an accurate alignment result; a redundancy removal and sorting step: removing repeated information from the alignment result and sorting it; statistical analysis step I: carrying out depth and coverage statistics on the global target region, testing the sex of the sample by using the sequencing depth of the target regions of the X and Y chromosomes, and judging whether the sample is contaminated; an SNP site detection step: finding SNP sites from the sorted result; an SNP site filtering step: screening the detected SNP sites by taking the quality value as an index; statistical analysis step II: counting the coverage of the filtered SNP sites, analyzing the optimal allele support depth and the suboptimal allele support depth of each SNP site, and judging whether the sample is contaminated; an SNP annotation step: comparing the filtered SNP sites with the information in the dbSNP database, and annotating and classifying the matched SNP sites by combining the data in at least one of the ccds, refseq and ensembl databases.
In one embodiment of the method for detecting the polymorphic site in the target region of the genome, in the step of obtaining the exon sequencing result, the linker sequence and the adapter sequence which are contained in the sequencing result and introduced in the sequencing process are removed to realize purification treatment; and comparing the sequencing result of the exon region with the reference gene sequence by using a Soap tool to obtain an accurate comparison result.
In one embodiment of the method for detecting the polymorphic locus of the genome target region, in the redundancy removing and ordering step, the comparison result is ordered according to the chromosome and the coordinates after removing the repeated information, and the ordered result is used as an object to be processed in the SNP locus detecting step.
In one embodiment of the method for detecting the polymorphic sites in the target region of the genome, in the step I of statistical analysis, depth and coverage statistics are carried out on the global target region by using a tool soap.coverage, and a specific distribution map is drawn to reflect the covered uniformity of the target region of a sample and the proportion of bases larger than a preset value; and testing the sex of the sample according to the analysis principle of the support vector machine by using the sequencing depth of the target region of the X chromosome and the Y chromosome; judging whether the sample is polluted or not; if the sample is contaminated during the experimental phase, specific contamination information is given.
In one embodiment of the method for detecting a polymorphic site in a genome target region provided by the present invention, in the step II of statistical analysis, if the analysis of the optimal allele support depth and the suboptimal allele support depth of an SNP site shows that the global SNP heterozygosity rate shows a concentrated trend, it is determined that the sample is contaminated.
In another aspect of the present invention, there is provided a system for detecting a polymorphic site in a target region of a genome, the apparatus comprising: the exon sequencing result acquisition module is used for sequencing and purifying a human genome DNA sample to obtain an exon region sequencing result; comparing the sequencing result of the exon region with a reference gene sequence to obtain an accurate comparison result; the redundancy removing and ordering module is used for removing repeated information and ordering the comparison result obtained after comparison; the statistical analysis module is used for carrying out depth and coverage statistics on the global target region and testing the sex of the sample by using the sequencing depth of the target region of the X chromosome and the Y chromosome; judging whether the sample is polluted or not; counting the coverage of the filtered SNP sites, analyzing the optimal allele support depth and the suboptimal allele support depth of each SNP site, and judging whether the sample is polluted; the SNP locus detection module is used for finding out SNP loci from the sequencing processed result; the SNP locus filtering module is used for screening the SNP loci obtained by detection by taking the quality value as an index; and the SNP annotation module is used for comparing the filtered SNP sites with the information in the dbSNP database, and annotating and classifying the matched SNP sites by combining the data in at least one of the ccds, refseq and ensembl databases.
In an embodiment of the system for detecting a polymorphic site in a target genomic region provided by the present invention, the exon sequencing result obtaining module further comprises: a purification processing submodule used for removing the linker sequence and the adapter sequence which are contained in the sequencing result and are introduced by the sequencing process; and the comparison submodule is used for comparing the sequencing result of the exon region with the reference gene sequence by using a Soap tool to obtain an accurate comparison result.
In an embodiment of the system for detecting polymorphic sites in a target region of a genome provided by the present invention, the redundancy removing and sorting module further comprises: the redundancy removing submodule is used for removing repeated information processing on a comparison result obtained after comparison; and the sequencing submodule is used for sequencing the comparison result after the repeated information is removed according to the chromosome and the coordinate, and the result after sequencing is used as an object to be processed by the SNP locus detection module.
In an embodiment of the system for detecting polymorphic sites in a target region of a genome provided by the present invention, the statistical analysis module further comprises: the first statistical analysis submodule is used for carrying out depth and coverage statistics on the global target region and testing the sex of the sample by using the sequencing depth of the target region of the X chromosome and the Y chromosome; judging whether the sample is polluted or not; and the second statistical analysis submodule is used for counting the coverage of the filtered SNP sites, analyzing the optimal allele support depth and the suboptimal allele support depth of each SNP site and judging whether the sample is polluted.
In one embodiment of the system for detecting the polymorphic sites in the target region of the genome, the first statistical analysis submodule carries out depth and coverage statistics on the global target region by adopting a tool soap.coverage, and draws a specific distribution diagram to reflect the covered uniformity of the target region of a sample and the proportion of bases larger than a preset value; and testing the sex of the sample according to the analysis principle of the support vector machine by using the sequencing depth of the target region of the X chromosome and the Y chromosome; judging whether the sample is polluted or not; if the sample is polluted in the experimental stage, specific pollution information is given; the second statistical analysis submodule performs statistics on the coverage of the filtered SNP sites and analyzes the optimal allele support depth and the suboptimal allele support depth of each SNP site; and if the optimal allele support depth and the suboptimal allele support depth of the SNP locus show that the global SNP heterozygosity rate presents a centralized trend, judging that the sample is polluted.
The invention provides a method and a system for detecting a polymorphism site of a genome target region, which are used for carrying out SNP analysis on sequencing of a genome specific region and have the advantages of high accuracy, high speed and low cost of SNP detection results.
Furthermore, the whole process of detecting the polymorphism sites of the genome target region can be automated, namely, the original sequencing data is used as a data source, high-quality SNP sites are automatically generated, and the SNP sites are annotated and classified.
Furthermore, the problems of incomplete bioinformatics analysis methods and tools of the genomic exon regions are solved by carrying out tests on the depth, coverage analysis, capture efficiency analysis, sex test, SNP site heterozygosity consistency and the like on the experimental samples, and the accuracy and reliability of genomic exon data analysis are greatly improved.
Furthermore, by carrying out operations such as comparison, SNP locus annotation and classification on the sequencing of a specific region of a genome, the high-accuracy SNP annotation result is efficiently and quickly obtained, the guarantee is provided for decoding the personal genome and realizing the individualized medical treatment, and the problem that the bioinformatics analysis method and tool of the exon region of the genome are incomplete is solved.
Drawings
FIG. 1 is a flow chart of a method for detecting polymorphic sites in a target region of a genome according to an embodiment of the present invention;
FIG. 2 is a flow chart showing another embodiment of the method for detecting polymorphic sites in a target region of a genome provided by the present invention;
FIG. 3 is a flow chart showing another embodiment of the method for detecting polymorphic sites in a target region of a genome provided by the present invention;
FIG. 4 is a flow chart showing one embodiment of the method for detecting polymorphic sites in a target region of a genome provided by the present invention;
FIG. 5 shows a target region depth distribution histogram drawn after depth and coverage statistics are performed on the target region using soap.coverage in the embodiment shown in FIG. 4;
FIG. 6 shows a target region cumulative depth distribution graph drawn after depth and coverage statistics are performed on the target region using soap.coverage in the embodiment shown in FIG. 4;
FIG. 7 shows a sequencing depth saturation curve drawn after depth and coverage statistics are performed on the target region using soap.coverage in the embodiment shown in FIG. 4;
FIG. 8 shows an SNP site heterozygosity scatter plot drawn after analyzing the optimal allele support depth and the suboptimal allele support depth of each SNP site in the embodiment shown in FIG. 4;
FIG. 9 is a schematic structural diagram of a system for detecting polymorphic sites in a target region of a genome according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram showing another embodiment of the system for detecting polymorphic sites in a target region of a genome provided by the present invention;
FIG. 11 is a schematic structural diagram showing another embodiment of the system for detecting polymorphic sites in a target region of a genome provided by the present invention;
FIG. 12 is a schematic structural diagram showing another embodiment of the system for detecting a polymorphic site in a target region of a genome according to the present invention.
Detailed Description
The present invention will now be described and explained more fully hereinafter with reference to the accompanying drawings, in which exemplary embodiments of the invention are shown.
FIG. 1 is a flowchart illustrating a method for detecting polymorphic sites in a target region of a genome according to an embodiment of the present invention.
As shown in FIG. 1, the method 100 for detecting polymorphic sites in a target region of a genome comprises step 102, obtaining exon sequencing results: sequencing and purifying a human genomic DNA sample to obtain exon region sequencing results, and aligning the exon region sequencing results with the reference gene sequence to obtain an accurate alignment result. In this embodiment, sequencing can use a high-throughput sequencing technology such as Illumina GA Solexa sequencing. Solexa is a sequencing method based on Sequencing-By-Synthesis (SBS) technology, in which bridge PCR is carried out on a small chip (Flow Cell) using a single-molecule array. Its reversible blocking technology allows only one base, labeled with a fluorescent group, to be synthesized at a time; the fluorescent group is then excited with the corresponding laser and the emitted light is captured, so that the base information is read.
In one embodiment of the invention, the purified exon region sequencing results can be aligned to a reference genome (which can come from genome information published by a standardization organization) using the soap tool developed independently by the applicant (Shenzhen Huada Gene science and technology Co., Ltd.) to obtain an accurate alignment result; the software is freely available from http://soap.genomics.org.cn/. The specific methods used by the soap tool can be found in the literature: Ruiqiang Li, Yingrui Li, Karsten Kristiansen and Jun Wang; SOAP: a short oligonucleotide alignment program; Bioinformatics; 2008, 24(5): 713-714; doi: 10.1093.
Step 104, redundancy removal and sorting step: removing repeated information from the alignment result and sorting it. In one embodiment provided by the invention, after the repeated information is removed the alignment result is sorted by chromosome and coordinate, and the sorted result is used as the input of the SNP site detection step.
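The redundancy removal and sorting of step 104 can be illustrated with a minimal Python sketch. The source does not define what counts as repeated information, so the `Alignment` record and the rule "same chromosome, start coordinate and strand" are assumptions made only for illustration; this is not the soap tool's own logic.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical alignment record used only for this sketch."""
    chrom: str
    pos: int      # 1-based leftmost coordinate on the reference
    strand: str
    seq: str


def chrom_key(chrom: str):
    """Order numeric chromosomes numerically, then X, Y and others by name."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    return (0, int(name), "") if name.isdigit() else (1, 0, name)


def dedup_and_sort(alignments):
    """Keep the first alignment per (chrom, pos, strand), then sort by
    chromosome and coordinate."""
    seen = set()
    unique = []
    for aln in alignments:
        key = (aln.chrom, aln.pos, aln.strand)  # assumed duplicate criterion
        if key not in seen:
            seen.add(key)
            unique.append(aln)
    return sorted(unique, key=lambda a: (chrom_key(a.chrom), a.pos))
```

Keeping the first occurrence is an arbitrary choice here; a real pipeline might instead keep the duplicate with the best quality.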
Step 106, statistical analysis step I: carrying out depth and coverage statistics on a global target region (target region), and testing the sex of the sample by using the sequencing depth of the target region of the X chromosome and the Y chromosome; and judging whether the sample is polluted or not, thereby eliminating potential sample pollution.
The target region in the present invention may be a predetermined or known series of reference coordinates that identify the region of interest. In one embodiment of the invention, depth and coverage statistics can be carried out on the target region using the tool soap.coverage, developed independently by the applicant (the software is a complete statistical tool and can be obtained for free), and the download website is http:// soap. A qualitative conclusion of Pure or Polluted can be given in the analysis report.
Step 108, SNP site detection step: finding SNP sites from the sorted result. A single nucleotide polymorphism (SNP) is a variation of a single nucleotide in the genome; SNPs are numerous, highly polymorphic, and form a large class of genetic markers. Such variations in genomic sequence can affect the occurrence of genetic diseases and the response of organisms to various pathogens, chemicals, drugs, vaccines, and the like. Many phenotypic differences in humans, susceptibility to disease, etc. may be associated with SNPs. Therefore, SNPs are widely regarded as a key to realizing personalized medicine, and their analysis and detection are of great value. In one embodiment of the present invention, the SNP sites of interest can be found using the SNP detection tool soapSNP, developed independently by the applicant (the software is freely available, and the download website is http:// soap.). The specific method can be found in the literature: Ruiqiang Li, Yingrui Li, Xiaodong Fang, Huanming Yang, Jian Wang, Karsten Kristiansen and Jun Wang; SNP detection for massively parallel whole-genome resequencing; Genome Res.; 2009, 19: 1124-1132.
Step 110, SNP site filtering step: screening the detected SNP sites by taking the quality value as an index. In one embodiment of the present invention, a quality value threshold of 20 can be predefined (a threshold of 20 corresponds to an error rate of 0.01; sites below this value are considered unreliable), and this threshold is used as the index for screening SNP sites. As will be apparent to those skilled in the art in light of the teachings herein, the screening criteria may vary with the particular sample, and one skilled in the art can select an appropriate threshold according to the circumstances; the exemplary threshold given above is not intended to limit the present invention.
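The stated correspondence between a quality value of 20 and an error rate of 0.01 matches the Phred scale, where the error probability is 10^(-Q/10). The following Python sketch assumes the quality value is Phred-scaled and uses a hypothetical record layout (a dict with a `quality` key); it is an illustration of the filtering rule, not the soapSNP output format.

```python
def phred_to_error(q: float) -> float:
    """Convert a Phred-scaled quality value to an error probability."""
    return 10 ** (-q / 10)


def filter_snps(snps, min_quality=20):
    """Keep SNP records whose quality is at least the threshold.

    Each record is a dict with a 'quality' key (hypothetical layout).
    """
    return [s for s in snps if s["quality"] >= min_quality]


# Hypothetical candidate sites, for illustration only.
candidates = [
    {"chrom": "chr1", "pos": 1000, "quality": 35},
    {"chrom": "chr1", "pos": 2000, "quality": 12},
    {"chrom": "chr2", "pos": 3000, "quality": 20},
]
reliable = filter_snps(candidates)
```

With the default threshold, the site at quality 12 is dropped and the sites at quality 35 and 20 are kept.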
Step 112, statistical analysis step II: counting the coverage of the filtered SNP sites, analyzing the optimal allele support depth and the suboptimal allele support depth of each SNP site, and judging whether the sample is contaminated. Here, the optimal allele support depth is the number of sequenced reads at the current coordinate that are consistent with the optimal genotype. The sample is judged to be contaminated if this analysis shows that the global SNP heterozygosity rate presents a concentrated trend, for example if the scatter points show a linear relationship, the square of the correlation coefficient r approaches 1, and the slope deviates from 0.5 (0.5 being the normal value).
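The contamination check in step 112 can be sketched as a simple least-squares fit. The source does not state which quantities are plotted on the two axes; this sketch assumes the x-axis is the total depth (optimal plus suboptimal allele support) at heterozygous sites and the y-axis is the suboptimal allele support depth, which would give a slope near 0.5 for a clean sample. The thresholds `r2_min` and `slope_tol` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and squared correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("depths must vary to fit a line")
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


def looks_contaminated(best, second, r2_min=0.9, slope_tol=0.1):
    """Flag a sample whose heterozygous sites lie on a line (r^2 near 1)
    whose slope deviates from the expected 0.5.

    best, second: per-site optimal and suboptimal allele support depths.
    """
    total = [b + s for b, s in zip(best, second)]  # assumed x-axis
    slope, r2 = fit_line(total, second)
    return r2 >= r2_min and abs(slope - 0.5) > slope_tol
```

For example, sites with support depths (16, 4), (32, 8), (48, 12), (64, 16) lie exactly on a line of slope 0.2 and would be flagged, while roughly balanced depths would not.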
Step 114, SNP annotation step: comparing the filtered SNP sites with the information in the dbSNP database, and annotating and classifying the matched SNP sites by combining the data in at least one of the ccds (abbreviation of Consensus CDS), refseq and ensembl databases. Among them, dbSNP (Single Nucleotide Polymorphism Database) was established by the United States National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI) and provides the public, free of charge, with an authoritative catalog of genetic variation in different species. The SNP sites found in the current sample are compared with the known SNP site information in the database to identify the variant sites, so that the genes that may be affected can be found, annotated and classified.
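The annotation step amounts to a lookup of each filtered site against a table of known SNPs plus a lookup against gene coordinates. The sketch below uses in-memory stand-ins for both; the identifier `rs0000001`, the gene name `GENE_A` and all coordinates are invented placeholders, and real dbSNP, ccds, refseq or ensembl files would need their own parsers.

```python
def annotate_snps(snps, known_snps, genes):
    """Annotate (chrom, pos, genotype) tuples.

    known_snps: dict mapping (chrom, pos) to an identifier (stand-in for dbSNP).
    genes: list of (chrom, start, end, name) intervals (stand-in for a gene set).
    """
    annotated = []
    for chrom, pos, genotype in snps:
        snp_id = known_snps.get((chrom, pos))
        gene = next(
            (name for c, start, end, name in genes
             if c == chrom and start <= pos <= end),
            None,
        )
        annotated.append({
            "chrom": chrom,
            "pos": pos,
            "genotype": genotype,
            "id": snp_id,
            "status": "known" if snp_id is not None else "novel",
            "gene": gene,
        })
    return annotated


# Placeholder data, for illustration only.
known = {("chr1", 1000): "rs0000001"}
gene_table = [("chr1", 900, 1500, "GENE_A")]
calls = [("chr1", 1000, "AG"), ("chr1", 1200, "CT"), ("chr2", 50, "GG")]
rows = annotate_snps(calls, known, gene_table)
```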
The method for detecting the polymorphic sites in the target region of the genome provided by one embodiment of the invention carries out SNP analysis on the sequencing of the specific region of the genome, and the method has the advantages of high accuracy of SNP detection result, high speed, low cost and realization of automation in the whole process, namely, the original sequencing data is used as a data source, high-quality SNP sites are automatically generated, and the SNP sites are annotated and classified.
FIG. 2 is a flow chart showing another embodiment of the method for detecting a polymorphic site in a genomic target region provided by the present invention.
As shown in FIG. 2, the method 200 for detecting polymorphic sites in a genomic target region comprises steps 202, 203 and 204 to 214, wherein steps 204 to 214 can respectively carry out the same or similar technical contents as steps 104 to 114 shown in FIG. 1; for brevity, these contents are not described again.
As shown in FIG. 2, in step 202 a human genomic DNA sample is sequenced, and the linker sequences and adapter sequences introduced by the sequencing process are removed from the sequencing result, thereby purifying the exon region sequencing results.
Step 203, aligning the exon region sequencing results with the reference gene sequence by using the Soap tool to obtain an accurate alignment result.
FIG. 3 is a flow chart showing another embodiment of the method for detecting a polymorphic site in a genomic target region provided by the present invention.
As shown in FIG. 3, the method 300 for detecting polymorphic sites in a genomic target region comprises steps 302, 304, 306, 307, 308, 309, 310, 312 and 314, wherein steps 302, 304, 308, 310, 312 and 314 can respectively carry out the same or similar technical contents as steps 102, 104, 108, 110, 112 and 114 shown in FIG. 1; for brevity, these contents are not repeated here.
As shown in FIG. 3, after step 304, step 306 is executed to perform depth and coverage statistics on the global target region using the tool soap.coverage. For example, a target region depth distribution histogram can be drawn from the depth and coverage statistics, and the uniformity of coverage of the sample's target region is reflected by how closely the histogram agrees with a Poisson distribution; a target region cumulative depth distribution graph can be drawn to reflect the proportion of bases at or above a given depth relative to the total length; in addition, a sequencing depth saturation curve can be drawn to reflect the relationship between sequencing depth and target region coverage.
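The depth and coverage statistics of step 306 can be illustrated with a short Python sketch that works on a list of per-base depths over the target region. It is a simplified stand-in written for this description, not the soap.coverage tool; the function names and the `min_depth` parameter are assumptions.

```python
from collections import Counter


def depth_statistics(depths, min_depth=1):
    """Return mean depth, coverage (fraction of bases with depth >= min_depth)
    and a histogram mapping each depth value to its number of bases."""
    n = len(depths)
    mean_depth = sum(depths) / n
    coverage = sum(1 for d in depths if d >= min_depth) / n
    histogram = Counter(depths)
    return mean_depth, coverage, histogram


def cumulative_fraction(depths, threshold):
    """Fraction of target bases sequenced to at least `threshold` depth,
    i.e. one point of the cumulative depth distribution."""
    return sum(1 for d in depths if d >= threshold) / len(depths)


# Toy per-base depths for a five-base target region.
toy_depths = [0, 5, 10, 10, 15]
mean_depth, coverage, histogram = depth_statistics(toy_depths)
```

Evaluating `cumulative_fraction` over a range of thresholds gives the points of the cumulative distribution graph, and `histogram` gives the bars of the depth distribution histogram.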
Step 307, checking the sex of the sample according to the analysis principle of an SVM (Support Vector Machine, a widely used statistical learning method) by using the sequencing depth of the target regions of the X and Y chromosomes, and judging whether the sample is contaminated; if so, go to step 309; otherwise, step 310 is performed. That is, the sex test based on X and Y chromosome depth is used to exclude potential sample contamination.
Step 309, if the sample was contaminated in the experimental stage, giving specific contamination information; if the experiment has failed, the process of detecting the polymorphic sites in the target region of the genome can be terminated.
Step 312, judging whether the sample is polluted; if so, go to step 309; otherwise, step 314 is performed.
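The sex test of step 307 compares the sequencing depth of the X and Y chromosome target regions with the recorded sex of the sample. The source uses a support vector machine; the sketch below deliberately swaps in a much simpler rule, a fixed threshold on the Y-to-X depth ratio, only to show the idea of the consistency check. The threshold value 0.2 is hypothetical and would need to be calibrated on real data.

```python
def infer_sex(x_depth: float, y_depth: float, ratio_threshold: float = 0.2) -> str:
    """Infer sex from mean target-region depth on chrX and chrY.

    A simple ratio threshold stands in for the SVM described in the text.
    """
    if x_depth <= 0:
        raise ValueError("X chromosome depth must be positive")
    return "male" if y_depth / x_depth >= ratio_threshold else "female"


def sex_check(x_depth: float, y_depth: float, declared_sex: str) -> bool:
    """Return True when the inferred sex matches the declared sex.

    A mismatch suggests a sample mix-up or contamination.
    """
    return infer_sex(x_depth, y_depth) == declared_sex.lower()
```

For instance, a sample recorded as male with X depth 60 and Y depth 0.5 would fail the check and could be routed to step 309.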
FIG. 4 is a flow chart showing an embodiment of the method for detecting a polymorphic site in a target region of a genome according to the present invention.
In the invention, the steps of the method for detecting polymorphic sites in a target region of a genome can be integrated into the software ECP (Exon Capture Processor), which runs from the command line on a Unix/Linux operating system. The specific operation steps are as follows:
The following command is entered in a terminal on a Linux computer:
list -o outdir -r hg18.fa -t capture_regions/ -i hs.fa.index -p -f ref.fa.stat -x -q 20 -S
ECP command line parameters include:
-r reference sequence path
sample list path (for the list format, see below)
-o output folder path
-t target region folder path
-i path of the soap index file built for the reference sequence
-f reference sequence stat file path
-x whether to generate SNP files
-p whether the data are pair-end
-S generate CNS files
-e exon region file path
-a whether to remove adapters
-L whether to remove linkers
-h help
-V current version
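As an illustration of how such an option set could be parsed, here is a minimal Python sketch using the standard `argparse` module. It is not the ECP implementation. Two things are assumptions: the flag `-l` for the sample list (the source does not show this flag) and the meaning of `-q` as the quality threshold (it appears in the example command but not in the parameter list). The program name and version string are placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the listed options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="ecp_sketch",
        description="Sketch of an exon-capture SNP pipeline command line",
    )
    parser.add_argument("-r", dest="reference", help="reference sequence path")
    parser.add_argument("-l", dest="sample_list",
                        help="sample list path (flag name is an assumption)")
    parser.add_argument("-o", dest="outdir", help="output folder path")
    parser.add_argument("-t", dest="target_dir", help="target region folder path")
    parser.add_argument("-i", dest="index", help="soap index file path")
    parser.add_argument("-f", dest="ref_stat", help="reference stat file path")
    parser.add_argument("-e", dest="exon_file", help="exon region file path")
    parser.add_argument("-q", dest="quality", type=int, default=20,
                        help="quality threshold (meaning is an assumption)")
    # Boolean switches from the parameter list.
    for flag, dest in (("-x", "snp"), ("-p", "pair_end"), ("-S", "cns"),
                       ("-a", "trim_adapter"), ("-L", "trim_linker")):
        parser.add_argument(flag, dest=dest, action="store_true")
    parser.add_argument("-V", action="version",
                        version="%(prog)s 0.0 (placeholder)")
    return parser


# Parse an argument list modeled on the example command; "sample.list" is
# a placeholder file name.
args = build_parser().parse_args(
    ["-l", "sample.list", "-o", "outdir", "-r", "hg18.fa",
     "-t", "capture_regions/", "-i", "hs.fa.index", "-p",
     "-f", "ref.fa.stat", "-x", "-q", "20", "-S"]
)
```

The `-h` option does not need to be declared because `argparse` adds it automatically.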
The data to be analyzed includes:
(1) Sequencing data: PE_1.fq, PE_2.fq (exon region sequencing results)
(2) Reference sequence: hg18.fa (species reference sequence)
(3) Exon coordinate information: target (absolute coordinates of exons in the genome)
(4) Sample list (initial sample information), for example:
1) sample name: FC61K8AAAXX (the present sample used here is allowed by the inventor of the present invention, and it should be understood by those skilled in the art that only one sample is selected here as the detection object, the implementation of the present embodiment is not dependent on the specific sample, and the sample used here does not limit the present invention in any way);
2) lane number:
100509_I82_FC61K8AAAXX_L2_HUMlrbXAADCAAPEI-6
3) sex: male
4) Sequencing data (the sequencing data corresponding to the sample is only used for illustration and does not limit the implementation of the technical scheme of the invention):
100509_I82_FC61K8AAAXX_L2_HUMlrbXAADCAAPEI-6_1.fq
100509_I82_FC61K8AAAXX_L2_HUMlrbXAADCAAPEI-6_2.fq
5) insert size: 100-200bp
Table one shows the results of the detection performed on the sample (FC61K8AAAXX), the analysis results concerning data yield & capture efficiency, and the like.
As shown in FIG. 4, in this embodiment the genome of a male (sample name: FC61K8AAAXX) is selected and sequenced to obtain the sequencing result of the exon region (reads file, Fq), which is purified to remove the linker and adapter sequences, yielding the high-throughput sequencing reads (Solexa reads). The Soap tool is then used to align the processed reads to the reference genome sequence (Fa), and the alignment result is de-duplicated and sorted to obtain unique reads. Statistical analysis and quality control are then carried out; specifically, soap.coverage is used to compute depth and coverage statistics for the target region.
After the depth and coverage statistics are computed with soap.coverage in the embodiment shown in FIG. 4, a target region depth distribution histogram is drawn. As shown in FIG. 5, the uniformity of coverage of the sample's target region is reflected by how closely the histogram agrees with a Poisson distribution; in particular, it indicates whether the target region of the sample was detected and whether the detected region is evenly covered.
FIG. 6 shows the target region cumulative depth distribution drawn after the depth and coverage statistics are computed with soap.coverage. As shown in FIG. 6, the cumulative depth distribution reflects the proportion of the total length made up of bases at a given depth value; in particular, it shows what percentage of bases reach at least a given depth.
FIG. 7 is the sequencing depth saturation curve drawn after the depth and coverage statistics are computed with soap.coverage in the embodiment shown in FIG. 4. As shown in FIG. 7, the saturation curve reflects the relationship between sequencing depth and coverage of the target region, for example the depth at which substantially all regions are covered, so that both the loss of coverage caused by insufficient depth and the data redundancy caused by excessive depth can be avoided.
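The depth statistics described above can be illustrated with a minimal sketch. It is not the soap.coverage implementation; it assumes the per-base depths of the target region are already available as a list of integers, and the function names and the total variation distance used to compare against a Poisson distribution are illustrative choices not taken from the source.

```python
import math
from collections import Counter


def depth_histogram(depths):
    """Map each depth value to the number of target bases observed at that depth."""
    return Counter(depths)


def cumulative_fraction(depths, min_depth):
    """Fraction of target bases covered at least `min_depth` times."""
    if not depths:
        return 0.0
    return sum(1 for d in depths if d >= min_depth) / len(depths)


def poisson_pmf(k, mean):
    """Poisson probability of depth k, computed in log space to avoid overflow."""
    if mean == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def poisson_distance(depths):
    """Total variation distance between the observed depth distribution and a
    Poisson distribution with the same mean (0 means identical, 1 means disjoint)."""
    if not depths:
        raise ValueError("no depths given")
    n = len(depths)
    mean = sum(depths) / n
    hist = depth_histogram(depths)
    max_depth = max(depths)
    diff = sum(abs(hist.get(k, 0) / n - poisson_pmf(k, mean))
               for k in range(max_depth + 1))
    # Poisson mass beyond the largest observed depth, where the observed share is zero.
    tail = max(0.0, 1.0 - sum(poisson_pmf(k, mean) for k in range(max_depth + 1)))
    return 0.5 * (diff + tail)


if __name__ == "__main__":
    example = [0, 1, 1, 2, 2, 2, 3, 3, 4, 2]  # made-up depths for ten bases
    print(sorted(depth_histogram(example).items()))
    print(cumulative_fraction(example, 2))
    print(poisson_distance(example))
```

A small distance suggests even coverage in the sense of FIG. 5, while `cumulative_fraction` evaluated over a range of depths gives the kind of curve shown in FIG. 6.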
From the sorted results, the SNP sites of interest are identified using the SNP detection tool SoapSNP, as shown in Table 2.
Table 2: Selected SNP site detection results
The detected SNP sites are screened and filtered using the quality value as the index, the coverage of the SNP sites in the exon regions is counted, and the optimal allele support depth and the suboptimal allele support depth of each SNP site are analyzed. FIG. 8 shows the SNP site heterozygosity scatter plot drawn after analyzing the optimal allele support depth and the suboptimal allele support depth of each SNP site in the embodiment shown in FIG. 4. As shown in FIG. 8, whether the sample is contaminated is judged from whether the global SNP heterozygosity rate shows a concentrated trend; for example, if the heterozygous site depth scatter plot is highly concentrated, i.e., the correlation coefficient approaches 1, and the slope deviates from 0.5, contamination is possible. Finally, the SNP sites remaining after screening and filtering can be compared with the information in the dbSNP database, and annotated (as shown in Table 3) and classified in combination with data from at least one of the CCDS, RefSeq and Ensembl databases.
Table 3: Selected SNP site annotation results
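The contamination check based on correlation and slope can be sketched as follows. The source does not specify which quantities are plotted on each axis or which cut-offs are used, so this sketch assumes that the x-axis is the total support depth (optimal plus suboptimal allele) and the y-axis is the suboptimal allele support depth at each heterozygous site; the thresholds `min_r` and `slope_tol` are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Return (slope, Pearson correlation) of an ordinary least-squares fit."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("depths have no variance")
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


def contamination_suspected(sites, min_r=0.9, slope_tol=0.1):
    """`sites` holds (optimal_depth, suboptimal_depth) pairs for heterozygous SNP sites.

    Flags the sample when the points are tightly concentrated (high correlation)
    but the slope is far from the 0.5 expected for balanced heterozygous sites.
    """
    totals = [best + second for best, second in sites]
    seconds = [second for _, second in sites]
    slope, r = fit_line(totals, seconds)
    return r >= min_r and abs(slope - 0.5) > slope_tol


if __name__ == "__main__":
    balanced = [(10, 10), (20, 20), (15, 15), (30, 30)]  # made-up clean sample
    skewed = [(18, 2), (36, 4), (27, 3), (54, 6)]        # made-up 10% minor fraction
    print(contamination_suspected(balanced))
    print(contamination_suspected(skewed))
```

In practice the cut-offs would need to be tuned on real data; the sketch only shows how the two indicators named in the text combine.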
The method for detecting polymorphic sites in a genomic target region provided by this embodiment of the invention is integrated into the software ECP, so that the whole detection process can run automatically while the I/O and memory resources of the computer are kept well under control. Pipes replace the traditional use of files for exchanging information between steps, and large in-memory data are handled by binary in-memory compression together with temporary binary files, so that in theory the system can be adapted to any hardware environment capable of running SOAP.
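The idea of compressing data in memory and falling back to a temporary binary file for large data can be illustrated with a minimal sketch. This is not how ECP is implemented, which the source does not describe; it simply uses Python's standard `zlib` and `tempfile.SpooledTemporaryFile`, and the function names and the 1 MiB memory limit are assumptions.

```python
import tempfile
import zlib


def store_compressed(records, max_memory_bytes=1 << 20):
    """Compress text records into a buffer that stays in memory until it grows
    past `max_memory_bytes`, after which it spills to a temporary binary file."""
    buffer = tempfile.SpooledTemporaryFile(max_size=max_memory_bytes, mode="w+b")
    compressor = zlib.compressobj()
    for record in records:
        # Records are newline-delimited, so they must not contain "\n" themselves.
        buffer.write(compressor.compress(record.encode("utf-8") + b"\n"))
    buffer.write(compressor.flush())
    buffer.seek(0)
    return buffer


def load_records(buffer, chunk_size=64 * 1024):
    """Stream the records back out without decompressing everything at once."""
    decompressor = zlib.decompressobj()
    pending = b""
    while True:
        chunk = buffer.read(chunk_size)
        if not chunk:
            break
        pending += decompressor.decompress(chunk)
        *lines, pending = pending.split(b"\n")
        for line in lines:
            yield line.decode("utf-8")
    pending += decompressor.flush()
    if pending:
        yield pending.decode("utf-8")


if __name__ == "__main__":
    rows = ["chr1\t100\tA", "chr1\t200\tC"]  # made-up records
    with store_compressed(rows) as buf:
        print(list(load_records(buf)))
```

Because the consumer reads from a generator, one step can hand records to the next without writing an intermediate file unless the memory limit is exceeded.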
FIG. 9 is a schematic structural diagram of a system for detecting polymorphic sites in a target region of a genome according to an embodiment of the present invention.
As shown in FIG. 9, a system 900 for detecting polymorphic sites in a genomic target region comprises: an exon sequencing result acquisition module 902, a redundancy removal and sorting module 904, a statistical analysis module 906, an SNP site detection module 908, an SNP site filtering module 910, and an SNP annotation module 912.
The exon sequencing result acquisition module 902 is configured to sequence and purify a human genomic DNA sample to obtain the exon region sequencing result, and to align the exon region sequencing result to the reference genome sequence to obtain an accurate alignment result. In an embodiment of the invention, the sequencing can use a high-throughput sequencing technology, such as Illumina GA Solexa sequencing. In an embodiment of the invention, the purified exon region sequencing result can be aligned to a reference genome (which can come from genome information published by a standardization organization) using the Soap tool developed in-house by the applicant (Shenzhen Huada Gene Technology Co., Ltd.) to obtain an accurate alignment result. The specific methods used by the Soap tool are described in: SOAP: short oligonucleotide alignment program; Ruiqiang Li, Yingrui Li, Karsten Kristiansen and Jun Wang; Bioinformatics; 2008, 24(5): 713-714; doi: 10.1093.
The redundancy removal and sorting module 904 is configured to remove duplicate information from the alignment result and to sort it. In one embodiment provided by the invention, after the duplicate information is removed, the alignment result is sorted by chromosome and coordinate, and the sorted result serves as the object to be processed in the SNP site detection step.
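The de-duplication and sorting step can be sketched as follows. The source does not define what counts as duplicate information, so this sketch assumes that two alignments are duplicates when they share chromosome, start coordinate and strand; the `Alignment` record and the chromosome ordering are illustrative, not the SOAP output format.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    chrom: str
    pos: int
    strand: str
    seq: str


def chrom_key(chrom):
    """Order numbered chromosomes numerically, then named ones (X, Y, M) alphabetically."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    return (0, int(name), "") if name.isdigit() else (1, 0, name)


def dedup_and_sort(alignments):
    """Keep the first alignment for each (chrom, pos, strand), then sort the
    survivors by chromosome and coordinate."""
    seen = set()
    unique = []
    for aln in alignments:
        key = (aln.chrom, aln.pos, aln.strand)
        if key not in seen:
            seen.add(key)
            unique.append(aln)
    return sorted(unique, key=lambda a: (chrom_key(a.chrom), a.pos))


if __name__ == "__main__":
    reads = [
        Alignment("chrX", 50, "+", "ACGT"),
        Alignment("chr2", 300, "+", "TTGA"),
        Alignment("chr2", 300, "+", "TTGA"),  # duplicate of the previous read
        Alignment("chr10", 7, "-", "GGCA"),
        Alignment("chr2", 120, "-", "CCAT"),
    ]
    for aln in dedup_and_sort(reads):
        print(aln)
```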
The statistical analysis module 906 is configured to compute depth and coverage statistics for the global target region, to check the sex of the sample using the sequencing depth of the target regions on the X and Y chromosomes, and to judge whether the sample is contaminated; and also to count the coverage of the filtered SNP sites, analyze the optimal allele support depth and the suboptimal allele support depth of each SNP site, and judge whether the sample is contaminated. The target region in the present invention may be a predetermined or known series of reference coordinates that identify the region of interest. The depth and coverage statistics for the target region can be computed with soap.coverage, a tool developed in-house by the applicant. The "support depth" of the optimal allele is the number of sequences at the current coordinate that carry the same genotype as the optimal genotype. If the analysis of the optimal allele support depth and the suboptimal allele support depth of the SNP sites shows that the global SNP heterozygosity rate exhibits a concentrated trend, the sample is judged to be contaminated.
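The notion of support depth can be made concrete with a short sketch. It assumes that the bases of all reads covering one coordinate are available as a string (a pileup column); this input format and the function name are assumptions for illustration, not the SoapSNP data model.

```python
from collections import Counter


def allele_support_depths(bases):
    """Return ((best_allele, depth), (second_allele, depth)) for one coordinate.

    `bases` is the string of read bases observed at the coordinate; characters
    other than A, C, G and T (for example N) are ignored.
    """
    counts = Counter(b for b in bases.upper() if b in "ACGT")
    ranked = counts.most_common(2)
    best = ranked[0] if ranked else (None, 0)
    second = ranked[1] if len(ranked) > 1 else (None, 0)
    return best, second


if __name__ == "__main__":
    print(allele_support_depths("AAAAGGGA"))  # made-up pileup column
    print(allele_support_depths("ccccNc"))
```

The two depths returned for each heterozygous site are the inputs to the heterozygosity scatter plot used in the contamination check.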
The SNP site detection module 908 is configured to find SNP sites from the sorted result. In one embodiment of the present invention, the SNP detection tool SoapSNP, developed in-house by the applicant, can be used to find the SNP sites of interest. The principles of the SoapSNP tool are described in: SNP detection for massively parallel whole-genome resequencing; Ruiqiang Li, Yingrui Li, Xiaodong Fang, Huanming Yang, Jian Wang, Karsten Kristiansen and Jun Wang; Genome Res.; 2009, 19: 1124-1132.
The SNP site filtering module 910 is configured to screen the detected SNP sites using the quality value as the index. In one embodiment of the present invention, the quality value threshold may be preset to 20 and used as the index for screening SNP sites. As will be apparent to those skilled in the art in light of the teachings herein, the screening criteria for SNP sites may vary with the particular sample, and a person skilled in the art can select an appropriate threshold according to the circumstances; the exemplary threshold given above is not intended to limit the present invention.
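The quality-based screening can be sketched in a few lines. The record layout is hypothetical, and whether a site with a quality value exactly equal to the threshold is kept is not stated in the source; this sketch assumes it is kept.

```python
def filter_by_quality(snps, min_quality=20):
    """Keep the SNP records whose quality value is at least `min_quality`."""
    return [snp for snp in snps if snp["quality"] >= min_quality]


if __name__ == "__main__":
    calls = [  # made-up SNP calls
        {"chrom": "chr1", "pos": 1000, "quality": 35},
        {"chrom": "chr1", "pos": 2000, "quality": 12},
        {"chrom": "chr2", "pos": 500, "quality": 20},
    ]
    for snp in filter_by_quality(calls):
        print(snp)
```

Passing a different `min_quality` reflects the point made above that the threshold can be adapted to the sample.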
The SNP annotation module 912 is configured to compare the filtered SNP sites with the information in the dbSNP database, and to annotate and classify the matched SNP sites in combination with data from at least one of the CCDS, RefSeq and Ensembl databases. The SNP sites found in the current sample are compared with the known SNP site information in the database to determine the SNP sites of gene mutations, so that the genes that may be affected can be found, annotated and classified.
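The comparison against known sites and gene annotations can be sketched as a simple lookup. This sketch does not query dbSNP, CCDS, RefSeq or Ensembl; it assumes that the known sites and gene intervals have already been loaded into memory, and the identifiers and gene names in the example are invented placeholders.

```python
def annotate_snps(snps, known_sites, gene_intervals):
    """Classify each (chrom, pos) SNP as known or novel and list overlapping genes.

    known_sites:    dict mapping (chrom, pos) to an identifier from a known-SNP table
    gene_intervals: list of (chrom, start, end, gene_name) with inclusive coordinates
    """
    annotated = []
    for chrom, pos in snps:
        genes = sorted(
            name
            for g_chrom, start, end, name in gene_intervals
            if g_chrom == chrom and start <= pos <= end
        )
        annotated.append({
            "chrom": chrom,
            "pos": pos,
            "known_id": known_sites.get((chrom, pos)),
            "class": "known" if (chrom, pos) in known_sites else "novel",
            "genes": genes,
        })
    return annotated


if __name__ == "__main__":
    known = {("chr1", 1000): "example_id_1"}   # placeholder identifier
    genes = [("chr1", 900, 1500, "GENE_A"),    # placeholder gene intervals
             ("chr2", 100, 400, "GENE_B")]
    for row in annotate_snps([("chr1", 1000), ("chr2", 450)], known, genes):
        print(row)
```

A linear scan over the gene intervals is adequate for a sketch; an interval index would be the natural choice for whole-exome data.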
The system for detecting polymorphic sites in a genomic target region provided by one embodiment of the invention performs SNP analysis on the sequencing of a specific region of the genome. The SNP detection results are highly accurate and are obtained quickly and at low cost, and the whole process can be automated: the raw sequencing data serve as the data source, high-quality SNP sites are generated automatically, and the SNP sites are annotated and classified.
FIG. 10 is a schematic structural diagram showing another embodiment of the system for detecting a polymorphic site in a target region of a genome according to the present invention.
As shown in FIG. 10, a system 1000 for detecting polymorphic sites in a genomic target region comprises: an exon sequencing result acquisition module 1002, a redundancy removal and sorting module 1004, a statistical analysis module 1006, an SNP site detection module 1008, an SNP site filtering module 1010 and an SNP annotation module 1012, wherein the redundancy removal and sorting module 1004, the statistical analysis module 1006, the SNP site detection module 1008, the SNP site filtering module 1010 and the SNP annotation module 1012 can be functional modules that are the same as or similar to the redundancy removal and sorting module 904, the statistical analysis module 906, the SNP site detection module 908, the SNP site filtering module 910 and the SNP annotation module 912 shown in FIG. 9. For the sake of brevity, they are not described again here.
As shown in FIG. 10, the exon sequencing result acquisition module 1002 further comprises a purification processing submodule 10021 and an alignment submodule 10022, wherein:
The purification processing submodule 10021 is configured to remove the linker sequences and adapter sequences that were introduced by the sequencing process and are contained in the sequencing result.
The alignment submodule 10022 is configured to align the exon region sequencing result to the reference genome sequence using the Soap tool to obtain an accurate alignment result.
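The purification step performed by submodule 10021 can be illustrated with a simplified trimming routine. The source does not describe the trimming algorithm, so this sketch assumes exact matching of a 3' adapter, either in full anywhere in the read or as a partial prefix at the read end; the adapter sequence in the example is a made-up placeholder, and real tools also tolerate mismatches.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter from `read` using exact matching.

    A full adapter occurrence truncates the read at that position. Otherwise,
    the longest adapter prefix of at least `min_overlap` bases that matches
    the end of the read is removed.
    """
    index = read.find(adapter)
    if index != -1:
        return read[:index]
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


if __name__ == "__main__":
    placeholder_adapter = "GATTACA"  # not a real sequencing adapter
    print(trim_adapter("TTTTGATTACAGG", placeholder_adapter))
    print(trim_adapter("TTTTGGGGCCCCGATT", placeholder_adapter))
    print(trim_adapter("TTTTGGGG", placeholder_adapter))
```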
FIG. 11 is a schematic structural diagram showing another embodiment of the system for detecting a polymorphic site in a target region of a genome according to the present invention.
As shown in FIG. 11, a system 1100 for detecting polymorphic sites in a genomic target region comprises: an exon sequencing result acquisition module 1102, a redundancy removal and sorting module 1104, a statistical analysis module 1106, an SNP site detection module 1108, an SNP site filtering module 1110 and an SNP annotation module 1112, wherein the exon sequencing result acquisition module 1102, the statistical analysis module 1106, the SNP site detection module 1108, the SNP site filtering module 1110 and the SNP annotation module 1112 may be functional modules that are the same as or similar to the exon sequencing result acquisition module 902, the statistical analysis module 906, the SNP site detection module 908, the SNP site filtering module 910 and the SNP annotation module 912 shown in FIG. 9. For the sake of brevity, they are not described again here.
As shown in FIG. 11, the redundancy removal and sorting module 1104 further comprises a redundancy removal submodule 11041 and a sorting submodule 11042, wherein:
The redundancy removal submodule 11041 is configured to remove duplicate information from the alignment result.
The sorting submodule 11042 is configured to sort the de-duplicated alignment result by chromosome and coordinate; the sorted result serves as the object to be processed by the SNP site detection module.
FIG. 12 is a schematic structural diagram showing another embodiment of the system for detecting a polymorphic site in a target region of a genome according to the present invention.
As shown in FIG. 12, a system 1200 for detecting polymorphic sites in a genomic target region comprises: an exon sequencing result acquisition module 1202, a redundancy removal and sorting module 1204, a statistical analysis module 1206, an SNP site detection module 1208, an SNP site filtering module 1210, and an SNP annotation module 1212, wherein the exon sequencing result acquisition module 1202, the redundancy removal and sorting module 1204, the SNP site detection module 1208, the SNP site filtering module 1210, and the SNP annotation module 1212 may be functional modules that are the same as or similar to the exon sequencing result acquisition module 902, the redundancy removal and sorting module 904, the SNP site detection module 908, the SNP site filtering module 910, and the SNP annotation module 912 shown in FIG. 9. For the sake of brevity, they are not described again here.
As shown in FIG. 12, the statistical analysis module 1206 further comprises a first statistical analysis submodule 12061 and a second statistical analysis submodule 12062, wherein:
The first statistical analysis submodule 12061 is configured to compute depth and coverage statistics for the global target region, to check the sex of the sample using the sequencing depth of the target regions on the X and Y chromosomes, and to judge whether the sample is contaminated. In one embodiment provided by the invention, the first statistical analysis submodule computes the depth and coverage statistics for the global target region with the tool soap.coverage and draws the corresponding distribution plots to reflect the uniformity of coverage of the sample's target region and the proportion of bases above a predetermined value; it checks the sex of the sample from the sequencing depth of the target regions on the X and Y chromosomes according to the analysis principle of a support vector machine; and it judges whether the sample is contaminated. If the sample was contaminated during the experimental phase, specific contamination information is given.
The second statistical analysis submodule 12062 is configured to count the coverage of the filtered SNP sites, analyze the optimal allele support depth and the suboptimal allele support depth of each SNP site, and judge whether the sample is contaminated. In one embodiment provided by the invention, if the analysis of the optimal allele support depth and the suboptimal allele support depth of the SNP sites shows that the global SNP heterozygosity rate exhibits a concentrated trend, the sample is judged to be contaminated.
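The sex check performed by the first statistical analysis submodule can be illustrated with a deliberately simplified stand-in. The source states that a support vector machine is used; the sketch below does not implement an SVM and instead applies a fixed cut-off to the ratio of mean Y-chromosome depth to mean X-chromosome depth. The cut-off of 0.3 and the function names are hypothetical.

```python
def mean_depth(depths):
    """Mean sequencing depth over a list of per-base depths (0.0 if empty)."""
    return sum(depths) / len(depths) if depths else 0.0


def infer_sex(x_depths, y_depths, min_y_to_x_ratio=0.3):
    """Guess the sample sex from target-region depths on chromosomes X and Y.

    A male sample carries one X and one Y, so the Y/X depth ratio is expected
    to be well above zero; a female sample should show almost no Y depth.
    """
    mean_x = mean_depth(x_depths)
    if mean_x == 0:
        return "undetermined"
    ratio = mean_depth(y_depths) / mean_x
    return "male" if ratio >= min_y_to_x_ratio else "female"


if __name__ == "__main__":
    print(infer_sex([30, 28, 32], [29, 31, 27]))  # made-up male-like depths
    print(infer_sex([60, 58, 62], [0, 1, 0]))     # made-up female-like depths
```

Comparing the inferred sex with the recorded sex of the donor is one way such a check can reveal a sample mix-up.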
The system for detecting polymorphic sites in a genomic target region performs detailed statistical analysis and quality control on the experimental sample, covering depth and coverage analysis, capture efficiency analysis, sex testing, SNP site heterozygosity consistency testing, and the like. This analysis greatly improves the accuracy and reliability of genome exon data analysis, and also allows corresponding erroneous information to be corrected appropriately.
The advantages of the method and system for detecting polymorphic sites in a genomic target region provided by the present invention will be apparent to those skilled in the art from the foregoing description of exemplary embodiments of the invention. Specifically:
1. The method and system for detecting polymorphic sites in a genomic target region provided by one embodiment of the invention perform SNP analysis on the sequencing of a specific region of the genome. The SNP detection results are highly accurate and are obtained quickly and at low cost, and the whole process can be automated: the raw sequencing data serve as the data source, high-quality SNP sites are generated automatically, and the SNP sites are annotated and classified.
2. The method and system for detecting polymorphic sites in a genomic target region provided by one embodiment of the invention are integrated into the software ECP, so that the whole detection process can run automatically while the I/O and memory resources of the computer are kept well under control. Pipes replace the traditional use of files for exchanging information between steps, and large in-memory data are handled by binary in-memory compression together with temporary binary files, so that in theory the system can be adapted to any hardware environment capable of running SOAP.
3. The method and system for detecting polymorphic sites in a genomic target region provided by one embodiment of the invention perform detailed statistical analysis on the experimental sample, covering depth and coverage analysis, capture efficiency analysis, sex testing, SNP site heterozygosity consistency testing, and the like. This analysis greatly improves the accuracy and reliability of genome exon data analysis, and also allows corresponding erroneous information to be corrected appropriately.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The functional modules and the division of the functional modules described in the present invention are only for illustrating the idea of the present invention, and those skilled in the art can freely change the division of the functional modules and the module structure thereof to realize the same function according to the teaching of the present invention and the requirement of practical application; the embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (5)

1. A method for detecting polymorphic sites in a genomic target region, said method comprising:
an exon sequencing result acquisition step: sequencing and purifying a human genomic DNA sample to obtain an exon region sequencing result; aligning the exon region sequencing result to a reference genome sequence to obtain an accurate alignment result;
a redundancy removal and sorting step: removing duplicate information from the alignment result and sorting it;
statistical analysis step I: computing depth and coverage statistics for the global target region, and checking the sex of the sample using the sequencing depth of the target regions on the X chromosome and the Y chromosome; judging whether the sample is contaminated;
an SNP site detection step: finding SNP sites from the sorted result;
an SNP site filtering step: screening the detected SNP sites using the quality value as the index;
statistical analysis step II: counting the coverage of the filtered SNP sites, analyzing the optimal allele support depth and the suboptimal allele support depth of each SNP site, and judging whether the sample is contaminated;
an SNP annotation step: comparing the filtered SNP sites with the information in the dbSNP database, and annotating and classifying the matched SNP sites in combination with data from at least one of the CCDS, RefSeq and Ensembl databases.
2. The method according to claim 1, wherein in the exon sequencing result acquisition step, the purification removes the linker sequences and adapter sequences that were introduced by the sequencing process and are contained in the sequencing result; and
the exon region sequencing result is aligned to the reference genome sequence using the Soap tool to obtain an accurate alignment result.
3. The method of claim 1, wherein in the redundancy removal and sorting step, the alignment result is sorted by chromosome and coordinate after the duplicate information is removed, and the sorted result is used as the object to be processed in the SNP site detection step.
4. The method according to claim 1, wherein in statistical analysis step I, depth and coverage statistics are computed for the global target region using the tool soap.coverage, and the corresponding distribution plots are drawn to reflect the uniformity of coverage of the target region of the sample and the proportion of bases above a predetermined value;
the sex of the sample is checked from the sequencing depth of the target regions on the X chromosome and the Y chromosome according to the analysis principle of a support vector machine; whether the sample is contaminated is judged;
if the sample was contaminated during the experimental phase, specific contamination information is given.
5. The method of claim 1, wherein in statistical analysis step II, the sample is judged to be contaminated if the analysis of the optimal allele support depth and the suboptimal allele support depth of the SNP sites shows that the global SNP heterozygosity rate exhibits a concentrated trend.
HK11101668.6A 2011-02-21 The method of detecting polymorphic sites in genomic target region HK1147528B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010102704646A CN101914628B (en) 2010-09-02 2010-09-02 Method and system for detecting polymorphism locus of genome target region

Publications (2)

Publication Number Publication Date
HK1147528A1 HK1147528A1 (en) 2011-08-12
HK1147528B HK1147528B (en) 2013-08-02

Also Published As

Publication number Publication date
CN101914628B (en) 2013-01-09
CN101914628A (en) 2010-12-15
WO2012027958A1 (en) 2012-03-08

Legal Events

Date Code Title Description
PC Patent ceased (i.e. patent has lapsed due to the failure to pay the renewal fee)

Effective date: 20160902

ARF Application filed for restoration

Effective date: 20170411

ARG Restoration of standard patent granted

Effective date: 20171009