WO2018214010A1 - Method, device, and storage medium for detecting mutation on the basis of sequencing data - Google Patents
Method, device, and storage medium for detecting mutation on the basis of sequencing data
- Publication number
- WO2018214010A1 (application PCT/CN2017/085448)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- depth
- variation
- variant
- population
- variation information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Definitions
- the present invention relates to the field of bioinformatics technology, and in particular, to a mutation detection method, apparatus and storage medium based on sequencing data.
- To recover variation in low-depth regions, the prevailing approach is genotype imputation based on genetic linkage: high-depth variant sites serve as marker sites, and an existing reference gene set (genetic information collected from a given population) is consulted.
- Unknown variation information is then inferred from the linkage relationship between known marker sites and adjacent unknown sites.
- This approach relies on genetic linkage for inference. The unknown sites it fills in are in fact also covered during each individual's sequencing, but because their depth is low they have been ignored, and those data are wasted.
- Genotype imputation by inference does not consider the actual bases observed in sequencing.
- Relying solely on the genetic information in a reference gene set cannot recover variation unique to the individual, so both accuracy and the low-depth data are sacrificed.
- Because genotype imputation requires known high-depth variant site information, it imposes a requirement on the density of variation information: if the known variant sites are too few or too sparsely distributed, imputation is difficult to carry out. In addition, genotype imputation is time-consuming and therefore slow to deliver results.
- the invention provides a mutation detection method, device and storage medium based on sequencing data, which can directly perform accurate mutation detection by using low-depth sequencing data of a group, and improve data utilization.
- an embodiment provides a variation detection method based on sequencing data, including:
- Sequencing data of a plurality of individuals from the same population are aligned to a reference genome and subjected to variation detection to obtain the alignment positions of the reads and the variation information;
- for each variant site, the total variant-site depth and the variant base-type depth of all individuals in the population are summed separately, and the minor allele frequency and/or Hardy-Weinberg equilibrium of each variant site in the population is then calculated to obtain the accumulated population variation information;
- the variation information of each individual is filtered according to the filtered accumulated population variation information to obtain the final variation information of each individual.
- an embodiment provides a variation detection apparatus based on sequencing data, including:
- an alignment and variation detection device, configured to align the sequencing data of a plurality of individuals from the same population to a reference genome and perform variation detection to obtain the alignment positions of the reads and the variation information;
- a population variation information filtering device, configured to plot the total variant-site depth and/or the variant base-type depth against dbSNP%, and to filter the total variant-site depth and/or the variant base-type depth, together with the minor allele frequency and/or Hardy-Weinberg equilibrium, against set filtering thresholds so as to remove values below those thresholds, yielding the filtered accumulated population variation information;
- the individual variation information filtering device is configured to filter the variation information of each individual according to the accumulated result of the filtered population variation information to obtain final variation information of each individual.
- an embodiment provides a variation detection apparatus based on sequencing data, including:
- Sequencing data of a plurality of individuals from the same population are aligned to a reference genome and subjected to variation detection to obtain the alignment positions of the reads and the variation information;
- for each variant site, the total variant-site depth and the variant base-type depth of all individuals in the population are summed separately, and the minor allele frequency and/or Hardy-Weinberg equilibrium of each variant site in the population is then calculated to obtain the accumulated population variation information;
- the variation information of each individual is filtered according to the filtered accumulated population variation information to obtain the final variation information of each individual.
- an embodiment provides a computer readable storage medium comprising a program executable by a processor to implement the following method:
- Sequencing data of a plurality of individuals from the same population are aligned to a reference genome and subjected to variation detection to obtain the alignment positions of the reads and the variation information;
- for each variant site, the total variant-site depth and the variant base-type depth of all individuals in the population are summed separately, and the minor allele frequency and/or Hardy-Weinberg equilibrium of each variant site in the population is then calculated to obtain the accumulated population variation information;
- the variation information of each individual is filtered according to the filtered accumulated population variation information to obtain the final variation information of each individual.
- The invention can use the low-depth sequencing data of a population to perform accurate variation detection directly, without linkage inference, which improves data utilization; the variation information obtained in low-depth regions can also serve as marker sites to help improve the results of existing linkage inference.
- FIG. 1 is a flow chart of a method for detecting variation based on sequencing data according to an embodiment of the present invention
- FIG. 2 is a structural block diagram of a variation detecting apparatus based on sequencing data according to an embodiment of the present invention
- each curve represents a different chromosome
- the bold curve represents the average value
- the gray portion represents the standard deviation
- the traditional method is directed to a single sample.
- The method of the present invention pools the low-depth variation of the population as the variation information of a single sample, thereby converting a low-depth population into a virtual high-depth individual. Variation detection is then performed, and the accurate detection results are split back to obtain the variation information of each individual.
- a method for detecting a variation based on sequencing data includes the following steps:
- Step S101: Align the sequencing data of a plurality of individuals from the same population to the reference genome and perform variation detection to obtain the alignment positions of the reads and the variation information.
- The sequencing data belong to the same population, for example the same species (human, pig, etc.). The data may come off the sequencer in the same batch, or in different batches as long as they belong to the same population.
- the format of the sequencing data is, for example, the Fastq format.
- the reference genome can be published genomic data for each population (species), for example, for a human, the reference genome can be the human reference genome hg19.
- The alignment software can be a commonly used tool such as BWA.
- The variation detection software can be a commonly used tool such as GATK.
- The obtained variation information includes elements such as the chromosome on which the variation is located, its position, and the variant base type.
- The embodiments of the present invention impose no strict requirement on sample size, but a larger sample size helps find more variant sites.
- The average sequencing depth of the sequencing data is not particularly limited. The method of the embodiments is particularly suitable for low-depth sequencing data, such as data with an average depth of 1× to 15×. In one embodiment of the invention, the average sequencing depth of the sequencing data is 3.5×.
- Step S102: Based on the obtained alignment positions and variation information, for each variant site, sum the total variant-site depth and the variant base-type depth across all individuals in the population, then calculate the minor allele frequency (MAF) and/or the Hardy-Weinberg equilibrium (HWE) of each variant site in the population to obtain the accumulated population variation information. The "accumulated population variation information" covers each variant site.
- "Total variant-site depth" refers to the number of sequencing reads, from all individuals, that cover the variant site.
- "Variant base-type depth" refers to the number of sequencing reads, from all individuals, that carry a particular base type at the variant site. For a given variant site, the total variant-site depth is therefore the sum of the variant base-type depths of its base types. For example, if a variant site has two base types, A and T, and among the sequencing reads of all individuals 100 carry A at the site and 100 carry T, then the variant base-type depth of each of A and T is 100 and the total variant-site depth is 200.
- The term "minor allele frequency" refers to the frequency of the less common allele in a given population. The term "Hardy-Weinberg equilibrium" refers to the state in which allele frequencies and genotype frequencies remain stable from generation to generation, that is, genetic equilibrium is maintained.
- The method of the embodiments of the present invention is particularly suitable for detecting single nucleotide polymorphism (SNP) and insertion/deletion (Ins/Del) variation. In a preferred embodiment of the invention, when the total variant-site depth and the variant base-type depth of all individuals in the population are summed, three-base and other multi-base variations are removed and only single nucleotide polymorphism variations are retained.
- SNP single nucleotide polymorphism
- Ins/Del insertion/deletion
- Step S103: Plot the total variant-site depth and/or the variant base-type depth against dbSNP%, and filter the total variant-site depth and/or the variant base-type depth, together with the minor allele frequency and/or Hardy-Weinberg equilibrium, against set filtering thresholds so as to remove values below those thresholds, yielding the filtered accumulated population variation information.
- dbSNP% is the proportion of detected variant sites that are present in the Single Nucleotide Polymorphism Database (dbSNP); it is commonly used to gauge the accuracy of detected SNPs.
- The filtering thresholds include a threshold on the total variant-site depth and/or a threshold on the variant base-type depth, and a threshold on the MAF and/or a threshold on the HWE. That is, the key indicators are the variant base-type depth (or the total variant-site depth), the MAF value, and/or the HWE value.
- The variant base-type depth of a base type is directly related to the credibility of the variation. HWE and MAF are often used to remove variant sites of low reliability.
- The filtering threshold for the variant base-type depth is 30× and the filtering threshold for the MAF is 0.05. That is, a variation whose variant base-type depth is at least 30× and whose MAF is greater than 0.05 is considered statistically significant.
- The depth threshold (on the variant base-type depth or the total variant-site depth) may be adjusted as appropriate according to the actual depth density distribution.
- The "actual depth density distribution" refers to the distribution of sequencing depth across the different variant sites.
- The sample size may be set according to the average depth of the data. For example, at an average depth of 1× the sample size should be 30 (30 samples at 1× pool to about 30×), so that the variant base-type depth can meet the set filtering condition of 30× or more. A larger sample size helps detect more variant sites.
- Step S104: Filter the variation information of each individual according to the filtered accumulated population variation information to obtain the final variation information of each individual.
- Filtering the variation information of each individual specifically means: if a record in the individual's original variation information has a chromosome, position, and variant base type that match the filtered accumulated population variation information, the record is retained; otherwise it is filtered out. In addition, if a variant site carries more than two variations, the variant site is filtered out.
- the final variation information of each individual can be obtained.
- The accuracy of the variation can then be assessed by dbSNP%, after which imputation and research analysis can follow.
- The method of the present invention improves the utilization of data in low-depth regions of high-throughput sequencing data, turning otherwise neglected data into usable data and significantly improving the accuracy of detected SNPs. The method is simple to use, improves both accuracy and data utilization, and can also be added as a step preceding traditional variation imputation to improve traditional methods.
- the program may be stored in a computer readable storage medium, and the storage medium may include: a read only memory, a random access memory, a magnetic disk, an optical disk, a hard disk, etc.
- the computer executes the program to implement the above functions.
- the program is stored in the memory of the device, and when the program in the memory is executed by the processor, all or part of the above functions can be realized.
- The program may also be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and may be saved to the memory of the local device by downloading or copying, or used to update the system of the local device.
- an embodiment of the present invention further provides a mutation detecting device based on sequencing data, as shown in FIG. 2, including:
- an alignment and variation detection device 201, configured to align the sequencing data of a plurality of individuals from the same population to the reference genome and perform variation detection to obtain the alignment positions of the reads and the variation information;
- a summing and calculation device 202, configured to sum, for each variant site and based on the alignment positions and variation information, the total variant-site depth and the variant base-type depth of all individuals in the population, and then to calculate the minor allele frequency and/or Hardy-Weinberg equilibrium of each variant site in the population to obtain the accumulated population variation information;
- a population variation information filtering device 203, configured to plot the total variant-site depth and/or the variant base-type depth against dbSNP%, and to filter the total variant-site depth and/or the variant base-type depth, together with the minor allele frequency and/or Hardy-Weinberg equilibrium, against set filtering thresholds so as to remove values below those thresholds, yielding the filtered accumulated population variation information;
- an individual variation information filtering device 204, configured to filter the variation information of each individual according to the filtered accumulated population variation information to obtain the final variation information of each individual.
- Another embodiment of the present invention further provides a variation detecting apparatus based on sequencing data, including:
- The sequencing data of a plurality of individuals from the same population are aligned to the reference genome and subjected to variation detection to obtain the alignment positions of the reads and the variation information; based on the alignment positions and variation information, for each variant site, the total variant-site depth and the variant base-type depth of all individuals in the population are summed separately, and the minor allele frequency and/or Hardy-Weinberg equilibrium of each variant site in the population is then calculated to obtain the accumulated population variation information; the total variant-site depth and/or the variant base-type depth is plotted against dbSNP%, and the total variant-site depth and/or the variant base-type depth, together with the minor allele frequency and/or Hardy-Weinberg equilibrium, are filtered against set filtering thresholds so as to remove values below those thresholds, yielding the filtered accumulated population variation information; the variation information of each individual is filtered according to the filtered accumulated population variation information to obtain the final variation information of each individual.
- The samples are capture-sequencing data for the MHC region (a 4.9 Mb region on chromosome 6) from 105 human individuals. About 50% of the data falls within the MHC region; the remaining 50% or so falls elsewhere in the genome. Relatively speaking, this remaining portion is low-depth data.
- The method of the present invention is used to detect variation in this 50% of low-depth data. To evaluate the accuracy of the results, the 105 samples were also subjected to exome sequencing.
- In the exome sequencing data the exon regions are at high depth. Comparing the low-depth data with the exome sequencing data within the exon regions gives the consistency of the variation calls, which is used to assess the accuracy of the method of the invention.
- This embodiment studies the low-depth regions, whose average depth in this embodiment is 3.5×.
- The total variant-site depth and the variant base-type depth of the 105 MHC samples were summed separately, three-base and other multi-base variations were removed, and the MAF value of each variant site in the population was calculated, giving the accumulated population variation information of the 105 samples.
- Fig. 3 is a graph showing the relationship between the depth of the mutant base type and the dbSNP% in the present embodiment.
- The variation results of each individual are filtered. If a record in the individual's original variation information has a chromosome, position, and variant base type that match the filtered accumulated population variation information, the record is retained; otherwise it is filtered out. In addition, if a variant site carries more than two variations, the variant site is filtered out.
- Evaluating the accuracy of the variation by dbSNP% shows that dbSNP% reaches 90.3% (93% when the Y chromosome and chromosome 9 are excluded on account of their coverage). In addition, consistency with the corresponding high-depth exome sequencing data reaches 96.45%.
- The results are shown in Table 2. The numbers in the table are the sample numbers of the 105 samples. "Unprocessed" denotes the consistency between results obtained without the method of the invention and the corresponding high-depth exome sequencing data; "this method" denotes the consistency between results obtained with the method of the invention and the corresponding high-depth exome sequencing data.
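The two evaluation figures used in this embodiment, dbSNP% and consistency with exome calls, can be illustrated with a short Python sketch. It is not the applicant's evaluation code: the function names, the use of `(chromosome, position)` tuples as site keys, and the choice to measure consistency only over sites called in both data sets are all assumptions made for illustration.

```python
def dbsnp_percent(detected_sites, dbsnp_sites):
    """Percentage of detected variant sites that are present in a dbSNP site set.

    Both arguments are sets of (chromosome, position) tuples (assumed layout).
    """
    if not detected_sites:
        return 0.0
    return 100.0 * len(detected_sites & dbsnp_sites) / len(detected_sites)


def consistency_percent(low_depth_calls, exome_calls):
    """Percentage of shared sites whose variant base type agrees.

    Both arguments map (chromosome, position) -> variant base type.
    Restricting the comparison to shared sites is an assumption.
    """
    shared = low_depth_calls.keys() & exome_calls.keys()
    if not shared:
        return 0.0
    agree = sum(1 for site in shared if low_depth_calls[site] == exome_calls[site])
    return 100.0 * agree / len(shared)


# Toy data, invented for the example.
detected = {("chr6", 1), ("chr6", 2), ("chr6", 3), ("chr6", 4)}
known = {("chr6", 1), ("chr6", 2), ("chr6", 3)}
low = {("chr6", 1): "T", ("chr6", 2): "G", ("chr6", 3): "A"}
exome = {("chr6", 1): "T", ("chr6", 2): "C", ("chr6", 9): "A"}

pct_known = dbsnp_percent(detected, known)
pct_consistent = consistency_percent(low, exome)
```

In practice the site sets would be read from variant call files and a dbSNP release rather than written inline.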
Landscapes
- Chemical & Material Sciences (AREA)
- Organic Chemistry (AREA)
- Life Sciences & Earth Sciences (AREA)
- Zoology (AREA)
- Wood Science & Technology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Microbiology (AREA)
- Biochemistry (AREA)
- Physics & Mathematics (AREA)
- Molecular Biology (AREA)
- Biotechnology (AREA)
- Biophysics (AREA)
- Analytical Chemistry (AREA)
- Immunology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Genetics & Genomics (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Apparatus Associated With Microorganisms And Enzymes (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
The present invention relates to the field of bioinformatics, and in particular to a variation detection method, device, and storage medium based on sequencing data.
Although the cost of sequencing continues to fall, sequencing strategies have yet to see a qualitative breakthrough, and high-depth sequencing data remain expensive. A variation detection method that works at low depth would therefore effectively improve the utilization of existing low-depth whole-genome sequencing data. At present, low-depth variation detection is still performed with conventional software, and the resulting variation information is often filtered out during quality control because of insufficient depth.
To obtain the variation information of these low-depth regions, the prevailing approach is genotype imputation based on genetic linkage: high-depth variant sites serve as marker sites, an existing reference gene set (genetic information collected from a given population) is consulted, and unknown variation information is inferred from the linkage relationship between known marker sites and adjacent unknown sites. This approach relies on genetic linkage for inference. The unknown sites it fills in are in fact also covered during each individual's sequencing, but because their depth is low they have been ignored, and those data are wasted. Genotype imputation by inference does not consider the actual bases observed in sequencing, and relying solely on the genetic information in a reference gene set cannot recover variation unique to the individual, so both accuracy and the low-depth data are sacrificed. Moreover, because genotype imputation requires known high-depth variant site information, it imposes a requirement on the density of variation information: if the known variant sites are too few or too sparsely distributed, imputation is difficult to carry out. In addition, genotype imputation is time-consuming and therefore slow to deliver results.
Summary of the invention
The invention provides a variation detection method, device, and storage medium based on sequencing data, which can use the low-depth sequencing data of a population to perform accurate variation detection directly and improve data utilization.
According to a first aspect, an embodiment provides a variation detection method based on sequencing data, including:
aligning sequencing data of a plurality of individuals from the same population to a reference genome and performing variation detection to obtain the alignment positions of the reads and the variation information;
based on the alignment positions and variation information, for each variant site, summing separately the total variant-site depth and the variant base-type depth of all individuals in the population, and then calculating the minor allele frequency and/or Hardy-Weinberg equilibrium of each variant site in the population to obtain the accumulated population variation information;
plotting the total variant-site depth and/or the variant base-type depth against dbSNP%, and filtering the total variant-site depth and/or the variant base-type depth, together with the minor allele frequency and/or Hardy-Weinberg equilibrium, against set filtering thresholds so as to remove values below those thresholds, yielding the filtered accumulated population variation information;
filtering the variation information of each individual according to the filtered accumulated population variation information to obtain the final variation information of each individual.
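The summing, population-level filtering, and per-individual filtering steps of the method can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the record layout `(chrom, pos, alt, alt_depth, total_depth)` and all function names are hypothetical, the allele frequency is approximated by the pooled read fraction (the text does not say how the frequency is computed), and the thresholds 30× and 0.05 are the values given for one embodiment. The Hardy-Weinberg check and the depth-versus-dbSNP% plot are omitted.

```python
from collections import defaultdict

MIN_ALT_DEPTH = 30   # pooled variant base-type depth threshold (30x in one embodiment)
MIN_MAF = 0.05       # minor allele frequency threshold (0.05 in one embodiment)


def pool_depths(calls_by_sample):
    """Sum variant base-type depth and total site depth over all individuals.

    calls_by_sample maps a sample name to a list of records
    (chrom, pos, alt, alt_depth, total_depth); this layout is an assumption.
    """
    pooled = defaultdict(lambda: [0, 0])  # key -> [alt_depth, total_depth]
    for calls in calls_by_sample.values():
        for chrom, pos, alt, alt_depth, total_depth in calls:
            entry = pooled[(chrom, pos, alt)]
            entry[0] += alt_depth
            entry[1] += total_depth
    return pooled


def filter_population(pooled):
    """Keep (chrom, pos, alt) keys that pass the depth and MAF thresholds."""
    kept = set()
    for key, (alt_depth, total_depth) in pooled.items():
        if total_depth == 0:
            continue
        freq = alt_depth / total_depth       # assumption: read fraction as frequency
        maf = min(freq, 1.0 - freq)
        if alt_depth >= MIN_ALT_DEPTH and maf > MIN_MAF:
            kept.add(key)
    return kept


def filter_individual(calls, kept):
    """Retain an individual's records that match the filtered population result.

    Sites carrying more than two distinct variant base types are dropped.
    """
    alts_at_site = defaultdict(set)
    for chrom, pos, alt, _alt_depth, _total_depth in calls:
        alts_at_site[(chrom, pos)].add(alt)
    return [
        call for call in calls
        if (call[0], call[1], call[2]) in kept
        and len(alts_at_site[(call[0], call[1])]) <= 2
    ]


# Toy population: 40 samples share one site; only 5 also carry a second site.
calls_by_sample = {
    f"sample{i}": [("chr6", 100, "T", 1, 2)]
    + ([("chr6", 200, "G", 1, 3)] if i < 5 else [])
    for i in range(40)
}
pooled = pool_depths(calls_by_sample)
kept = filter_population(pooled)
final = {name: filter_individual(calls, kept) for name, calls in calls_by_sample.items()}
```

In the toy data, no single sample supports the shared site with more than one variant read, yet the pooled depth of 40 clears the threshold; the second site, seen in only five samples, does not.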
According to a second aspect, an embodiment provides a variation detection device based on sequencing data, including:
an alignment and variation detection device, configured to align the sequencing data of a plurality of individuals from the same population to a reference genome and perform variation detection to obtain the alignment positions of the reads and the variation information;
a summing and calculation device, configured to sum separately, for each variant site and based on the alignment positions and variation information, the total variant-site depth and the variant base-type depth of all individuals in the population, and then to calculate the minor allele frequency and/or Hardy-Weinberg equilibrium of each variant site in the population to obtain the accumulated population variation information;
a population variation information filtering device, configured to plot the total variant-site depth and/or the variant base-type depth against dbSNP%, and to filter the total variant-site depth and/or the variant base-type depth, together with the minor allele frequency and/or Hardy-Weinberg equilibrium, against set filtering thresholds so as to remove values below those thresholds, yielding the filtered accumulated population variation information;
an individual variation information filtering device, configured to filter the variation information of each individual according to the filtered accumulated population variation information to obtain the final variation information of each individual.
According to a third aspect, an embodiment provides a variation detection device based on sequencing data, including:
a memory for storing a program;
a processor configured to implement the following method by executing the program stored in the memory:
aligning sequencing data of a plurality of individuals from the same population to a reference genome and performing variation detection to obtain the alignment positions of the reads and the variation information;
based on the alignment positions and variation information, for each variant site, summing separately the total variant-site depth and the variant base-type depth of all individuals in the population, and then calculating the minor allele frequency and/or Hardy-Weinberg equilibrium of each variant site in the population to obtain the accumulated population variation information;
plotting the total variant-site depth and/or the variant base-type depth against dbSNP%, and filtering the total variant-site depth and/or the variant base-type depth, together with the minor allele frequency and/or Hardy-Weinberg equilibrium, against set filtering thresholds so as to remove values below those thresholds, yielding the filtered accumulated population variation information;
filtering the variation information of each individual according to the filtered accumulated population variation information to obtain the final variation information of each individual.
According to a fourth aspect, an embodiment provides a computer-readable storage medium comprising a program, the program being executable by a processor to implement the following method:
aligning sequencing data from multiple individuals of the same population to a reference genome and performing variation detection, to obtain the alignment positions of the reads and variation information;
based on the alignment positions and variation information, for each variation site, separately summing the total variation site depth and the variant base type depth over all individuals in the population, and then calculating the minor allele frequency and/or Hardy-Weinberg equilibrium of each variation site in the population, to obtain an accumulated population variation information result;
plotting the relationship between the total variation site depth and/or variant base type depth and dbSNP%, and filtering the total variation site depth and/or variant base type depth, together with the minor allele frequency and/or Hardy-Weinberg equilibrium, against set filtering thresholds to remove values below the thresholds, to obtain a filtered accumulated population variation information result;
filtering the variation information of each individual according to the filtered accumulated population variation information result, to obtain the final variation information of each individual.
The present invention can use low-depth sequencing data from a population to perform accurate variation detection directly, without relying on linkage-based inference, which improves data utilization. In addition, the variation information obtained in low-depth regions can serve as marker sites to help improve existing linkage-based inference results.
FIG. 1 is a flowchart of a variation detection method based on sequencing data according to an embodiment of the present invention;
FIG. 2 is a structural block diagram of a variation detection device based on sequencing data according to an embodiment of the present invention;
FIG. 3 is a plot of variant base type depth against dbSNP% in an embodiment of the present invention, in which each curve represents a different chromosome, the bold curve represents the mean, and the gray area represents the standard deviation.
The present invention is described in further detail below through specific embodiments with reference to the accompanying drawings. In the following embodiments, many details are given so that the invention can be better understood. However, those skilled in the art will readily recognize that some of these features may be omitted in different situations, or may be replaced by other elements, materials, or methods. In some cases, certain operations related to the invention are not shown or described in the specification. This is to avoid burying the core of the invention under excessive description; a detailed account of these operations is not necessary for those skilled in the art, who can fully understand them from the description in the specification and from general technical knowledge in the field.
In addition, the features, operations, or characteristics described in the specification may be combined in any suitable manner to form various embodiments. The steps or actions in the method description may also be reordered or adjusted in ways that are obvious to those skilled in the art. Therefore, the various orders in the specification and drawings serve only to describe a particular embodiment clearly and do not imply a required order, unless it is otherwise stated that a particular order must be followed.
Conventional methods for variation detection in low-depth sequencing data all operate on a single sample. The method of the present invention instead merges the low-depth variations of a population and treats them as the variation information of a single sample, thereby converting a low-depth population into a virtual high-depth individual. Variation detection is then performed on this virtual individual, and once accurate detection results are obtained, they are split back into the variation information of each individual.
As shown in FIG. 1, a variation detection method based on sequencing data according to an embodiment of the present invention comprises the following steps:
Step S101: Align sequencing data from multiple individuals of the same population to a reference genome and perform variation detection, to obtain the alignment positions of the reads and variation information.
In embodiments of the present invention, the sequencing data belong to the same population, for example the same species (human, pig, and so on). The data may be raw sequencer output from a single batch, or raw output from different batches that nevertheless belong to the same population. The sequencing data may be in FASTQ format, for example. The reference genome may be the published genome data for the population (species) in question; for humans, for example, it may be the human reference genome hg19. The alignment software may be a commonly used tool such as BWA, and the variation detection software may be a commonly used tool such as GATK. The resulting variation information includes elements such as the chromosome, the position, and the variant base type of each variation. The embodiments impose no strict requirement on the number of individuals (the sample size), but a larger sample size makes it easier to find more variation sites.
In embodiments of the present invention, the average sequencing depth of the sequencing data is not particularly limited. The method is particularly suited to low-depth sequencing data, for example data with an average depth of 1× to 15×. In one embodiment of the present invention, the average sequencing depth of the sequencing data is 3.5×.
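As a rough illustration of step S101, the sketch below only assembles command lines for one sample and does not run them. The embodiment names BWA and GATK without giving subcommands or versions, so the choice of `bwa mem`, `samtools sort` and `samtools index`, and GATK4 `HaplotypeCaller`, as well as all file names, the thread count, and the lowered calling-confidence value, are assumptions.

```python
from typing import List


def build_s101_commands(sample: str, fastq1: str, fastq2: str,
                        reference: str = "hg19.fa",
                        call_conf: float = 10.0) -> List[List[str]]:
    """Assemble (but do not run) alignment and calling commands for one sample.

    Assumptions: paired-end FASTQ input, an indexed reference FASTA, and
    GATK4 HaplotypeCaller as the caller. All file names are hypothetical.
    """
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    vcf = f"{sample}.vcf.gz"
    # Read group header; bwa expands the literal "\t" into tabs.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}"
    return [
        # Align reads to the reference genome.
        ["bwa", "mem", "-t", "4", "-R", read_group, "-o", sam,
         reference, fastq1, fastq2],
        # Coordinate-sort and index the alignments.
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
        # Call variants with a lowered confidence threshold so that more
        # candidate variations are retained for the population-level filter.
        ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam, "-O", vcf,
         "--standard-min-confidence-threshold-for-calling", str(call_conf)],
    ]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` once the tools and input files are in place.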
Step S102: Based on the alignment positions and variation information obtained, for each variation site, separately sum the total variation site depth and the variant base type depth over all individuals in the population, and then calculate the minor allele frequency (MAF) and/or Hardy-Weinberg equilibrium (HWE) of each variation site in the population, to obtain an accumulated population variation information result. The "accumulated population variation information result" comprises, for each variation site, the summed total variation site depth, the summed variant base type depth, and the MAF and/or HWE.
As used herein, the term "total variation site depth" refers to the number of sequencing reads from all individuals that cover the variation site, and the term "variant base type depth" refers to the number of sequencing reads from all individuals that cover the variation site and show a particular base type there. For a given variation site, the total variation site depth is therefore the sum of the variant base type depths of all base types. For example, if a variation site has two base types, A and T, and among the sequencing reads of all individuals 100 show A at the site and 100 show T, then the variant base type depth is 100 for each of A and T, and the total variation site depth is 200.
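The summation over individuals can be sketched as follows. The input layout (one dictionary per individual, mapping a site to per-base read counts) and the function name `pool_depths` are assumptions made for illustration, not a format prescribed by the description.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# A variation site is identified by (chromosome, position).
SiteKey = Tuple[str, int]


def pool_depths(individuals: Iterable[Dict[SiteKey, Dict[str, int]]]):
    """Sum per-base read counts over all individuals.

    Each individual maps a site to {base: number of reads showing that base}.
    Returns (base_depth, total_depth): base_depth[site][base] is the pooled
    variant base type depth, total_depth[site] the pooled total site depth.
    """
    base_depth: Dict[SiteKey, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for calls in individuals:
        for site, counts in calls.items():
            for base, n in counts.items():
                base_depth[site][base] += n
    # The total depth of a site is the sum over its base types.
    total_depth = {site: sum(c.values()) for site, c in base_depth.items()}
    return base_depth, total_depth


# Fifty hypothetical individuals, each with two A reads and two T reads at
# one site, reproduce the A/T example: 100 for each base, 200 in total.
site = ("chr1", 1000)
base_depth, total_depth = pool_depths({site: {"A": 2, "T": 2}} for _ in range(50))
```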
As used herein, the term "minor allele frequency" refers to the frequency of the less common allele in a given population, and the term "Hardy-Weinberg equilibrium" refers to the state in which allele frequencies and genotype frequencies remain constant from generation to generation, that is, genetic equilibrium is maintained.
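The description does not say how MAF and HWE are computed. One common approach, shown here only as a sketch, derives both from genotype counts at a biallelic site and tests HWE with a one-degree-of-freedom chi-square statistic. The function name and the use of genotype counts as input are assumptions.

```python
import math
from typing import Tuple


def maf_and_hwe(n_ref_hom: int, n_het: int, n_alt_hom: int) -> Tuple[float, float]:
    """Return (MAF, HWE chi-square p-value) for one biallelic site.

    Inputs are the numbers of individuals that are homozygous for the first
    allele, heterozygous, and homozygous for the second allele.
    """
    n = n_ref_hom + n_het + n_alt_hom
    if n == 0:
        raise ValueError("no genotyped individuals at this site")
    p = (2 * n_ref_hom + n_het) / (2 * n)  # frequency of the first allele
    q = 1.0 - p
    maf = min(p, q)
    observed = (n_ref_hom, n_het, n_alt_hom)
    expected = (n * p * p, 2 * n * p * q, n * q * q)  # Hardy-Weinberg proportions
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return maf, p_value
```

A site whose genotype counts match the Hardy-Weinberg proportions exactly, such as 25/50/25, gives a p-value of 1, while a site with no heterozygotes at all, such as 50/0/50, gives a p-value close to 0.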
The method of the embodiments of the present invention is particularly suited to detecting single nucleotide polymorphism (SNP) variations and insertion/deletion (Ins/Del) variations. Accordingly, in a preferred embodiment of the present invention, when the total variation site depth and the variant base type depth of all individuals in the population are summed, three-base and other multi-base variations are removed and only single nucleotide polymorphism variations are retained.
Step S103: Plot the relationship between the total variation site depth and/or variant base type depth and dbSNP%, and filter the total variation site depth and/or variant base type depth, together with the minor allele frequency and/or Hardy-Weinberg equilibrium, against set filtering thresholds to remove values below the thresholds, to obtain a filtered accumulated population variation information result.
As used herein, the term "dbSNP%" refers to the proportion of the detected variation sites that are present in the Single Nucleotide Polymorphism Database (dbSNP). It is commonly used to measure the accuracy of detected SNPs.
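The data behind a depth versus dbSNP% plot could be computed along the following lines. This is a minimal sketch: the `known` set stands in for dbSNP entries loaded by the user, the cumulative "depth at least t" binning is an assumption, and the plotting itself is left out.

```python
from typing import Dict, Iterable, List, Set, Tuple

# A variation is identified by (chromosome, position, variant base).
Variant = Tuple[str, int, str]


def dbsnp_percent_by_depth(alt_depth: Dict[Variant, int],
                           known: Set[Variant],
                           thresholds: Iterable[int]) -> List[Tuple[int, float]]:
    """For each depth threshold t, return (t, dbSNP%) over the variations whose
    pooled variant base type depth is at least t. Thresholds that leave no
    variations are skipped."""
    curve = []
    for t in thresholds:
        kept = [v for v, depth in alt_depth.items() if depth >= t]
        if kept:
            in_db = sum(1 for v in kept if v in known)
            curve.append((t, 100.0 * in_db / len(kept)))
    return curve
```

Computing such a curve per chromosome and looking for the depth at which dbSNP% levels off is one way to choose the depth threshold used in the filtering step.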
In embodiments of the present invention, the filtering thresholds comprise a threshold for the total variation site depth and/or a threshold for the variant base type depth, together with a threshold for MAF and/or a threshold for HWE. In other words, the key indicators are the variant base type depth (or the total variation site depth), the MAF value, and/or the HWE value. The variant base type depth of a given base type is directly related to the credibility of that variation, and HWE and MAF are commonly used to remove low-confidence variation sites.
In one embodiment of the present invention, the filtering threshold for the variant base type depth is 30× and the filtering threshold for MAF is 0.05. That is, a variation with a variant base type depth of at least 30× and an MAF greater than 0.05 is considered to reach statistical significance.
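With these two thresholds, the population-level filter reduces to a simple predicate. The sketch below assumes the pooled depths and MAF values are held in dictionaries keyed by variation; no HWE cut-off is applied because the description does not give one.

```python
from typing import Dict, Set, Tuple

# A variation is identified by (chromosome, position, variant base).
Variant = Tuple[str, int, str]


def filter_population(alt_depth: Dict[Variant, int],
                      maf: Dict[Variant, float],
                      min_depth: int = 30,
                      min_maf: float = 0.05) -> Set[Variant]:
    """Keep variations whose pooled variant base type depth is at least
    min_depth and whose MAF is greater than min_maf."""
    return {v for v, depth in alt_depth.items()
            if depth >= min_depth and maf.get(v, 0.0) > min_maf}
```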
In other embodiments of the present invention, the filtering depth (variant base type depth or total variation site depth) may be adjusted appropriately according to the actual depth density distribution, where the "actual depth density distribution" may refer to the distribution of sequencing depth across the variation sites.
In other embodiments of the present invention, the sample size may be set according to the average depth of the data. For example, at an average depth of 1× the sample size should be 30, so that the variant base type depth can reach the set filtering condition of at least 30×. The larger the sample size, the more variation sites can be detected.
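The arithmetic behind the 1× and 30-sample example can be written out as below. It is a lower-bound estimate only: it assumes every read covering the site carries the variant base, so in practice more samples are needed, because only carriers contribute to the variant base type depth.

```python
import math


def min_sample_size(mean_depth: float, target_depth: float = 30.0) -> int:
    """Smallest number of samples whose pooled depth reaches target_depth,
    assuming each sample contributes mean_depth reads at the site."""
    if mean_depth <= 0:
        raise ValueError("mean_depth must be positive")
    return math.ceil(target_depth / mean_depth)
```

Under this assumption, `min_sample_size(1.0)` gives 30, matching the example above, and `min_sample_size(3.5)` gives 9.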
Step S104: Filter the variation information of each individual according to the filtered accumulated population variation information result, to obtain the final variation information of each individual.
In a specific embodiment of the present invention, filtering the variation information of each individual comprises the following: if the individual's original variation information contains a result whose chromosome, position, and variant base type match the filtered accumulated population variation information result, that result is retained; otherwise it is filtered out. In addition, if two or more variations are present at one variation site, that variation site is filtered out.
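The per-individual filter can be sketched as a set lookup plus a per-site count. Two points here are assumptions: the multi-variation check is applied within each individual's own calls, and "two or more" distinct variant bases at a site is taken as the removal condition.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# A variation is identified by (chromosome, position, variant base).
Variant = Tuple[str, int, str]


def filter_individual(calls: Iterable[Variant],
                      population_pass: Set[Variant]) -> List[Variant]:
    """Keep an individual's calls that match a population-level variation and
    sit at a site where the individual shows only one variant base."""
    calls = list(calls)
    # Number of distinct variant bases this individual shows at each site.
    per_site = Counter((chrom, pos) for chrom, pos, _ in set(calls))
    return [v for v in calls
            if v in population_pass and per_site[(v[0], v[1])] < 2]
```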
The steps above yield the final variation information of each individual. The accuracy of the variations can then be evaluated using dbSNP%, followed by subsequent inference and imputation and by research analysis.
The method of the present invention improves the utilization of data from low-depth regions in high-throughput sequencing data, turning data that would otherwise be ignored into usable data, while significantly improving the accuracy of the SNPs detected. The method is simple and easy to use, improves both accuracy and data utilization, and can also be added as a step preceding conventional variation imputation to improve conventional methods.
Those skilled in the art will understand that all or part of the functions of the methods in the above embodiments may be implemented in hardware or by a computer program. When all or part of the functions in the above embodiments are implemented by a computer program, the program may be stored in a computer-readable storage medium, which may include a read-only memory, a random access memory, a magnetic disk, an optical disc, a hard disk, and the like, and the functions are realized when a computer executes the program. For example, the program is stored in the memory of a device, and all or part of the above functions are realized when a processor executes the program in the memory. Alternatively, the program may be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disc, a flash drive, or a removable hard disk, and then downloaded or copied into the memory of a local device, or used to update the version of the local device's system; all or part of the functions in the above embodiments are realized when a processor executes the program in the memory.
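Accordingly, an embodiment of the present invention further provides a variation detection device based on sequencing data, as shown in FIG. 2, comprising:
an alignment and variation detection device 201, configured to align sequencing data from multiple individuals of the same population to a reference genome and perform variation detection, to obtain the alignment positions of the reads and variation information;
a summing and calculation device 202, configured to, based on the alignment positions and variation information, separately sum, for each variation site, the total variation site depth and the variant base type depth over all individuals in the population, and then calculate the minor allele frequency and/or Hardy-Weinberg equilibrium of each variation site in the population, to obtain an accumulated population variation information result;
a population variation information filtering device 203, configured to plot the relationship between the total variation site depth and/or variant base type depth and dbSNP%, and to filter the total variation site depth and/or variant base type depth, together with the minor allele frequency and/or Hardy-Weinberg equilibrium, against set filtering thresholds to remove values below the thresholds, to obtain a filtered accumulated population variation information result;
an individual variation information filtering device 204, configured to filter the variation information of each individual according to the filtered accumulated population variation information result, to obtain the final variation information of each individual.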
Another embodiment of the present invention further provides a variation detection device based on sequencing data, comprising:
a memory for storing a program;
a processor configured to execute the program stored in the memory so as to implement the following method:
aligning sequencing data from multiple individuals of the same population to a reference genome and performing variation detection, to obtain the alignment positions of the reads and variation information; based on the alignment positions and variation information, for each variation site, separately summing the total variation site depth and the variant base type depth over all individuals in the population, and then calculating the minor allele frequency and/or Hardy-Weinberg equilibrium of each variation site in the population, to obtain an accumulated population variation information result; plotting the relationship between the total variation site depth and/or variant base type depth and dbSNP%, and filtering the total variation site depth and/or variant base type depth, together with the minor allele frequency and/or Hardy-Weinberg equilibrium, against set filtering thresholds to remove values below the thresholds, to obtain a filtered accumulated population variation information result; and filtering the variation information of each individual according to the filtered accumulated population variation information result, to obtain the final variation information of each individual.
Yet another embodiment of the present invention further provides a computer-readable storage medium comprising a program, the program being executable by a processor to implement the following method:
aligning sequencing data from multiple individuals of the same population to a reference genome and performing variation detection, to obtain the alignment positions of the reads and variation information; based on the alignment positions and variation information, for each variation site, separately summing the total variation site depth and the variant base type depth over all individuals in the population, and then calculating the minor allele frequency and/or Hardy-Weinberg equilibrium of each variation site in the population, to obtain an accumulated population variation information result; plotting the relationship between the total variation site depth and/or variant base type depth and dbSNP%, and filtering the total variation site depth and/or variant base type depth, together with the minor allele frequency and/or Hardy-Weinberg equilibrium, against set filtering thresholds to remove values below the thresholds, to obtain a filtered accumulated population variation information result; and filtering the variation information of each individual according to the filtered accumulated population variation information result, to obtain the final variation information of each individual.
The technical solutions and effects of the present invention are described in detail below by way of an example. It should be understood that the example is merely illustrative and is not to be construed as limiting the scope of protection of the present invention.
Example
In this example, the samples are capture sequencing data targeting the human MHC region (a 4.9 Mb region on chromosome 6) from 105 individuals. About 50% of this data is concentrated in the MHC region. The remaining 50% or so is by-product data from other positions in the genome, and this portion is comparatively low-depth data. The method of the present invention was used to perform variation detection on this 50% of low-depth data. To evaluate the accuracy of the results, exome sequencing was also performed on the same 105 samples; in exome sequencing data, the exonic regions are covered at high depth. Comparing the low-depth data with the exome sequencing data within the exonic regions shows how concordant the variations are, and thereby measures the accuracy of the method of the present invention.
The specific steps of this example are as follows:
(1) The human reference genome hg19 was downloaded, the sequencing reads were aligned to hg19 using BWA, and variation information was detected using GATK, with the threshold lowered so as to retain more variation information.
(2) For the variation sites in the MHC capture sequencing data, this example studies the low-depth regions, whose average depth in this example is 3.5×. The total variation site depth and the variant base type depth of the 105 MHC samples were summed separately according to the chromosome, position, and variant base type of each variation site; three-base and other multi-base variations were removed, and the MAF value of each variation site in the population was calculated, giving the accumulated population variation information result for the 105 samples.
(3) After the total variation site depth, the variant base type depth, and the MAF values were calculated (partial results are shown in Table 1), the distributions of these values were computed. dbSNP% was found to stabilize at a variant base type depth of 30×. After the distributions of the variation sites and the MAF values were analyzed, the threshold for the total variation site depth was set to 30× and the MAF threshold to 0.05, and values below the thresholds were filtered out, giving the filtered accumulated population variation information result. FIG. 3 shows the relationship between variant base type depth and dbSNP% in this example.
Table 1
(4) The variation results of each individual were filtered according to the filtered accumulated population variation information result of the 105 samples. If the individual's original variation information contained a result whose chromosome, position, and variant base type matched the filtered accumulated population variation information result, that result was retained; otherwise it was filtered out. In addition, if two or more variations were present at one variation site, that variation site was filtered out.
(5) The accuracy of the variations was evaluated using dbSNP%, which reached 90.3% (93% when the Y chromosome and chromosome 9, which had poor coverage, were excluded). In addition, comparison with the corresponding high-depth exome sequencing data gave a concordance of 96.45%. The results are shown in Table 2, where "No." is the sample number among the 105 samples, "Untreated" is the concordance between the results obtained without the method of the present invention and the corresponding high-depth exome sequencing data, and "This method" is the concordance between the results obtained with the method of the present invention and the corresponding high-depth exome sequencing data.
Table 2
The present invention has been described above using specific examples, which serve only to aid understanding of the invention and are not intended to limit it. Those skilled in the art to which the invention pertains may make a number of simple deductions, variations, or substitutions based on the ideas of the invention.
Claims (10)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201780089042.9A CN110462063B (en) | 2017-05-23 | 2017-05-23 | Mutation detection method and device based on sequencing data and storage medium |
| PCT/CN2017/085448 WO2018214010A1 (en) | 2017-05-23 | 2017-05-23 | Method, device, and storage medium for detecting mutation on the basis of sequencing data |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2018214010A1 true WO2018214010A1 (en) | 2018-11-29 |
Family
ID=64396029
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2017/085448 Ceased WO2018214010A1 (en) | 2017-05-23 | 2017-05-23 | Method, device, and storage medium for detecting mutation on the basis of sequencing data |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN110462063B (en) |
| WO (1) | WO2018214010A1 (en) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111292803A (en) * | 2020-02-10 | 2020-06-16 | 广州金域医学检验集团股份有限公司 | Genome breakpoint identification method and application |
| CN113517022A (en) * | 2021-06-10 | 2021-10-19 | 阿里巴巴新加坡控股有限公司 | Gene detection method, feature extraction method, device, equipment and system |
| CN116864011A (en) * | 2023-06-29 | 2023-10-10 | 哈尔滨星云生物信息技术开发有限公司 | Colorectal cancer molecular marker identification method and system based on multi-omics data |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118969070A (en) * | 2024-07-17 | 2024-11-15 | 山东农业大学 | A method for mining population mitochondrial DNA variation based on low-depth sequencing data |
Citations (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103080333A (en) * | 2010-09-14 | 2013-05-01 | 深圳华大基因科技有限公司 | Methods and systems for detecting genomic structure variations |
| CN103617256A (en) * | 2013-11-29 | 2014-03-05 | 北京诺禾致源生物信息科技有限公司 | Method and device for processing file needing mutation detection |
| CN104204220A (en) * | 2011-12-31 | 2014-12-10 | 深圳华大基因医学有限公司 | A method for detecting genetic variation |
| CN105441432A (en) * | 2014-09-05 | 2016-03-30 | 天津华大基因科技有限公司 | Composition and application thereof to sequencing and variation detection |
| WO2016049878A1 (en) * | 2014-09-30 | 2016-04-07 | 深圳华大基因科技有限公司 | Snp profiling-based parentage testing method and application |
| CN105760712A (en) * | 2016-03-01 | 2016-07-13 | 西安电子科技大学 | Copy number variation detection method based on next generation sequencing |
| CN105925685A (en) * | 2016-05-13 | 2016-09-07 | 万康源(天津)基因科技有限公司 | Exome potential pathogenic mutation detection method based on family line |
| CN105989246A (en) * | 2015-01-28 | 2016-10-05 | 深圳华大基因研究院 | Variation detection method and device assembled based on genomes |
| CN106520940A (en) * | 2016-11-04 | 2017-03-22 | 深圳华大基因研究院 | Chromosomal aneuploid and copy number variation detecting method and application thereof |
Family Cites Families (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102952854B (en) * | 2011-08-25 | 2015-01-14 | 深圳华大基因科技有限公司 | Single cell sorting and screening method and device thereof |
| CN103525899B (en) * | 2012-07-02 | 2015-11-18 | 中国科学院上海生命科学研究院 | Diabetes B susceptibility loci and detection method and test kit |
| CN106715711B (en) * | 2014-07-04 | 2021-09-17 | 深圳华大基因股份有限公司 | Method for determining probe sequence and method for detecting genome structure variation |
| CN105512514B (en) * | 2014-09-23 | 2018-05-01 | 深圳华大基因股份有限公司 | MHC completion database, its construction method and application |
| JP2017016665A (en) * | 2015-07-03 | 2017-01-19 | 国立大学法人東北大学 | Method for selecting variation information from sequence data, system, and computer program |
| CN106156538A (en) * | 2016-06-29 | 2016-11-23 | 天津诺禾医学检验所有限公司 | Annotation method and annotation system for whole-genome variation data |
2017
- 2017-05-23: WO application PCT/CN2017/085448 (WO2018214010A1), status: Ceased
- 2017-05-23: CN application CN201780089042.9A (CN110462063B), status: Active
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111292803A (en) * | 2020-02-10 | 2020-06-16 | 广州金域医学检验集团股份有限公司 | Genome breakpoint identification method and application |
| CN111292803B (en) * | 2020-02-10 | 2024-04-26 | 广州金域医学检验集团股份有限公司 | Genome breakpoint identification method and application |
| CN113517022A (en) * | 2021-06-10 | 2021-10-19 | 阿里巴巴新加坡控股有限公司 | Gene detection method, feature extraction method, device, equipment and system |
| CN113517022B (en) * | 2021-06-10 | 2024-06-25 | 阿里巴巴达摩院(杭州)科技有限公司 | Gene detection method, feature extraction method, device, equipment and system |
| CN116864011A (en) * | 2023-06-29 | 2023-10-10 | 哈尔滨星云生物信息技术开发有限公司 | Colorectal cancer molecular marker identification method and system based on multi-omics data |
Also Published As
| Publication number | Publication date |
|---|---|
| CN110462063A (en) | 2019-11-15 |
| CN110462063B (en) | 2023-06-23 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| Jain et al. | Nanopore sequencing and assembly of a human genome with ultra-long reads | |
| AU2020244451B2 (en) | Methods and systems for detection of abnormal karyotypes | |
| Li et al. | SNP detection for massively parallel whole-genome resequencing | |
| Tatsumoto et al. | Direct estimation of de novo mutation rates in a chimpanzee parent-offspring trio by ultra-deep whole genome sequencing | |
| Korneliussen et al. | ANGSD: analysis of next generation sequencing data | |
| Dharanipragada et al. | iCopyDAV: Integrated platform for copy number variations—Detection, annotation and visualization | |
| Goya et al. | SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors | |
| Lange et al. | Analysis pipelines for cancer genome sequencing in mice | |
| Hills et al. | BAIT: organizing genomes and mapping rearrangements in single cells | |
| Bellos et al. | cnvHiTSeq: integrative models for high-resolution copy number variation detection and genotyping using population sequencing data | |
| Bataillon et al. | Inference of purifying and positive selection in three subspecies of chimpanzees (Pan troglodytes) from exome sequencing | |
| Betge et al. | Amplicon sequencing of colorectal cancer: variant calling in frozen and formalin-fixed samples | |
| WO2018214010A1 (en) | Method, device, and storage medium for detecting mutation on the basis of sequencing data | |
| Couldrey et al. | Detection and assessment of copy number variation using PacBio long-read and Illumina sequencing in New Zealand dairy cattle | |
| Li et al. | Single nucleotide polymorphism (SNP) detection and genotype calling from massively parallel sequencing (MPS) data | |
| CA3023283A1 (en) | Methods of determining genomic health risk | |
| EP3559841A1 (en) | Base coverage normalization and use thereof in detecting copy number variation | |
| WO2020063052A1 (en) | Method for acquiring cell-free fetal dna concentration, acquisition device, storage medium, and electronic device | |
| Niehus et al. | PopDel identifies medium-size deletions simultaneously in tens of thousands of genomes | |
| Tae et al. | Discretized Gaussian mixture for genotyping of microsatellite loci containing homopolymer runs | |
| Wei et al. | CONY: A Bayesian procedure for detecting copy number variations from sequencing read depths | |
| Lynch et al. | The linkage-disequilibrium and recombinational landscape in Daphnia pulex | |
| US20250157573A1 (en) | Genome wide assembly-based structural variant calling | |
| Shao et al. | A population model for genotyping indels from next-generation sequence data | |
| Söylev et al. | CONGA: Copy number variation genotyping in ancient genomes and low-coverage sequencing data |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 17911324; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 17911324; Country of ref document: EP; Kind code of ref document: A1 |