
CN116246703A - A quality assessment method for nucleic acid sequencing data - Google Patents

Info

Publication number
CN116246703A
Authority
CN
China
Prior art keywords
sequencing
nucleic acid
polymer
sequence
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310295466.8A
Other languages
Chinese (zh)
Inventor
周文雄
李雷
李昂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Saina Biotechnology Guangzhou Co ltd
Original Assignee
Saina Biotechnology Guangzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Saina Biotechnology Guangzhou Co ltd
Priority to CN202310295466.8A
Publication of CN116246703A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method for assessing the quality of nucleic acid sequencing data. Quality is assessed with the polymer in the nucleic acid sequence as the basic unit, rather than the base as in existing methods, which makes the method better suited to sequences produced by sequencing methods with an open 3' end.

Figure 202310295466

Description

A quality assessment method for nucleic acid sequencing data

Technical field

The technology disclosed in the present invention relates to a method for assessing the quality of nucleic acid sequencing data and belongs to the field of gene sequencing.

Background

Gene sequencing determines the sequence of genetic material and is widely used in clinical tumor typing, microbial identification, and the diagnosis of genetic diseases. Besides producing the sequence of the nucleic acid sample under test, today's mainstream sequencing technologies also assign each called base a quality value that estimates how accurate the call is. This quality value is usually expressed as a Phred value:

$$q = -10\log_{10}(1 - a)$$

where a is the accuracy of the base call and q is the Phred value. For example, accuracies of 99%, 99.9%, and 99.99% correspond to Phred values of 20, 30, and 40, respectively.
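The Phred relation can be checked numerically. The following is a minimal sketch; the function names are illustrative and do not come from the patent.

```python
import math

def accuracy_to_phred(a: float) -> float:
    """Phred value q = -10 * log10(1 - a) for an accuracy a in [0, 1)."""
    if not 0.0 <= a < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - a)

def phred_to_accuracy(q: float) -> float:
    """Inverse relation: a = 1 - 10 ** (-q / 10)."""
    return 1.0 - 10.0 ** (-q / 10.0)

for a in (0.99, 0.999, 0.9999):
    print(a, round(accuracy_to_phred(a), 6))
```

Rounded to the nearest integer, the three accuracies give the Phred values 20, 30, and 40 quoted in the text.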

Quality values play a very important role in the bioinformatic analysis of nucleic acid sequencing data. For example, in variant calling, if a base in the read differs from the corresponding base in the reference sequence, the site is called as a mutation when the base's quality value is high; when the quality value is low, the difference is attributed to a sequencing error and no mutation is called.

For sequencing technologies with an open 3' end, such as 454, Ion Torrent, and fluorogenic sequencing, the reaction proceeds one polymer at a time, and each polymer is a single unit from the standpoint of the sequencing chemistry. The existing practice of assigning a quality value to each base is a holdover from the Sanger sequencing era. It became widespread only because it happens to fit Illumina's sequencing chemistry; it is poorly suited to sequencing technologies with an open 3' end and has many shortcomings. For example, many sequencing technologies are prone to insertion or deletion errors when reading longer homopolymers, such as AAAA being read as AAAAA or AAA. Because such an error belongs to the homopolymer as a whole, it is hard to assign an accurate quality value to each individual base in it. In some cases, bases that are not prone to substitution errors, and should therefore be retained when calling single-base substitutions, receive low quality values because they are prone to insertion or deletion errors. They are then easily discarded during variant calling, producing false negatives. A quality assessment method better suited to reads from sequencing with an open 3' end is therefore needed.

Summary of the invention

The invention discloses a method for assessing the quality of nucleic acid sequencing data. Quality is assessed with the polymer in the nucleic acid sequence as the basic unit, rather than the base as in existing methods, which makes the method better suited to sequences produced by sequencing methods with an open 3' end.

Specifically, the invention provides a method for assessing the quality of nucleic acid sequencing data, characterized in that it comprises:

providing a nucleic acid sequence to be tested and, taking the polymer as the basic unit, computing the sequencing signal features of each polymer;

predicting a quality score for the polymer from those sequencing signal features, using a quantization scheme calibrated by training;

where the training-calibrated quantization scheme comprises:

for a provided standard nucleic acid sequence, taking the polymer as the basic unit, computing the sequencing signal features of each polymer, and labeling each polymer as correctly or incorrectly sequenced according to the alignment of the standard nucleic acid sequence; and training a classifier to fit the relationship between the polymers' sequencing signal features and their labels.
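The labeling step can be sketched for the simplest case, homopolymers. The code below is an illustration under a strong assumption that is not in the patent: the read and the reference contain the same bases in the same order and differ only in homopolymer length, so no aligner is needed. The function names are hypothetical.

```python
from itertools import groupby

def homopolymers(seq):
    """Split a sequence into (base, length) runs, e.g. 'AAACC' -> [('A', 3), ('C', 2)]."""
    return [(base, len(list(run))) for base, run in groupby(seq)]

def label_polymers(read, reference):
    """Label each homopolymer of the read as correct (True) or wrong (False).

    Assumes the two sequences have identical run order; real data would
    need an alignment against the reference instead of this shortcut.
    """
    read_runs = homopolymers(read)
    ref_runs = homopolymers(reference)
    if [b for b, _ in read_runs] != [b for b, _ in ref_runs]:
        raise ValueError("run order differs; a real aligner is needed")
    return [(base, n, n == m) for (base, n), (_, m) in zip(read_runs, ref_runs)]

# The read has one extra A in the first homopolymer.
print(label_polymers("AAAAACGG", "AAAACGG"))
```

Each labeled polymer, together with its signal features, would then form one training example for the classifier.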

In a preferred embodiment, the polymers include homopolymers, binary copolymers, ternary copolymers, and the like.

In a preferred embodiment, the sequencing signal features of a polymer are the features of the signal produced when that polymer undergoes the sequencing chemistry. They include, but are not limited to: the base types that make up the polymer; the polymer's length; the cycle number of the sequencing reaction; the signal intensity; how close the signal intensity (and the neighboring signal intensities) are to an integer; the parameters of the sequencing signal (unit signal, background signal, lead coefficient, lag coefficient, decay coefficient); and the degree of dephasing when the polymer is sequenced.
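A few of these features can be collected per polymer as follows. This is a minimal sketch that assumes the signal intensity has already been normalized so that one incorporated nucleotide gives an intensity of about 1; the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PolymerFeatures:
    base: str                 # base type(s) making up the polymer
    length: int               # called polymer length
    cycle: int                # sequencing reaction cycle number
    intensity: float          # normalized signal intensity
    integer_deviation: float  # distance of the intensity from the called length

def extract_features(base: str, cycle: int, intensity: float) -> PolymerFeatures:
    # Assumption: the called length is the nearest integer, and at least 1.
    length = max(1, round(intensity))
    return PolymerFeatures(base, length, cycle, intensity, abs(intensity - length))

print(extract_features("A", cycle=3, intensity=3.8))
```

A signal of 3.8 is called as length 4 with a deviation of about 0.2. Other features named in the text, such as lead and lag coefficients, would need the full signal model and are left out here.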

In a preferred embodiment, fitting the relationship between the polymers' sequencing signal features and their labels includes converting the classifier's fitted output into a quality score.

In a preferred embodiment, a standard nucleic acid sample is a nucleic acid sample whose source and sequence are known and which is highly homozygous at almost all sites in the genome, including λ phage DNA, Escherichia coli DNA, Saccharomyces cerevisiae DNA, and the like.

In a preferred embodiment, the quality score is a number that characterizes the sequencing accuracy of a polymer, chosen from accuracy, error rate, Phred value, and the like.

In a preferred embodiment, the quality score is logarithmically based on the polymer call error probability, and the quality scores include Q10, Q15, Q20, Q25, Q30, Q35, Q40, Q45, Q50, Q55, Q60, and the like.

In a preferred embodiment, the classifier includes, but is not limited to, linear regression, polynomial regression, logistic regression, support vector machines, artificial neural networks, random forests, the Phred algorithm, ensemble learning, and the like.
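As one concrete instance of the listed classifiers, the sketch below fits a one-feature logistic regression by plain gradient descent and converts the predicted error probability into a Phred value. The training data are invented for illustration, polymer length is chosen arbitrarily as the single feature, and the learning rate and epoch count are assumptions.

```python
import math

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(error | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += p - y
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def error_probability(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def phred(p_error, max_q=60.0):
    """Logarithmic quality score, capped at max_q."""
    return min(max_q, -10.0 * math.log10(p_error))

# Hypothetical training set: polymer length and label (1 = sequenced wrongly).
lengths = [1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8]
errors = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

w, b = fit_logistic(lengths, errors)
for n in (1, 4, 8):
    print(n, round(phred(error_probability(n, w, b)), 1))
```

Because errors in this toy data set are concentrated in long polymers, the fitted score decreases with length. A production model would use more features and a held-out set to check calibration.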

In a preferred embodiment, training the classifier includes dividing the polymers into several classes according to their sequencing signal features and computing the sequencing accuracy of each class.
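This class-based calibration amounts to a lookup table from feature class to empirical accuracy. A minimal sketch follows; the class keys, the add-one smoothing, and the cap of Q60 are assumptions made for illustration, not details from the patent.

```python
import math
from collections import defaultdict

def calibrate_by_class(polymers, max_q=60.0):
    """polymers: iterable of (class_key, is_correct). Returns {class_key: Phred value}."""
    counts = defaultdict(lambda: [0, 0])  # class -> [n_correct, n_total]
    for key, ok in polymers:
        counts[key][0] += int(ok)
        counts[key][1] += 1
    table = {}
    for key, (n_ok, n) in counts.items():
        # Add-one smoothing (an assumption) keeps the error rate away from zero.
        err = (n - n_ok + 1) / (n + 2)
        table[key] = min(max_q, -10.0 * math.log10(err))
    return table

# Hypothetical labeled polymers grouped into two classes.
data = [("short", True)] * 998 + [("long", True)] * 89 + [("long", False)] * 9
print(calibrate_by_class(data))
```

A new polymer is then scored by computing its class from its signal features and looking up the table.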

In a preferred embodiment, training the classifier is based on a maximum-likelihood probability distribution model. The probability distribution is one with a unimodal shape, including but not limited to the two-point (Bernoulli), binomial, negative binomial, Poisson, geometric, exponential, normal, gamma, chi-square, t, F, beta, and log-normal distributions, as well as high-dimensional extensions of these distributions.

In a preferred embodiment, the method further comprises performing bioinformatic analysis of the nucleic acid sequence to be tested according to the quality scores of its polymers.

In a preferred embodiment, the bioinformatic analysis includes selecting high-quality nucleic acid sequences according to the assigned quality values. Selection methods include, but are not limited to: selecting nucleic acid sequences whose quality values are all above or below a threshold; selecting nucleic acid sequences whose mean quality value is above or below a threshold; selecting regions of a nucleic acid sequence whose quality values are all above or below a threshold; and selecting regions of a nucleic acid sequence whose mean quality value is above or below a threshold.
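Three of these selection rules can be sketched as follows, with one quality value per polymer. The function names and the threshold of 30 are illustrative assumptions.

```python
def all_above(qualities, threshold):
    """True if every polymer quality value reaches the threshold."""
    return all(q >= threshold for q in qualities)

def mean_above(qualities, threshold):
    """True if the mean polymer quality value reaches the threshold."""
    return sum(qualities) / len(qualities) >= threshold

def high_quality_regions(qualities, threshold):
    """Index ranges (start, end), end exclusive, of consecutive polymers with q >= threshold."""
    regions, start = [], None
    for i, q in enumerate(qualities):
        if q >= threshold and start is None:
            start = i
        elif q < threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(qualities)))
    return regions

q = [35, 32, 12, 40, 38, 31]
print(all_above(q, 30), mean_above(q, 30), high_quality_regions(q, 30))
```

For this read the strict rule rejects the whole sequence because of one low-quality polymer, while the region rule keeps the two high-quality stretches on either side of it.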

In a preferred embodiment, the bioinformatic analysis includes aligning the nucleic acid sequence to a reference sequence according to the assigned quality values.

In a preferred embodiment, the bioinformatic analysis includes identifying genetic variants according to the alignment result and the quality values assigned to the aligned sequences.

In a preferred embodiment, when identifying genetic variants, certain features of the alignment result can be used to remove potential false positives or false negatives.

In a preferred embodiment, the bioinformatic analysis includes assembling nucleic acid sequences into longer nucleic acid sequences according to the assigned quality values.

In a preferred embodiment, the bioinformatic analysis includes performing at least two rounds of orthogonal degenerate sequencing, obtaining quality values for the degenerate polymer lengths, and using those quality values for error correction.

The invention also provides a method for assessing the quality of nucleic acid sequencing data, characterized in that it comprises: performing fuzzy sequencing or missing sequencing on the nucleic acid sample to be tested to obtain input data, generating degenerate polymer length information from the input data, and computing the sequencing signal features of the degenerate polymer lengths;

predicting a quality score for the polymer from those sequencing signal features, using a quantization scheme calibrated by training;

where the training-calibrated quantization scheme comprises:

sequencing a standard nucleic acid sample to obtain nucleic acid sequences, computing the sequencing signal features of the nucleic acid sequences, aligning the nucleic acid sequences to a reference sequence, and labeling the polymers of the nucleic acid sequences as correctly or incorrectly sequenced; and training a classifier to fit the relationship between the polymers' sequencing signal features and their labels.

Benefits of the invention

Compared with the prior art, the method of the invention has the following advantages:

The method takes the polymer as the basic unit of sequencing quality assessment and assigns a quality value to each polymer making up the nucleic acid sequence, rather than to each base. This is particularly well suited to sequencing reactions with an open 3' end and helps produce more accurate bioinformatic results, including variant identification.

Description of the drawings

The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the invention will be obtained by reference to the following detailed description, which sets out illustrative embodiments that use the principles of the invention, and to the accompanying drawings, in which:

Figure 1 shows an example of sequencing signal features.

Detailed description

To further illustrate the core of the invention, the following examples are given. The examples serve to explain the summary of the invention further and do not limit the invention.

Unless otherwise defined, all technical and scientific terms used herein have the meaning commonly understood by one of ordinary skill in the art. To better disclose the method and content of the invention, the key terms are explained in detail below.

Explanation of terms

Each

The term "each" is intended to identify an individual item in a collection, but does not necessarily refer to every item in the collection. Exceptions may apply where expressly disclosed or where the context clearly dictates otherwise.

Include

The term "include" is intended herein to be open-ended, covering not only the listed elements but also any additional elements.

Sample

In the present invention, the term "sample" refers to a specimen, usually derived from a biological fluid, cell, tissue, organ, or organism, that contains a nucleic acid or a mixture of nucleic acids comprising at least one nucleic acid sequence to be sequenced and/or phased. Such samples include, but are not limited to, blood, blood fractions, sputum/oral fluid, amniotic fluid, fine-needle biopsy samples (e.g., surgical biopsy, fine-needle biopsy, etc.), urine, peritoneal fluid, pleural fluid, tissue explants, organ cultures, and any other tissue or cell preparation, or fractions or derivatives thereof, or fractions or derivatives isolated therefrom. Although samples are typically taken from human subjects (e.g., patients), they can be taken from any organism that has chromosomes, including but not limited to bacteria, viruses, fungi, birds, and mammals. A sample may be used directly as obtained from the biological source or after pretreatment that alters its properties. For example, pretreatment may include preparing plasma from blood or diluting viscous fluids. Pretreatment methods may also involve, but are not limited to, filtration, precipitation, dilution, distillation, mixing, centrifugation, freezing, lyophilization, concentration, amplification, nucleic acid fragmentation, inactivation of interfering components, addition of reagents, and lysis.

Standard nucleic acid sample

A standard nucleic acid sample is a nucleic acid sample whose source and sequence are known and which is highly homozygous at almost all sites in the genome. The advantage of sequencing a standard nucleic acid sample is that a difference between the sequencing result and the reference genome can be attributed with reasonable confidence either to a variant or to a sequencing error. For example, the standard nucleic acid sample can be nucleic acid from Escherichia coli or Saccharomyces cerevisiae, or λ phage DNA produced by New England Biolabs.

Error-correcting code (ECC) sequencing and error correction

In this application, ECC sequencing has the following characteristics:

The method requires multiple rounds of sequencing. The information obtained from each round is incomplete, while the total information from all rounds is redundant. This redundancy is used to detect and correct potential sequencing errors and so obtain a highly accurate sequence. Taking 2+2 sequencing as an example, the sequencing reagents are divided by base pairing into three groups of two (for example, the MK, RY, and WS groups), and the DNA under test is sequenced in three independent rounds, producing three degenerate sequence encodings. The three encodings check one another: decoding them not only yields the true base sequence but also allows sites that were sequenced wrongly in a single round to be corrected. This process is error correction.
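The redundancy among the three encodings can be illustrated base by base. The sketch below is a simplification: it works on per-base letters, whereas the method operates on degenerate polymer lengths and signals, and it only flags inconsistent positions instead of correcting them. The function names are hypothetical.

```python
# Degenerate letters used by the three 2+2 reagent groupings (IUPAC notation).
CODES = {"M": "AC", "K": "GT", "R": "AG", "Y": "CT", "W": "AT", "S": "CG"}

def encode(seq, pair):
    """Encode a base sequence with one grouping, e.g. pair='MK' maps A,C -> M and G,T -> K."""
    return "".join(next(c for c in pair if base in CODES[c]) for base in seq)

def decode(mk, ry, ws):
    """Recover bases from three encodings; 'N' marks positions where they disagree."""
    out = []
    for m, r, w in zip(mk, ry, ws):
        common = set(CODES[m]) & set(CODES[r]) & set(CODES[w])
        out.append(common.pop() if len(common) == 1 else "N")
    return "".join(out)

seq = "ACGTTGCA"
mk, ry, ws = encode(seq, "MK"), encode(seq, "RY"), encode(seq, "WS")
print(mk, ry, ws, decode(mk, ry, ws))

# A single-round error makes the three encodings inconsistent at that position.
print(decode("K" + mk[1:], ry, ws))
```

Any two encodings already determine a base, so the third acts as a parity check. Deciding which round is wrong requires extra evidence, for example the per-polymer quality values.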

Sequence

In the present invention, a "sequence" is a nucleic acid sequence that comprises or represents a chain of nucleotides coupled to one another. It may be a determinate nucleotide sequence or a degenerate base sequence, and may be based on DNA or RNA. A sequence may include multiple subsequences. For example, a single sequence (e.g., the sequence of a PCR amplicon) may have 350 nucleotides, and a sample read may include multiple subsequences within those 350 nucleotides. Here a sequence is divided with the polymer, not the single base, as the basic unit; a polymer may be 1 bp long or longer.

Degenerate base

Following the IUPAC nucleic acid notation, the letters in Table 1 below denote degenerate bases; for example, the letter M denotes A and/or C. The degenerate polymer MMKKK has a length of 5, that is, its degenerate polymer length (DPL) is 5.

Table 1

Letter    Bases represented
M         A/C
K         G/T
R         A/G
Y         C/T
W         A/T
S         C/G
B         C/G/T
D         A/G/T
H         A/C/T
V         A/C/G

Reference sequence

A reference sequence is any particular known genomic sequence, partial or complete, of any organism that can be used as a reference for identified sequences from a subject. For example, reference genomes for human subjects and many other organisms can be found at the National Center for Biotechnology Information (NCBI) at ncbi.nlm.nih.gov. "Genome" refers to the complete genetic information of an organism or virus expressed as nucleic acid sequences; it includes both genes and the non-coding sequences of DNA. A reference sequence can be larger than the reads aligned to it. For example, it can be at least about 100 times larger, or at least about 1000 times larger, or at least about 10,000 times larger, or at least about 10^5 times larger, or at least about 10^6 times larger, or at least about 10^7 times larger. In one example, the reference genome sequence is the sequence of the full-length human genome. In another example, the reference genome sequence is limited to a specific human chromosome, such as chromosome 13. In some implementations, the reference chromosome is a chromosome sequence from human genome version hg19. Such sequences may be called chromosome reference sequences, but the term reference genome is intended to cover them.

Other examples of reference sequences include the genomes of other species, as well as chromosomes and subchromosomal regions (such as strands) of any species. In various implementations, the reference genome is a consensus sequence or other combination derived from multiple individuals. In some applications, however, the reference sequence may be taken from a particular individual. In other embodiments, "genome" also covers the so-called "graph genome", which uses a specific storage format and representation of the genome sequence. In one implementation, the graph genome stores its data in a linear file.

Alignment (align or alignment)

Alignment is a common concept in bioinformatics, where it is often used to compare the similarity between different nucleic acids or between different proteins. In the present invention, alignment refers to comparing a nucleic acid sequence with a reference sequence to determine whether the reference sequence contains the nucleic acid sequence. Commonly used sequence alignment algorithms and software include, but are not limited to, the Smith-Waterman algorithm, Bowtie, BWA, SOAP, the Needleman-Wunsch algorithm, Bowtie2, BLAST, ELAND, TMAP, MAQ, minimap2, and SHRiMP.
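To make the idea of local alignment concrete, here is a minimal sketch of the Smith-Waterman score with a linear gap penalty. The scoring parameters are arbitrary choices for illustration, and the function returns only the best score, not the alignment itself; production aligners such as those listed above use indexing and heuristics far beyond this.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the alignment local.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

# The read ACGT occurs exactly inside the reference, so all four bases match.
print(smith_waterman_score("ACGT", "TTACGTTT"))
```

With a match score of 2, the four matching bases give a score of 8, while two sequences with no base in common score 0.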

Quality score

Also called the quality value, this is a number that characterizes sequencing accuracy. It can be expressed in different mathematical forms, such as accuracy, error rate, or Phred value. For example, accuracies of 99%, 99.9%, and 99.99% correspond to error rates of 1%, 0.1%, and 0.01% and to Phred values of 20, 30, and 40, respectively. In some implementations, for ease of recording and storage, 33 is added to the Phred value and the result is written as the ASCII character with that code; for example, Phred values 20, 30, and 40 become the characters '5', '?', and 'I'. The form in which the quality value is expressed does not affect the substance of the invention.
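The offset-33 character encoding can be sketched in a few lines; the function names are illustrative.

```python
def phred_to_char(q: int, offset: int = 33) -> str:
    """Encode an integer Phred value as a single ASCII character."""
    return chr(q + offset)

def char_to_phred(c: str, offset: int = 33) -> int:
    """Decode a quality character back to its integer Phred value."""
    return ord(c) - offset

print("".join(phred_to_char(q) for q in (20, 30, 40)))
```

The three values encode to the string `5?I`, matching the characters given in the text.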

Fuzzy sequencing

In the present invention, fuzzy sequencing refers to a single round of 2+2 sequencing, a single round of 1+3 sequencing, or a single round of 3x4 sequencing.

2+2 sequencing uses two different sequencing reagents, a first and a second, which are added in alternating cycles. The first sequencing reagent contains two different nucleotide monomers carrying a detectable label. The second sequencing reagent contains two different nucleotide monomers carrying a detectable label, and these differ from the monomers in the first reagent. The second reagent is supplied after the first, and the signal generated by the detectable label (for example, a fluorescent signal) is detected after the nucleotide monomers have been incorporated into the nucleic acid under test. In 2+2 sequencing, combinations of fluorescently labeled nucleotides are used to obtain fluorescent signal values related to the target DNA sequence. The possible combinations are:

M/K mode: dA4P and dC4P are supplied in odd-numbered cycles and dG4P and dT4P in even-numbered cycles, or the reverse;
R/Y mode: dA4P and dG4P are supplied in odd-numbered cycles and dC4P and dT4P in even-numbered cycles, or the reverse;
W/S mode: dA4P and dT4P are supplied in odd-numbered cycles and dC4P and dG4P in even-numbered cycles, or the reverse.

Sequencing the nucleic acid under test for multiple cycles in one of these modes (for example, M/K mode) is called one round.

1+3 sequencing is similar to 2+2 sequencing. It uses two different sequencing reagents, a first and a second, added in alternating cycles. The first sequencing reagent contains three different nucleotide monomers carrying a detectable label; the second contains one nucleotide monomer carrying a detectable label, which differs from the monomers in the first reagent. The second reagent is supplied after the first, and the signal generated by the detectable label (for example, a fluorescent signal) is detected after the nucleotide monomers have been incorporated into the nucleic acid under test.

3x4 sequencing uses four different sequencing reagents, a first, second, third, and fourth, added cyclically. Each reagent contains three different nucleotide monomers carrying a detectable label. Each reagent is supplied after the previous one, and the signal generated by the detectable label is detected after the nucleotide monomers have been incorporated into the nucleic acid under test. For example, the four sequencing reagents are B, D, H, and V.
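The alternating reagent cycles of 2+2 sequencing can be simulated for an ideal, error-free reaction. The sketch below returns the number of nucleotides incorporated in each cycle, which is the degenerate polymer length seen by that cycle. It assumes perfect chemistry (no lead, lag, or decay); the function name and defaults are hypothetical.

```python
def two_plus_two_signal(seq, first="AC", second="GT"):
    """Ideal per-cycle incorporation counts when reagents alternate first, second, first, ..."""
    if set(seq) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G, T")
    if set(first) | set(second) != set("ACGT") or set(first) & set(second):
        raise ValueError("the two reagents must split A, C, G, T between them")
    reagents = (set(first), set(second))
    signal, i, cycle = [], 0, 0
    while i < len(seq):
        allowed = reagents[cycle % 2]
        n = 0
        # With an open 3' end, extension continues until a base outside the reagent is met.
        while i < len(seq) and seq[i] in allowed:
            n += 1
            i += 1
        signal.append(n)
        cycle += 1
    return signal

print(two_plus_two_signal("ACCGTTTA"))                          # M/K mode
print(two_plus_two_signal("ACCGTTTA", first="AG", second="CT"))  # R/Y mode
```

In M/K mode the sequence ACCGTTTA reads as the degenerate polymers MMM, KKKK, M, giving the signal 3, 4, 1. Only the first cycle can be empty, which happens when the sequence starts with a base from the second reagent.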

Fuzzy sequence information is obtained from the signal. Fuzzy sequence information is sequence information from which a definite base sequence cannot be derived. Here, definite base sequence information means nucleic acid sequence information encoded with A, G, T and C, or with A, G, U and C, where the bases may be methylated. Fuzzy (ambiguous) base sequences are a common concept in research; for example, the letter W is used to represent base A and/or T. A related definition is also given on Wikipedia (https://en.wikipedia.org/wiki/Nucleotide).

Deletion sequencing and missing sequences

Multiple cycles of sequencing chemistry are performed on the nucleic acid molecules to be tested on a gene sequencing chip. At least one cycle is selected in advance, and signal acquisition is performed only in the selected cycles; in the other cycles only the sequencing chemistry is performed, without signal acquisition. The acquired signals are encoded as a sequence, giving a missing nucleic acid sequence. This process is called deletion sequencing.

A missing sequence is a sequence with missing information, meaning that part of the sequence information in the obtained sequence is absent. For example, the sequence ATTCGNNTTT is a missing sequence: N indicates that the sequence information at that position is unknown, i.e., missing.

Multi-base sequencing

Multi-base sequencing is a sequencing reaction with an open 3' end. In this kind of reaction, the 3' end of the nucleotide substrate is a hydroxyl group, so extension can proceed freely, and in theory one sequencing reaction can extend by one or more nucleotides. Common multi-base sequencing methods include Ion Torrent semiconductor sequencing, 454 pyrosequencing, and the fuzzy sequencing of Saina Biotechnology.

Unit signal

The unit signal is the increase in the signal detected by the sequencer for each base by which the DNA is extended. It depends on the number of DNA molecules undergoing the extension reaction, the camera exposure time, the excitation light intensity, the camera sensitivity, and similar factors. The unit signal is the signal intensity of a one-base extension at a given sequencing site; it is a physical quantity proportional to the number of template DNA molecules at that site, and the intensity of the sequencing signal equals the unit signal multiplied by the number of bases extended. For any sequencing technology, the accuracy with which the unit signal is measured directly affects the accuracy of the sequencing result.
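The relationship between signal intensity, unit signal and number of extended bases can be sketched as follows. The source states only that the signal equals the unit signal multiplied by the number of extended bases; adding the background signal as a constant offset is an assumption made here for illustration, and the function names are hypothetical.

```python
def expected_signal(unit_signal, n_bases, background=0.0):
    """Ideal signal for an extension of n_bases (additive background is an assumption)."""
    return background + unit_signal * n_bases


def estimate_extension(signal, unit_signal, background=0.0):
    """Invert the model: nearest whole number of extended bases, never negative."""
    return max(0, round((signal - background) / unit_signal))


ideal = expected_signal(unit_signal=250.0, n_bases=3, background=40.0)
called = estimate_extension(ideal, unit_signal=250.0, background=40.0)
```

A misestimated unit signal shifts every call made this way, which is why its measurement accuracy matters for longer extensions in particular.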

Background signal

The background signal is the baseline signal detected by the sequencer when no base is extended. It depends on factors such as the chip material and the spontaneous hydrolysis of the sequencing reaction substrate, and it may also change as the read length increases. The term is used in its general sense.

Degenerate polymer length (DPL)

Degenerate sequencing is a form of multi-base sequencing. Unlike single-base sequencing, in which each round of reaction extends by only one nucleotide, multi-base sequencing may extend by several nucleotides in each round. The intensity of the fluorescence signal released by the sequencing reaction is positively correlated with the number of fluorophores released. Under ideal conditions, without signal decay or dephasing, the fluorescence signal of each round reflects the number of bases extended in that round, which is called the degenerate polymer length (DPL).
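A minimal sketch of the ideal case described above: given a template and an alternating series of reagents, it returns the number of bases extended in each round, i.e., the DPL series that an ideal signal would reflect. The function name and the representation of a reagent as a string of bases are assumptions for illustration; decay and dephasing are deliberately ignored.

```python
def ideal_dpl(template, flows):
    """Bases extended per round under ideal conditions (no decay, no dephasing).

    template: the strand being synthesized, written as A/C/G/T.
    flows: one string of bases per round, e.g. "AC" for an M round.
    """
    dpl, pos = [], 0
    for flow in flows:
        extended = 0
        # Extension continues freely while the next base is in the reagent.
        while pos < len(template) and template[pos] in flow:
            pos += 1
            extended += 1
        dpl.append(extended)
    return dpl


mk_flows = ["AC", "GT"] * 2          # M, K, M, K
dpl_a = ideal_dpl("ACCGTTA", mk_flows)
dpl_b = ideal_dpl("GAT", mk_flows)
```

Only the first round (or rounds after the template is exhausted) can give a DPL of zero in a 2+2 mode, because a base that stops one reagent must belong to the other.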

Round (cycle)

One round of sequencing reaction, also called one sequencing cycle, is the process of providing one sequencing reagent and detecting the signal generated by the detectable label after the labeled nucleotides in that reagent have been incorporated into the nucleic acid to be tested.

Polymer

The polymers of the present invention include homopolymers, binary copolymers and terpolymers. A polymer may be 1 bp long or longer.

Homopolymer

A homopolymer in the present invention is a polymer composed of multiple nucleotide monomers of the same kind; for example, AAAA and TTTTT are homopolymers.

Binary copolymer

A binary copolymer is a polymer formed from two different monomers and containing two different monomer units. In the present invention it refers specifically to a polymer composed of two kinds of nucleotide monomers. For example, in MK degenerate sequencing, M consists of the nucleotide monomers A and C and K consists of the nucleotide monomers G and T, so ACAC and GGTT are both binary copolymers.

Terpolymer

Similarly to a binary copolymer, a terpolymer is a polymer formed from three different monomers. In the present invention it refers specifically to a polymer composed of three kinds of nucleotide monomers. For example, in AB degenerate sequencing, B may consist of the nucleotide monomers C, G and T, so TCGG and GTGCC are both terpolymers.

Classifier

Classification with a classifier is a very important method in data mining. In machine learning, the role of a classifier is to determine the category of a new observation on the basis of training data whose categories have been labeled.

Constructing and applying a classifier generally involves the following steps:

select samples (including positive and negative samples) and divide them into training samples and test samples;

run the classifier algorithm on the training samples to generate a classification model;

run the classification model on the test samples to generate predictions.

Preferably, the necessary evaluation metrics are calculated from the predictions in order to assess the performance of the classification model.
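The steps above can be sketched with a deliberately simple one-feature classifier. Everything here is an assumption made for illustration: the feature (a deviation value where smaller means more likely correct), the threshold rule, the toy data and all function names. The patent does not prescribe this classifier.

```python
import random


def split(samples, test_fraction=0.25, seed=0):
    """Step 1: shuffle and divide samples into training and test parts."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train_threshold(train):
    """Step 2: pick the threshold t that best separates the training samples.

    A sample (x, label) is predicted positive when x <= t.
    """
    candidates = sorted({x for x, _ in train})
    return max(candidates, key=lambda t: sum((x <= t) == y for x, y in train))


def accuracy(threshold, samples):
    """Steps 3 and 4: predict on samples and compute the fraction predicted correctly."""
    return sum((x <= threshold) == y for x, y in samples) / len(samples)


# Toy data: (feature value, True if the polymer was sequenced correctly).
samples = [(0.02, True), (0.05, True), (0.10, True), (0.08, True),
           (0.40, False), (0.45, False), (0.30, False), (0.35, False)]
train, test = split(samples)
threshold = train_threshold(train)
train_accuracy = accuracy(threshold, train)
test_accuracy = accuracy(threshold, test)
```

Evaluating on held-out test samples rather than on the training samples is what makes the reported metric a fair estimate of performance on new data.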

Genetic variation

A genetic variation is a nucleic acid sequence that differs from a reference sequence. Typical variations include, but are not limited to, single nucleotide variations (SNVs), short insertion and deletion polymorphisms (indels), copy number variations (CNVs), epigenetic variations, microsatellite markers or short tandem repeats, and structural variations. Somatic variant calling is the task of identifying variants that are present at low frequency in a DNA sample, and it is of great interest in the context of cancer therapy. Cancer is caused by the accumulation of mutations in DNA. DNA samples from tumors are usually heterogeneous, comprising some normal cells, some cells at an early stage of cancer progression (with fewer mutations) and some late-stage cells (with more mutations). Because of this heterogeneity, somatic mutations usually appear at low frequency when a tumor (for example, from an FFPE sample) is sequenced. For example, an SNV may be seen in only 10% of the reads covering a given base. A variation to be classified as somatic or germline by a variant classifier is also referred to herein as a "genetic variation".

Threshold

As used herein, a threshold is a numeric or non-numeric value used as a cutoff for characterizing a sample, a nucleic acid, or a portion thereof (e.g., a read). A threshold may vary based on empirical analysis. A threshold can be compared with a measured or calculated value to determine whether the source producing that value should be classified in a particular way. Thresholds can be identified empirically or analytically. The choice of threshold depends on the level of confidence the user wishes to have in the classification. A threshold may be chosen for a particular purpose (e.g., to balance sensitivity and selectivity). As used herein, the term "threshold" indicates a point at which the course of an analysis may change and/or a point at which an action may be triggered. A threshold need not be a predetermined number; it may instead be, for example, a function of several factors. A threshold can be adjusted to the circumstances, and it may indicate an upper limit, a lower limit, or a range between limits.

The general meanings of the terms used in the present invention are given above. These terms all carry their conventional meanings in the art and are restated here only to avoid ambiguity; they have no special meaning.

Detailed description of the invention

Quality assessment of nucleic acid sequencing data plays a very important role in downstream bioinformatics analysis. For example, when identifying gene mutations, if a base has a high quality value and differs from the corresponding base in the reference sequence, the position is called as a mutation; when the quality value of the base is low, the position is instead regarded as a sequencing error rather than a mutation. Prior-art methods, which take the single base as the basic unit and assign a quality value to each base, have a number of shortcomings. For example, many sequencing technologies are prone to insertion or deletion errors when they encounter longer homopolymers, such as reading TTTT as TTT or TTTTT. Because such an error occurs on the homopolymer as a whole, it is difficult to assess accurately the quality value of each individual base in the homopolymer. As a further example, some bases are not prone to substitution errors and should therefore be retained when identifying single-base substitutions, yet these bases often have low quality values because they are prone to insertion or deletion errors; they are then easily discarded during mutation identification, causing false negatives.

To overcome these shortcomings of the prior art, the present invention discloses a new method for assessing the quality of sequencing data that takes the polymer, rather than the base, as the basic unit of scoring. It is particularly suitable for sequencing reactions with an open 3' end and helps to obtain more accurate bioinformatics analysis results.

Specifically, the present invention provides a method for assessing the quality of nucleic acid sequencing data, characterized in that it comprises:

providing a nucleic acid sequence to be tested and, taking the polymer as the basic unit, calculating the sequencing signal features of each polymer;

predicting the quality score of each polymer from its sequencing signal features using a quantification scheme calibrated by training, wherein the quantification scheme calibrated by training comprises:

for a provided standard nucleic acid sequence, taking the polymer as the basic unit, calculating the sequencing signal features of each polymer and labeling each polymer as correctly or incorrectly sequenced according to the alignment result of the standard nucleic acid sequence; and training a classifier to fit the relationship between the sequencing signal features of the polymers and their labels.

In the present invention, the sequencing methods used to obtain the nucleic acid sequence include the dideoxynucleotide chain-termination method (Sanger sequencing), the chemical degradation method (Gilbert method), pyrosequencing, semiconductor sequencing, the cyclic reversible terminator method, fluorogenic sequencing, error-correction code sequencing, fuzzy sequencing, deletion sequencing (patent CN 2022101040373), combinatorial probe-anchor ligation, combinatorial probe-anchor polymerization, sequencing by oligonucleotide ligation and detection, sequencing-by-binding, single-molecule fluorescence sequencing, single-molecule real-time sequencing, nanopore sequencing, and the like.

According to a preferred embodiment, the sequencing method is selected from sequencing reactions with an open 3' end, including but not limited to pyrosequencing, semiconductor sequencing, fluorogenic sequencing, error-correction code sequencing, fuzzy sequencing, deletion sequencing and nanopore sequencing. In these technologies, one cycle of sequencing chemistry can extend by one or more nucleotides; that is, the reaction in each cycle proceeds polymer by polymer rather than base by base. Taking the polymer as the basic unit of quality assessment therefore reflects the actual sequencing process more faithfully and yields a more accurate quality assessment.

In the present invention, nucleic acid samples include deoxyribonucleic acid (DNA), ribonucleic acid (RNA), peptide nucleic acid (PNA), xylose nucleic acid (XNA), locked nucleic acid (LNA), and the like.

In some embodiments, the nucleic acid sample to be tested and the standard nucleic acid sample are labeled separately and then sequenced together using the same sequencing method, and the corresponding nucleic acid sequences are obtained for each. In some embodiments, the standard nucleic acid sample is sequenced first to obtain its nucleic acid sequence, and the sample to be tested is then sequenced with the same method to obtain its nucleic acid sequence. In some embodiments, the sample to be tested is sequenced first and the standard nucleic acid sample afterwards. The order in which the sample to be tested and the standard sample are sequenced can be interchanged; what matters is that both are sequenced and base-called with the same method.

According to a preferred embodiment, the input data are image data derived from the sequencing images generated by the sequencer during a sequencing run. For example, sequencing methods such as pyrosequencing, fluorogenic sequencing, error-correction code sequencing, the cyclic reversible terminator method and combinatorial probe-anchor ligation produce image data as input data.

In some embodiments, the input data are based on the pH change caused by the release of hydrogen ions during extension of the nucleotide substrate; the pH change is detected and converted into a voltage change proportional to the number of nucleotides incorporated. The input data generated by Ion Torrent semiconductor sequencing are of this kind.

In some embodiments, the input data are created by nanopore sensing, which uses a biosensor to measure the disruption of the current as an analyte passes through a nanopore or comes close to its opening, and thereby determines the identity of the bases. For example, Oxford Nanopore Technologies (ONT) sequencing is based on the following concept: single-stranded DNA (or RNA) is passed through a membrane via a nanopore while a voltage difference is applied across the membrane. The nucleotides present in the pore affect its electrical resistance, so current measurements over time can indicate the sequence of DNA bases passing through the pore. This current signal (called a "squiggle" because of its appearance when plotted) is the raw data collected by the ONT sequencer. The measurements are stored as 16-bit integer data acquisition (DAC) values sampled at a frequency of, for example, 4 kHz. At a DNA strand speed of about 450 base pairs per second, this gives on average about nine raw observations per base. The signal is then processed to identify the breaks in the open-pore signal that correspond to individual reads. The primary use of these raw signals is base calling, i.e., the process of converting DAC values into a DNA base sequence. In some implementations, the input data comprise normalized or scaled DAC values.
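The figure of about nine raw observations per base follows directly from the two example values given above (a 4 kHz sampling frequency and a strand speed of about 450 bases per second):

```latex
\[
\frac{4000\ \text{samples/s}}{450\ \text{bases/s}} \approx 8.9\ \text{samples per base}
\]
```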

In the present invention, polymers include homopolymers, binary copolymers, terpolymers, and the like. The measured nucleic acid sequence is divided into a number of polymers. In this division, the polymers may all be homopolymers, all binary copolymers or all terpolymers, or some may be homopolymers, some binary copolymers and some terpolymers. Each polymer may be 1 bp long or longer, for example 5 bp, 10 bp or more than 10 bp.

The division into polymers should correspond to the reagent delivery scheme used during sequencing. For example, if the sequencing method is MK degenerate sequencing, the nucleic acid sequence obtained may be MMKKKMKMMM, which is a degenerate sequence; the polymers should then be divided according to the actual extensions in MK sequencing, i.e., (MM)(KKK)(M)(K)(MMM). If the sequencing method is AB degenerate sequencing, the sequence obtained may be ABBBAAABB, and the polymers should be divided according to the actual extensions in AB sequencing, i.e., (A)(BBB)(AAA)(BB). If the sequencing method is 1x4, the sequence obtained may be ACCCTTGGATT, which is a definite nucleotide sequence; the polymers should then be divided according to the actual extensions in 1x4 sequencing, with the homopolymer as the basic unit, i.e., (A)(CCC)(TT)(GG)(A)(TT).
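The three divisions in the examples above can be reproduced by grouping consecutive symbols that belong to the same extension. The sketch below uses `itertools.groupby` from the Python standard library; the function name `partition` and the optional `key` argument are illustrative choices rather than anything defined in the patent.

```python
from itertools import groupby


def partition(seq, key=None):
    """Split a sequence into polymers: maximal runs of symbols with the same key.

    With key=None, runs of identical symbols are grouped (degenerate codes or
    homopolymers). With a key function, bases are grouped by reagent class.
    """
    return ["".join(run) for _, run in groupby(seq, key=key)]


mk_polymers = partition("MMKKKMKMMM")
ab_polymers = partition("ABBBAAABB")
homopolymers = partition("ACCCTTGGATT")

# Grouping a definite sequence by M/K class keeps the bases of each polymer.
MK_CLASS = {"A": "M", "C": "M", "G": "K", "T": "K"}
mk_from_bases = partition("ACCGTTA", key=MK_CLASS.get)
```

The last call shows how a definite sequence obtained from MK sequencing would be divided into binary copolymers such as ACC and GTT.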

In a preferred embodiment, the nucleic acid sequence is a definite nucleotide sequence, i.e., a sequence written with A, G, C and T, or with A, G, C and U.

In some embodiments, the nucleic acid sequence is a degenerate sequence, also called a fuzzy sequence, i.e., a sequence containing uncertain information, written with degenerate bases such as M, K, R, Y, W, S, B, D, H and V. It will be understood that part of such a sequence may nevertheless be definite.

In the present invention, a standard nucleic acid sample is a nucleic acid sample whose source and sequence are both known and which is highly homozygous at almost all sites in the genome. One example is the lambda phage DNA produced by New England Biolabs. Because the reference sequence of the standard nucleic acid is known, the nucleic acid sequence obtained by sequencing can be aligned to the corresponding reference sequence and each polymer labeled as correct or incorrect. For example, suppose a sequenced nucleic acid sequence is ATTGGCCAAAT and it is divided into three binary copolymers, (ATT)(GGCC)(AAAT). If the reference sequence is ATTGGCCAAAA, the first and second polymers are labeled as correct and the third polymer is labeled as incorrect.
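The labeling example above can be sketched as follows, under the simplifying assumption that the read and the reference are already aligned position by position without gaps. In practice the labels would come from the output of an alignment tool; the function name is hypothetical.

```python
def label_polymers(polymers, reference):
    """Label each polymer True (correct) or False (incorrect).

    Assumes a gap-free, position-by-position alignment of the read to the
    reference, which is a simplification made for this illustration.
    """
    labels, pos = [], 0
    for polymer in polymers:
        labels.append(reference[pos:pos + len(polymer)] == polymer)
        pos += len(polymer)
    return labels


labels = label_polymers(["ATT", "GGCC", "AAAT"], "ATTGGCCAAAA")
```

Note that a single wrong base makes the whole polymer incorrect, which is exactly the sense in which the polymer, not the base, is the unit of scoring.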

In a specific embodiment, the nucleic acid sequence obtained by sequencing is aligned to its corresponding reference nucleic acid sequence to obtain an alignment result, and the polymers are then labeled as correctly or incorrectly sequenced according to that result. Preferably, high-quality alignments are first selected from the alignment result, and only the polymers in those high-quality alignments are labeled as correctly or incorrectly sequenced; sequences that cannot be resolved (i.e., bases that cannot be aligned to the reference sequence or that have low alignment quality) are ignored. According to the alignment result, polymers whose alignment result is "match" are labeled "correctly sequenced", and polymers whose alignment result is "mismatch", "insertion" or "deletion" are labeled "incorrectly sequenced". The quality range that defines a high-quality alignment in the present invention must be chosen according to the alignment software or algorithm used. For example, when BWA is used for sequence alignment, high-quality alignments are those with an alignment quality greater than 0, or greater than or equal to 10, 20, 30, 40, 50 or 60.
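The labeling rule in this paragraph can be written as a small function. The operation names mirror the four alignment results named in the text; the default cutoff of 20 is just one of the listed options, chosen here for illustration, and the function and parameter names are hypothetical.

```python
ERROR_OPS = {"mismatch", "insertion", "deletion"}


def label_from_alignment(op, alignment_quality, min_quality=20):
    """Return "correct", "error", or None when the alignment is not trusted.

    op: alignment result for the polymer ("match", "mismatch", "insertion",
        "deletion"). min_quality=20 is an illustrative choice of cutoff.
    """
    if alignment_quality < min_quality:
        return None  # low-quality alignments are ignored, not labeled
    if op == "match":
        return "correct"
    if op in ERROR_OPS:
        return "error"
    raise ValueError(f"unknown alignment result: {op!r}")


label_match = label_from_alignment("match", 60)
label_indel = label_from_alignment("insertion", 60)
label_ignored = label_from_alignment("mismatch", 5)
```

Returning `None` for low-quality alignments keeps ambiguous polymers out of the training set instead of counting them as errors.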

According to a preferred embodiment, the sequencing signal features of a polymer are the features of the signal generated when that polymer undergoes the sequencing chemical reaction during sequencing. Figure 1 gives examples of sequencing signal features, which include but are not limited to:

the type of the base, i.e., whether the base is A, G, C or T (or U);

the position of the base in the sequence, i.e., the ordinal position of the base in its nucleotide sequence; for example, in single-end sequencing, bases near the start of the read usually have higher sequencing quality values than bases near the end;

the length of the polymer in which the base lies, i.e., the number of bases in the homopolymer or degenerate polymer containing the base; usually, the shorter the polymer, the higher the sequencing quality value;

the position of the base within its polymer, i.e., the distance between the base and the nearest end of the homopolymer or degenerate polymer containing it;

the round in which the base undergoes the sequencing chemical reaction, i.e., the cycle number at which the base is incorporated into the nucleotide chain; usually, the smaller the cycle number, the higher the quality value;

the signal intensity, which may be the intensity of the signal acquired directly by the sequencer (including brightness, voltage level or current level), a normalized signal, or a dephasing-corrected signal;

how close the signal intensity (and the intensities of neighboring signals) is to an integer, i.e., the difference between the normalized, dephasing-corrected or error-corrected signal and the nearest integer; usually, the smaller the difference, the higher the accuracy;

the parameters of the sequencing signal, i.e., the unit signal, background signal, lead coefficient, lag coefficient, decay coefficient, and so on;

the degree of dephasing when the base is sequenced; usually, the lower the degree of dephasing, the higher the accuracy; and so on.
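A few of the features listed above can be computed as in the following sketch. It assumes a simple normalization in which the background is subtracted and the result is divided by the unit signal; that normalization, the choice of features and the dictionary layout are illustrative assumptions, not the patent's definition.

```python
def polymer_features(polymer, cycle, signal, unit_signal, background=0.0):
    """Collect a small, illustrative feature set for one polymer."""
    # Assumed normalization: remove background, express in units of one base.
    normalized = (signal - background) / unit_signal
    return {
        "length": len(polymer),              # length of the polymer
        "cycle": cycle,                      # round in which it was extended
        "normalized_signal": normalized,     # signal in units of one base
        # closeness of the normalized signal to the nearest integer
        "integer_deviation": abs(normalized - round(normalized)),
    }


features = polymer_features("KKK", cycle=2, signal=805.0,
                            unit_signal=250.0, background=40.0)
```

Here the normalized signal is 3.06 for a polymer of length 3, so the deviation from the nearest integer is small, which would usually indicate a reliable call.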

According to a preferred embodiment, one or more sequencing signal features of the polymer are calculated; for example, a single sequencing signal feature, two sequencing signal features, or more may be selected for the calculation.

Further, after each polymer of the standard nucleic acid sequence has been labeled, a classifier is trained to fit the relationship between the sequencing signal features of the polymers and their labels. Classifiers are a conventional concept in the field of pattern recognition and include, but are not limited to, linear regression, polynomial regression, logistic regression, support vector machines, artificial neural networks, random forests, the Phred algorithm and ensemble learning. As the field of pattern recognition has developed, a variety of novel classifier algorithms have been proposed in recent years; using a novel classifier algorithm does not change the essence of the present invention.

Specifically, a classifier is trained to fit the relationship between the sequencing signal features of each polymer and its label. The classifier may divide the polymers into several classes according to their sequencing signal features and compute the accuracy of each class. For example, polymers of length 1, 2, 3, 4, 5 and more than 5 may each be placed in a separate class, or polymers whose unit signal lies in the ranges 100-199, 200-299, 300-399 and >400 may each be placed in a separate class. When several sequencing signal features are used, the division can be made orthogonally: for example, polymers of length 1 with a unit signal of 100-199 form one class, polymers of length 1 with a unit signal of 200-299 form another class, and so on.
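The orthogonal division described above can be sketched as a table of per-class accuracies. The bin edges follow the example in the text; treating a unit signal of exactly 400 as part of the top class and placing values below 100 in their own class are assumptions of this sketch, as are all names and the toy records.

```python
from collections import defaultdict


def polymer_class(length, unit_signal):
    """Class key: (length bin, unit-signal bin).

    Lengths 1..5 keep their value and longer polymers share bin 6.
    Unit-signal bins are hundreds (1 = 100-199, 2 = 200-299, 3 = 300-399),
    with 4 for 400 and above (assumption) and 0 for values below 100.
    """
    return (min(length, 6), min(int(unit_signal // 100), 4))


def class_accuracy(records):
    """records: iterable of (length, unit_signal, sequenced_correctly)."""
    counts = defaultdict(lambda: [0, 0])  # class -> [correct, total]
    for length, unit_signal, correct in records:
        entry = counts[polymer_class(length, unit_signal)]
        entry[0] += int(correct)
        entry[1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


records = [(1, 150, True), (1, 180, True), (1, 120, False),
           (1, 250, True), (7, 450, False), (9, 900, False)]
table = class_accuracy(records)
```

The resulting table plays the role of a very simple trained classifier: a new polymer is assigned to its class and inherits that class's empirical accuracy.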

In a preferred embodiment, once the fitting is complete, the fitted output of the classifier is converted into a quality score. There is an extensive literature on how to convert classifier predictions into quality values. Taking the well-known softmax algorithm as an example, let the output of a classifier be (a,b), where (1,0) denotes correct and (0,1) denotes incorrect. Because of factors such as the precision of classifier training or numerical error during prediction, the output of the classifier at prediction time is not always exactly (1,0) or (0,1), but rather a value such as (0.9,0.05) or (0.1,0.99) that is close to (1,0) or (0,1). The softmax algorithm then converts the output (a,b) into a probability of being correct using the following formula:

[Formula image BDA0004142955640000141: softmax conversion of (a, b) into an accuracy; image not reproduced in this text]
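The formula image is not available in this text. For reference, the standard two-class softmax maps an output (a, b) to exp(a) / (exp(a) + exp(b)); whether the original figure uses exactly this notation is an assumption. A minimal sketch, with a hypothetical function name:

```python
import math

def softmax_accuracy(a, b):
    """Standard two-class softmax: probability assigned to the 'correct' output a.
    Subtracting the maximum first is a common numerical-stability step."""
    m = max(a, b)
    exp_a = math.exp(a - m)
    exp_b = math.exp(b - m)
    return exp_a / (exp_a + exp_b)

# Example outputs taken from the text: one close to (1, 0), one close to (0, 1).
p_good = softmax_accuracy(0.9, 0.05)
p_bad = softmax_accuracy(0.1, 0.99)
```

An output close to (1, 0) yields an accuracy above 0.5, and an output close to (0, 1) yields an accuracy below 0.5.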

With the development of the pattern recognition field, a variety of new conversion algorithms have been proposed in recent years, including Sparse-softmax, log-softmax, Taylor softmax, log-Taylor softmax, soft-margin softmax, and SM-Taylor softmax. Using such an alternative conversion algorithm does not change the substance of the present invention.

For a nucleic acid sample to be tested, the quality score of each polymer is predicted from the sequencing signal features of the nucleic acid to be tested, using the above method of converting the classifier output into a quality score.

In some embodiments, the quality score is a numerical value characterizing sequencing accuracy, selected from the base-calling accuracy, the base-calling error rate, the Phred value, and the like. It will be understood that the quality score is merely one representation of base sequencing accuracy; the particular representation is not important and does not affect the substance of the present invention, and the quality score may also be expressed directly as the accuracy.

In some embodiments, the quality score is logarithmically based on the base-calling error probability, and the plurality of quality scores includes Q10, Q15, Q20, Q25, Q30, Q35, Q40, Q45, Q50, Q55, and Q60.
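A logarithmic quality score of this kind is conventionally the Phred score, Q = -10 log10(p), where p is the error probability, so Q20 corresponds to p = 0.01 and Q30 to p = 0.001. The sketch below shows the conversion and a rounding onto the listed levels; the rounding rule (largest listed level not exceeding the score) and the function names are assumptions, since the text does not specify them.

```python
import math

LEVELS = (10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60)

def phred_from_error(p_error, q_max=60.0):
    """Phred score Q = -10 * log10(p_error), capped at q_max (cap is an assumption)."""
    if p_error <= 0:
        return q_max
    return min(q_max, -10.0 * math.log10(p_error))

def error_from_phred(q):
    """Inverse conversion: p_error = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)

def quantize(q, levels=LEVELS):
    """Largest listed level not exceeding q; scores below the lowest level map to it."""
    eligible = [level for level in levels if level <= q]
    return max(eligible) if eligible else levels[0]
```

For example, an accuracy of 0.999 (error probability 0.001) gives a score of about 30, which `quantize` reports as the Q30 level.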

In some preferred embodiments, the trained and calibrated quantization scheme can be prepared in advance: the trained classifier is stored in the system as a configuration file and is simply loaded when quality scoring is performed on the nucleic acid sequence to be tested.
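As one way to picture the configuration-file approach, a class-to-accuracy table can be written to disk once and read back at scoring time. The JSON layout, the key format, and the function names below are hypothetical; the text does not specify a file format.

```python
import json
import os
import tempfile

def save_scheme(path, class_accuracy):
    """Store a trained class -> accuracy table as a JSON configuration file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"version": 1, "class_accuracy": class_accuracy}, handle, indent=2)

def load_scheme(path):
    """Load the table back when scoring a sample to be tested."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)["class_accuracy"]

def predict_accuracy(scheme, polymer_class, default=None):
    """Look up the calibrated accuracy of a polymer class; unknown classes give default."""
    return scheme.get(polymer_class, default)

with tempfile.TemporaryDirectory() as workdir:
    config_path = os.path.join(workdir, "scheme.json")
    save_scheme(config_path, {"len=1": 0.999, "len=2": 0.995})
    scheme = load_scheme(config_path)
```

Keeping training and scoring decoupled this way means the standard sample only has to be sequenced when the scheme needs recalibration.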

In some preferred embodiments, the standard nucleic acid sample and the nucleic acid sample to be tested can carry different molecular tags and be mixed and sequenced together. After sequencing, the two samples are first separated by their molecular tags; the training and calibration of the quantization scheme is then completed to obtain a trained classifier, which is applied to the nucleic acid sample to be tested.

The present invention also discloses a method according to any of the preceding embodiments, wherein a reaction solution containing nucleotide substrate molecules is used for sequencing. The nucleotide substrate molecules are any one, two, or three of the A, G, C, and T nucleotide substrate molecules, or any one, two, or three of the A, G, C, and U nucleotide substrate molecules.

Disclosed herein is a sequencing method according to any of the preceding embodiments that uses fluorophore-labeled nucleotide substrate molecules, wherein each sequencing round uses one set of reaction solutions, each set comprises at least two reaction solutions, and each reaction solution contains at least one of the A, G, C, and T nucleotide substrate molecules, or at least one of the A, G, C, and U nucleotide substrate molecules. In one aspect, the method comprises immobilizing the nucleotide sequence fragment to be tested, introducing one reaction solution from a set, and recording the fluorescence information. In one aspect, the method comprises introducing one reaction solution at a time and then successively introducing another reaction solution of the same set. In one aspect, at least one reaction solution in the set contains two or three kinds of nucleotide molecules.

Disclosed herein is a sequencing method using fluorophore-labeled nucleotide substrate molecules, wherein each sequencing round uses one set of reaction solutions, each set comprises two reaction solutions, and each reaction solution contains two nucleotides with different bases. In one aspect, the nucleotides in one reaction solution are complementary to two of the bases on the nucleotide sequence to be tested, and the nucleotides in the other reaction solution are complementary to the other two bases on the nucleic acid sequence to be tested. In one aspect, the method comprises immobilizing the nucleotide sequence fragment to be tested and introducing the first reaction solution of a set; the second reaction solution of the same set is then added. The two reaction solutions can be added successively in alternation, so that the encoded information of the nucleotides to be tested is obtained from the fluorescence information. When the two nucleotides in each reaction solution carry different fluorescent labels (two-color 2+2 sequencing), two orthogonal rounds of two-color 2+2 sequencing are performed; when the two nucleotides in each reaction solution carry the same fluorescent label (single-color 2+2 sequencing), three orthogonal rounds of single-color 2+2 sequencing are preferably performed. Optionally, one reaction solution of a set contains three kinds of nucleotides and the other contains the remaining, different nucleotide; in this case, four orthogonal rounds of sequencing (1+3 sequencing) can be performed in alternation.

According to a preferred embodiment, the method of the present invention further comprises performing bioinformatics analysis on the sequenced nucleic acid sequence according to the quality scores of the polymers.

According to a preferred embodiment, the bioinformatics analysis comprises performing two or more orthogonal rounds of degenerate sequencing, obtaining a quality value for each degenerate polymer length, and using these quality values for error-correcting-code decoding (error correction). For example, three orthogonal rounds of degenerate sequencing (MK, RY, WS) are performed on the nucleic acid sequence to be tested, and the resulting nucleic acid sequence is ACCGTTTGC. Taking the polymer CC as an example: in the MK round the degenerate polymer containing it is ACC, in the RY round it is CC, and in the WS round it is CCG. The features of the polymer CC can be computed in each of these degenerate polymers (for example the degenerate polymer lengths, which are 3, 2, and 3 respectively), a quality score can be computed for each, and error correction (error-correcting-code decoding) is performed according to the quality scores to determine the final nucleotide sequence.
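The worked example can be reproduced by splitting the sequence into maximal runs of the same degenerate base, using the standard IUPAC codes (M = A/C, K = G/T, R = A/G, Y = C/T, W = A/T, S = C/G). This is an illustrative sketch; the function names and the returned data layout are assumptions.

```python
from itertools import groupby

# Standard IUPAC two-base ambiguity codes for the three rounds.
ROUNDS = {
    "MK": {"A": "M", "C": "M", "G": "K", "T": "K"},
    "RY": {"A": "R", "G": "R", "C": "Y", "T": "Y"},
    "WS": {"A": "W", "T": "W", "C": "S", "G": "S"},
}

def degenerate_polymers(seq, round_name):
    """Split seq into maximal runs sharing one degenerate base; returns [(start, run)]."""
    code = ROUNDS[round_name]
    runs = []
    position = 0
    for _, group in groupby(seq, key=code.__getitem__):
        run = "".join(group)
        runs.append((position, run))
        position += len(run)
    return runs

def containing_polymer(seq, round_name, index):
    """Return the degenerate polymer that contains the base at the given index."""
    for start, run in degenerate_polymers(seq, round_name):
        if start <= index < start + len(run):
            return run
    raise IndexError(index)

seq = "ACCGTTTGC"
# Index 1 is the first C of the polymer CC.
context = {name: containing_polymer(seq, name, 1) for name in ROUNDS}
```

For the sequence in the text, `context` holds ACC for MK, CC for RY, and CCG for WS, so the degenerate polymer lengths are 3, 2, and 3, as stated.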

The principle of ECC decoding is illustrated using 2+2 degenerate sequencing. ECC sequencing consists of three rounds of degenerate sequencing (MK, RY, and WS); each round yields a degenerate sequence that carries half of the information of the DNA to be tested. By intersecting the three degenerate bases at the same position of the three degenerate sequences, the exact base at that position of the DNA to be tested can be obtained. There are 8 possible intersections in total:

[Figure BDA0004142955640000161: table of the 8 intersection cases; image not reproduced in this text]

Of these 8 intersections, 4 are legal and yield the four bases; the other 4 are illegal, and their intersection is the empty set. In the ideal case without sequencing errors, every intersection of the three degenerate sequences is legal and the sequence of the DNA to be tested is obtained. When the DPL (degenerate polymer length) computed by the signal processing algorithm contains errors, illegal intersections appear near the sequencing errors. An illegal intersection therefore indicates the presence of a sequencing error. To correct sequencing errors, the probabilities of different sequencing error patterns are first estimated from the sequencing data of a standard sample. Then, based on the maximum likelihood principle, the ECC decoding algorithm attempts to revise the DPL computed by the signal processing algorithm so as to meet two goals: 1. the revised DPL makes every intersection of the three degenerate sequences legal; 2. subject to the first constraint, the revision has the highest probability of occurrence according to the probabilities of the different sequencing error patterns. An optimization method can be used to achieve these two goals.
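The 8 intersection cases follow directly from the IUPAC definitions of the six degenerate bases, and can be enumerated as a check. This sketch only covers the legality test; it does not implement the maximum-likelihood correction step, and the variable names are illustrative.

```python
from itertools import product

# Standard IUPAC two-base ambiguity codes.
IUPAC = {
    "M": {"A", "C"}, "K": {"G", "T"},
    "R": {"A", "G"}, "Y": {"C", "T"},
    "W": {"A", "T"}, "S": {"C", "G"},
}

def intersect(mk, ry, ws):
    """Bases compatible with one degenerate call from each of the three rounds."""
    return IUPAC[mk] & IUPAC[ry] & IUPAC[ws]

# All 2 x 2 x 2 = 8 combinations of (MK call, RY call, WS call).
table = {combo: intersect(*combo) for combo in product("MK", "RY", "WS")}
legal = {combo: next(iter(bases)) for combo, bases in table.items() if bases}
illegal = [combo for combo, bases in table.items() if not bases]

for combo in sorted(table):
    print("".join(combo), "->", legal.get(combo, "empty set"))
```

The four legal combinations are M/R/W (A), M/Y/S (C), K/R/S (G), and K/Y/W (T); the remaining four combinations have an empty intersection and therefore signal an error in at least one round.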

In some embodiments, the bioinformatics analysis comprises screening for high-quality nucleic acid sequences according to the assigned quality values. Screening methods include, but are not limited to: selecting nucleic acid sequences whose quality values are all above or below a threshold; selecting nucleic acid sequences whose mean quality value is above or below a threshold; selecting regions of a nucleic acid sequence in which all quality values are above or below a threshold; and selecting regions of a nucleic acid sequence in which the mean quality value is above or below a threshold. The high-quality nucleic acid sequences obtained by screening can be used to detect genetic variation, measure gene expression, detect RNA alternative splicing, detect gene modification status, identify the species or individual from which a nucleic acid originates, determine the three-dimensional structure of the genome, detect nucleic acid-nucleic acid interactions, detect nucleic acid-protein interactions, assess chromatin accessibility, resolve RNA structure, and so on.
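Three of the screening rules above can be sketched on a plain list of per-base (or per-polymer) quality values. The function names, the use of "greater than or equal to" at the threshold, and the `min_len` parameter are assumptions made for illustration.

```python
def all_above(quals, threshold):
    """True if every quality value is at or above the threshold."""
    return all(q >= threshold for q in quals)

def mean_above(quals, threshold):
    """True if the mean quality value is at or above the threshold."""
    return bool(quals) and sum(quals) / len(quals) >= threshold

def high_quality_regions(quals, threshold, min_len=1):
    """Half-open (start, end) intervals in which every value is at or above the threshold."""
    regions = []
    start = None
    for i, q in enumerate(quals):
        if q >= threshold and start is None:
            start = i                      # a region opens here
        elif q < threshold and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None                   # the region closes before i
    if start is not None and len(quals) - start >= min_len:
        regions.append((start, len(quals)))
    return regions

quals = [30, 35, 10, 40, 40, 40]
```

With these values, the read fails the all-above-20 rule because of the single low value, passes the mean-above-30 rule, and contains two high-quality regions separated by the low-quality position.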

In some embodiments, the bioinformatics analysis comprises aligning the nucleic acid sequence to a reference sequence according to the assigned quality values. Alignment is a standard concept in bioinformatics and can be performed with software or algorithms such as BWA, the Smith-Waterman algorithm, Bowtie, SOAP, the Needleman-Wunsch algorithm, Bowtie2, BLAST, ELAND, TMAP, MAQ, minimap2, and SHRiMP. Ways of using quality values in alignment include, but are not limited to:

1. Selecting subsequences with high quality values as seeds for the initial placement of the sequence;

2. When several alignments are possible, preferring those in which polymers/bases with low quality values form the mismatched part of the alignment.

In some embodiments, the bioinformatics analysis comprises identifying genetic variation according to the alignment result and the quality values assigned to the aligned sequences. Genetic variation is a standard concept in biology and includes, but is not limited to, single nucleotide polymorphisms, copy number variation, epigenetic variation, and large-scale structural variation. Ways of using quality values in identifying genetic variation include, but are not limited to:

1. For a genomic site at which variation is to be identified, selecting, among the bases of all sequences aligned to that site, those bases whose polymer/base quality value is high, and using them for identification.

2. Stating the null hypothesis that no genetic variation exists at the site. The probability that the null hypothesis holds is calculated from the quality values and the alignment result; if this probability is greater than the given significance level, the null hypothesis is accepted; otherwise the null hypothesis is rejected and genetic variation is considered to exist at the site.
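One possible reading of the hypothesis test in item 2 is a p-value: under the null hypothesis every non-reference base at the site is a sequencing error, each occurring independently with the error probability implied by its quality value. The sketch below computes the probability of seeing at least the observed number of mismatches under that assumption. It is a simplification chosen for illustration (it ignores which alternative base was observed and assumes Phred-scaled quality values), not the method specified by the text.

```python
def error_prob(q):
    """Error probability implied by a Phred-scaled quality value (assumed scale)."""
    return 10.0 ** (-q / 10.0)

def prob_at_least_k_errors(quals, k):
    """P(at least k sequencing errors among the reads), errors assumed independent."""
    dist = [1.0]  # dist[j] = probability of exactly j errors among reads seen so far
    for q in quals:
        p = error_prob(q)
        updated = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            updated[j] += mass * (1.0 - p)   # this read is correct
            updated[j + 1] += mass * p       # this read is an error
        dist = updated
    return sum(dist[k:])

def has_variation(quals, n_mismatch, alpha=0.01):
    """Reject the null hypothesis (no variation) when the probability is not above alpha."""
    return prob_at_least_k_errors(quals, n_mismatch) <= alpha
```

For example, five mismatches among ten reads of quality 30 are very unlikely to all be sequencing errors, so the null hypothesis would be rejected; zero mismatches always leave it accepted.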

In some embodiments, when genetic variation is identified, certain features of the alignment result can be used to remove potential false positives or false negatives. These are routine operations in bioinformatics, and adding them does not affect the substance of the present invention. Such features include, but are not limited to:

1. The variation occurs predominantly in sequences aligned to one strand (forward or reverse) and rarely in sequences aligned to the other strand;

2. The variation occurs predominantly at the two ends of the sequences and rarely in the middle;

3. When paired-end sequencing is used, read1 mainly detects G to T at the site while read2 mainly detects C to A, or read1 mainly detects C to T while read2 mainly detects G to A;

4. Other, different genetic variations occur frequently near the variation.

In some embodiments, the bioinformatics analysis comprises assembling nucleic acid sequences into longer nucleic acid sequences according to the assigned quality values.

According to a preferred embodiment, the trained classifier may be a probability distribution model based on maximum likelihood. The probability distribution is one with a unimodal shape, including but not limited to the two-point, binomial, negative binomial, Poisson, geometric, exponential, normal, gamma, chi-square, t, F, beta, and log-normal distributions, as well as high-dimensional extensions of these distributions. In this model, the expectation or peak of the probability distribution is related to the sequencing signal features of the polymer, and the variance or skewness is related to the quality value of the polymer. As is known from basic statistics, each such probability distribution is fully determined by a set of parameters; for example, the normal distribution is fully determined by its mean and standard deviation. After the sequences obtained by sequencing have been aligned to the reference genome, the correspondence between the sequencing signal features of each polymer and the polymer length (denoted n) can be determined from the alignment result. Once the probability distribution is fully determined by given parameters, its integral over the interval from n-0.5 to n+0.5 represents the probability that the sequencing signal features correspond to polymer length n. The likelihood function is the product of these probabilities, computed for a set of polymers after the probability distribution has been determined by given parameters. Maximum likelihood means finding the set of parameters for which the resulting likelihood function is largest.

The maximum-likelihood-based probability distribution model may divide the polymer sequencing signal features into several groups and apply the model to each group separately.

The likelihood function may be mathematically transformed for ease of computation, for example by taking the logarithm, which turns the product of probabilities into a sum of log-probabilities.
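As a concrete instance of the model above, take a normal distribution whose mean is the normalized signal of a polymer and whose standard deviation sigma is the single parameter to fit. The probability of length n is the mass on [n-0.5, n+0.5], the log-likelihood is the sum of log-probabilities over polymers with known lengths, and sigma is chosen by a simple search over candidate values. The choice of the normal distribution, the grid search, and all names are assumptions made for illustration.

```python
import math

def normal_cdf(x, mean, sigma):
    """Cumulative distribution function of N(mean, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def prob_length(n, signal, sigma):
    """Mass of N(signal, sigma^2) on the interval [n - 0.5, n + 0.5]."""
    return normal_cdf(n + 0.5, signal, sigma) - normal_cdf(n - 0.5, signal, sigma)

def log_likelihood(observations, sigma):
    """observations: (normalized signal, polymer length n from the alignment) pairs.
    Sum of log-probabilities, with a floor to avoid log(0)."""
    return sum(math.log(max(prob_length(n, signal, sigma), 1e-300))
               for signal, n in observations)

def fit_sigma(observations, candidates):
    """Maximum-likelihood choice of sigma among the candidate values (grid search)."""
    return max(candidates, key=lambda sigma: log_likelihood(observations, sigma))
```

When the signals sit close to their true integer lengths, a small sigma maximizes the likelihood; a signal far from its true length pulls the fitted sigma upward, which in turn lowers the probability, and hence the quality value, assigned to each call.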

The present invention also provides a method for quality assessment of nucleic acid sequencing data, comprising: performing fuzzy sequencing or deletion sequencing on a nucleic acid sample to be tested to obtain input data, generating the degenerate polymer length information of the input data, and computing the sequencing signal features of the degenerate polymer lengths;

predicting the quality score of the polymer based on the sequencing signal features, using a trained and calibrated quantization scheme;

wherein the trained and calibrated quantization scheme comprises:

sequencing a standard nucleic acid sample to obtain a nucleic acid sequence, computing the sequencing signal features of the nucleic acid sequence, aligning the nucleic acid sequence to a reference sequence, and labeling each polymer of the nucleic acid sequence as correctly or incorrectly sequenced; and training a classifier to fit the relationship between the sequencing signal features of the polymers and their labels.

According to a preferred embodiment, the sequencing signal features of the degenerate polymer length include: the length of the polymer containing the base, that is, the number of bases in the degenerate polymer in which the base lies (in general, the shorter the polymer, the higher the sequencing quality value); the position of the base within its polymer, that is, the distance from the base to the nearest end of the degenerate polymer in which it lies; and so on.
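The two features named above can be computed per base from a degenerate sequence by finding each run of identical degenerate bases. In this sketch a base at either end of its polymer is given distance 0; that convention and the function name are assumptions.

```python
from itertools import groupby

def base_features(degenerate_seq):
    """For each base, return (length of its degenerate polymer,
    distance to the nearest end of that polymer, counting end bases as 0)."""
    features = []
    for _, group in groupby(degenerate_seq):
        run_length = len(list(group))
        for offset in range(run_length):
            features.append((run_length, min(offset, run_length - 1 - offset)))
    return features

# MK encoding of ACCGTTTGC: A, C -> M and G, T -> K.
features = base_features("MMMKKKKKM")
```

Here the middle base of the five-base K polymer gets the features (5, 2), while the final isolated M gets (1, 0).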

Specifically, a classifier is trained to fit the relationship between the sequencing signal features of each polymer and its label; based on these features, the classifier can divide the polymers into several classes and compute the accuracy of each class. For example, polymers of length 1, 2, 3, 4, 5, and greater than 5 can each be assigned to a separate class. When several sequencing signal features are used, orthogonal partitioning can be applied.

In a preferred embodiment, after fitting is complete, the output of the classifier is converted into a quality score. Many published methods describe how to convert classifier predictions into quality values.

The above are merely preferred embodiments of the present invention, and the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed herein shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of the claims.

Claims (11)

1. A method for quality assessment of nucleic acid sequencing data, comprising:
providing a nucleic acid sequence to be tested, taking the polymer as the basic unit, and calculating the sequencing signal features of each polymer; predicting a quality score of the polymer based on the sequencing signal features, using a trained and calibrated quantization scheme;
wherein the trained and calibrated quantization scheme comprises:
for a provided standard nucleic acid sequence, taking the polymer as the basic unit, calculating the sequencing signal features of each polymer, and labeling each polymer as correctly or incorrectly sequenced according to the alignment result of the standard nucleic acid sequence; and training a classifier to fit the relationship between the sequencing signal features of the polymers and their labels.
2. The method of claim 1, wherein the sequencing signal features of a polymer are features of the signal generated when the polymer undergoes the sequencing chemical reaction during sequencing, including, but not limited to, the base types constituting the polymer, the length of the polymer, the number of rounds of the sequencing chemical reaction, the signal intensity, how close the signal intensity is to an integer, parameters of the sequencing signal, and the degree of dephasing when the polymer is detected.
3. The method of claim 1 or 2, wherein the polymer comprises a homopolymer, a binary polymer, a ternary polymer, or the like.
4. The method of any one of claims 1-3, wherein the trained classifier is a probability distribution model based on maximum likelihood.
5. The method of claim 4, wherein training the classifier comprises dividing the polymers into a plurality of classes based on their sequencing signal features and computing the sequencing accuracy of each class of polymers.
6. The method of claim 5, wherein the classifier includes, but is not limited to, linear regression, polynomial regression, logistic regression, a support vector machine, an artificial neural network, a random forest, the Phred algorithm, and ensemble learning.
7. The method of any one of claims 1-6, wherein the standard nucleic acid sequence is a sequence obtained by sequencing a standard nucleic acid sample; the standard nucleic acid sample is a nucleic acid sample whose source and sequence are both known and which is highly homozygous at almost all sites of the genome, including lambda phage DNA, E. coli DNA, and Saccharomyces cerevisiae DNA.
8. The method of claim 7, further comprising performing bioinformatics analysis on the nucleic acid sequence to be tested based on the quality score of the polymer.
9. The method of claim 8, wherein the bioinformatics analysis comprises identifying genetic variation based on the alignment result and the quality values assigned to the aligned sequences.
10. The method of claim 9, wherein the bioinformatics analysis comprises performing at least two orthogonal rounds of degenerate sequencing to obtain a quality value for each degenerate polymer length, and performing error correction using the quality values.
11. A method for quality assessment of nucleic acid sequencing data, comprising:
performing fuzzy sequencing or deletion sequencing on a nucleic acid sample to be tested to obtain input data, generating the degenerate polymer length information of the input data, and calculating the sequencing signal features of the degenerate polymer lengths; predicting a quality score of the polymer based on the sequencing signal features, using a trained and calibrated quantization scheme;
wherein the trained and calibrated quantization scheme comprises:
sequencing a standard nucleic acid sample to obtain a nucleic acid sequence, calculating the sequencing signal features of the nucleic acid sequence, aligning the nucleic acid sequence to a reference sequence, and labeling each polymer of the nucleic acid sequence as correctly or incorrectly sequenced; and training a classifier to fit the relationship between the sequencing signal features of the polymers and their labels.
CN202310295466.8A 2023-03-24 2023-03-24 A quality assessment method for nucleic acid sequencing data Pending CN116246703A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310295466.8A CN116246703A (en) 2023-03-24 2023-03-24 A quality assessment method for nucleic acid sequencing data

Publications (1)

Publication Number Publication Date
CN116246703A true CN116246703A (en) 2023-06-09

Family

ID=86633250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310295466.8A Pending CN116246703A (en) 2023-03-24 2023-03-24 A quality assessment method for nucleic acid sequencing data

Country Status (1)

Country Link
CN (1) CN116246703A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117594130A (en) * 2024-01-19 2024-02-23 北京普译生物科技有限公司 Nanopore sequencing signal evaluation method and device, electronic equipment and storage medium
CN119229971A (en) * 2024-11-29 2024-12-31 杭州无垠科技有限公司 A quality control algorithm for nucleic acid sequences

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210005284A1 (en) * 2019-07-03 2021-01-07 Bostongene Corporation Techniques for nucleic acid data quality control
CN112599189A (en) * 2020-12-29 2021-04-02 北京优迅医学检验实验室有限公司 Data quality evaluation method for whole genome sequencing and application thereof
CN114420214A (en) * 2022-01-28 2022-04-29 赛纳生物科技(北京)有限公司 Quality evaluation method and screening method of nucleic acid sequencing data
CN114561453A (en) * 2022-01-28 2022-05-31 赛纳生物科技(北京)有限公司 A method for qualitative or quantitative analysis of target samples by degenerate sequencing
US20230021577A1 (en) * 2021-07-23 2023-01-26 Illumina Software, Inc. Machine-learning model for recalibrating nucleotide-base calls
CN115831219A (en) * 2022-12-22 2023-03-21 郑州思昆生物工程有限公司 A quality prediction method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination