TWI852661B - Method for evaluating consensus base error rate and a system thereof - Google Patents

Publication number
TWI852661B
Authority
TW
Taiwan
Prior art keywords
base
observed
assumed
probability
expanded
Prior art date
Application number
TW112124699A
Other languages
Chinese (zh)
Other versions
TW202403774A (en)
Inventor
劉宗霖
鄭順林
Original Assignee
國立成功大學
Application filed by 國立成功大學
Publication of TW202403774A
Application granted
Publication of TWI852661B

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention discloses a method and a system for evaluating the consensus base error rate. The method integrates the polymerase chain reaction (PCR) error rate and the sequencing error rate to evaluate the error rate of each consensus base, from which a corresponding consensus quality score can be calculated. This improves the precision of current molecular barcode sequencing and, during mutation detection, removes the ambiguity in interpreting low-frequency mutations. These advantages will benefit the future development of precision medicine.

Description

Consensus base error rate evaluation method and system thereof

The present invention relates to the technical field of DNA sequencing data analysis, and more particularly to a method for evaluating the consensus base error rate.

Detecting low-frequency mutations could substantially advance precision medicine and personalized cancer diagnosis and treatment. In current practice, targeted therapy is guided by the detection of somatic variants, and tumor cells circulating in the blood allow non-invasive, low-cost DNA testing for tumor classification and disease monitoring. However, when the variant frequency is comparable to or even lower than the sequencing error rate, or the read error rate caused by other technical issues, current techniques face considerable challenges.

To address these issues, molecular barcodes (MBCs) or unique molecular identifiers (UMIs) have been widely adopted and are quite effective at reducing sequencing errors. In UMI sequencing, each DNA template is first tagged with a UMI, and the tagged template then undergoes polymerase chain reaction (PCR) amplification. The resulting amplicons are randomly sampled and sequenced. With sufficient sampling, taking the consensus sequence of the sequenced amplicons removes most sequencing errors.

However, PCR may also introduce a false base in an early amplification cycle, so that a considerable proportion of the final amplicon pool carries the false base. If only a few amplicons are sampled and sequenced, such a false base may dominate the UMI cluster, so that the consensus base is not the original base. Moreover, in practice the number of sampled amplicons is usually small; for example, 2 to 3 nucleic acid fragments carrying the same UMI are taken as the data source. Even before PCR errors are taken into account, such sparsely sampled UMI clusters cannot reliably exclude consensus base miscalls caused by sequencing errors, which further highlights the importance of handling PCR and sequencing errors together. Therefore, both PCR and sequencing errors must be considered simultaneously when analyzing UMI sequencing data.

Several informatics tools have been applied to variant detection in UMI data, such as DeepSNVMiner, smCounter, MAGERI, smCounter2, and UMI-VarCal; however, these tools handle PCR errors empirically. smCounter, for example, uses a joint model of PCR and sequencing error probabilities to estimate variant call quality, computing the PCR error probability with a hand-waving heuristic approximation and the sequencing error probability from the sequencing quality. MAGERI, on the other hand, processes only UMI clusters of size at least 5 and assumes that all sequencing errors have been removed from the consensus sequence. These conventional techniques not only discard most of the sequencing data but also cannot establish whether sequencing errors remain.

To handle PCR errors that may remain in consensus sequences, MAGERI assumes that the consensus base error rate follows a beta distribution and adopts a phenomenological model whose parameters are obtained by fitting UMI training data from known DNA templates. smCounter2 takes a similar approach and builds a second beta distribution model for UMI clusters of size less than 5, but this is still a phenomenological model and cannot accurately describe sequencing errors. Importantly, these phenomenological models cannot be used to estimate the quality of each UMI consensus base; UMI sequencing data therefore cannot be analyzed with other established variant callers, such as Mutect2, which typically require a quality value for every consensus base.

In addition, DeepSNVMiner, which takes a rule-based approach, reveals no information about PCR or sequencing errors, so the reliability and validity of its variant calls are low. Assessing the reliability and validity of variant calls with UMI-VarCal is likewise difficult, because UMI-VarCal is also rule-based. These limitations of the conventional techniques stem from the lack of a solid statistical framework describing the PCR and sequencing errors that occur in UMI sequencing.

Chinese patent CN112687339B, entitled "Method and device for counting sequence errors in plasma DNA fragment sequencing data", discloses a method and device that: read the true UMIs used during library construction and take them as a UMI reference sequence set; count, over all read cycles in the sequencing data, the number of UMI errors in the reads and the proportion of erroneous bases among the bases of the current cycle; and collect sequence error information for completely sequenced templates, identifying the 5'-end and 3'-end UMI sequences of each completely sequenced template to count the error rate in each amplification cycle. The method uses the per-cycle error rates to evaluate data quality and can supply corrected bases and base quality values to the subsequent cluster error-correction process, providing prior probabilities for cluster error correction and quality value correction and thereby improving the accuracy of low-frequency mutation detection. Although that patent discloses using the PCR amplification error rate together with the sequencing error rate as a reference for base correction, its definition of the error rate requires that the templates first be sequenced and that sequences then be extracted and aligned across different nucleic acid fragments. When random sampling yields an observed base group dominated by erroneous bases, the method cannot determine whether the consensus base is genuine; it therefore can hardly describe the consensus base error rate or assign a quality score to each consensus base. Moreover, the method is difficult to implement without UMI library construction and sequence alignment.

Because low-frequency mutation detection is a cornerstone of precision medicine and personalized diagnosis and treatment, resolving the consensus base miscalls that arise when the mutation frequency falls below the sequencing error rate will profoundly affect the future of the field. In the prior art, molecular barcoding is widely used to exclude sequencing errors: molecular barcodes with unique sequences are attached to different nucleic acid fragments, the barcoded DNA fragments are amplified by PCR, the amplicons are randomly sampled and sequenced, and the consensus sequence of reads sharing the same molecular barcode is used to exclude sequencing errors.

However, nucleic acid amplification is itself subject to a considerable level of PCR error, which requires additional handling. Existing computational tools can process molecular barcode data, but they usually assess PCR errors with rules of thumb or phenomenological models and cannot provide a quality score for each individual consensus base, which adds uncertainty and inconvenience to mutation detection. These limitations stem from the lack of a statistical model that describes the errors arising during PCR amplification while also accounting for sequencing errors.

Accordingly, to solve the problems of the prior art described above, the present invention discloses a statistical model for quantifying PCR and sequencing errors. The model evaluates the quality of molecular barcode consensus bases and assigns each consensus base a quality score that serves as a reference for assessing its reliability and validity. The model is based on the nature of the PCR process and incorporates sequencing quality information; it can be used to estimate the PCR error rate of each data set and to calculate a quality score for each consensus base.

Specifically, one object of the present invention is to provide a method for evaluating the consensus base error rate, comprising: selecting one or more amplified bases from an amplicon pool, wherein the amplicon pool is amplified from an unknown base $N$; sequencing the $j$-th selected amplified base as an observed base $s_j$, where $j$ is any positive integer, and assembling the observed bases $s_j$ into an observed base set $S$; calculating a consensus base $c(S)$ from the observed base set $S$; and calculating a consensus base error rate $P_{c(S)}$, which is the probability that the consensus base $c(S)$ is not the unknown base $N$, wherein the consensus base error rate $P_{c(S)}$ is the sum, over every assumed base $b$ that differs from the consensus base $c(S)$, of the probability that, for the randomly sampled observed base set $S$, the unknown base $N$ is identical to that assumed base $b$, the assumed base $b$ being selected from the group consisting of adenine (A), thymine (T), guanine (G), and cytosine (C).

In the method described above, the consensus base error rate $P_{c(S)}$ is calculated by the following formula: [formula not reproduced in the source text]; where $P(S \mid N=b)$ is the probability of randomly drawing the observed base set $S$ from the amplicon pool when the unknown base $N$ is identical to the assumed base $b$.
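The formula itself is missing from this text, so the following is only a minimal sketch of one way the definition could be evaluated. It assumes Bayes' rule with a uniform prior over the four bases, which the source does not state; the function name `consensus_error_rate` and the example likelihood values are hypothetical.

```python
BASES = ("A", "C", "G", "T")


def consensus_error_rate(likelihoods, consensus):
    """Posterior probability that the unknown base N differs from `consensus`.

    `likelihoods` maps each candidate base b to P(S | N=b).
    ASSUMPTION: uniform prior over A, C, G, T, so the prior cancels and
    P(N=b | S) is proportional to P(S | N=b).
    """
    total = sum(likelihoods[b] for b in BASES)
    if total <= 0:
        raise ValueError("at least one likelihood must be positive")
    # Sum the posterior mass of every assumed base b that is not c(S).
    wrong = sum(likelihoods[b] for b in BASES if b != consensus)
    return wrong / total


# Illustrative, made-up likelihoods for one observed base set S.
example = {"A": 0.90, "C": 0.01, "G": 0.01, "T": 0.08}
p_error = consensus_error_rate(example, "A")
```

With a non-uniform prior, each likelihood would be multiplied by its prior before normalizing.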

In the method described above, calculating $P(S \mid N=b)$ comprises: defining an erroneous base fraction $F_R$, which is the proportion of erroneous amplified bases among the amplified bases in the amplicon pool, an erroneous amplified base being any base different from the assumed base $b$; and calculating $P(S \mid N=b)$ according to the following formula:

$$P(S \mid N=b) = E\left[\prod_{j} P(s_j \mid N=b, F_R)\right]$$

where $\prod_{j} P(s_j \mid N=b, F_R)$ is the product, over every observed base $s_j$ in the observed base set $S$, of the probability $P(s_j \mid N=b, F_R)$ of that observed base when the unknown base $N$ is identical to the assumed base $b$ and the proportion of erroneous amplified bases in the amplicon pool is $F_R$, and the expectation $E[\cdot]$ is taken over the probability distribution of $F_R$.

In the method described above, when the observed base $s_j$ is identical to the assumed base $b$, $P(s_j \mid N=b, F_R)$ equals a correct base fraction $1 - F_R$; when the observed base $s_j$ differs from the assumed base $b$, $P(s_j \mid N=b, F_R)$ equals the erroneous base fraction $F_R$; the correct base fraction $1 - F_R$ and the erroneous base fraction $F_R$ sum to 1.

The method described above further comprises: correcting the per-base probability according to a sequencing accuracy rate $P_{seq1}$, a sequencing error rate $P_{seq2}$, and an amplified-base proportion. The corrected quantity is the probability that the observed base $s_j$ is identical to another observed base $b'$ when the unknown base $N$ is identical to the assumed base $b$ and an amplified base $b''$ different from the assumed base $b$ makes up the amplified-base proportion of the amplicon pool; when the amplified base $b''$ differs from the assumed base $b$, the amplified-base proportion equals the amplified base fraction. After correction, the probability satisfies the following formula: [formula not reproduced in the source text]; where, when the other observed base $b'$ is identical to the assumed base $b$, $P_{pcr1}$ is the probability that the assumed base $b$ is amplified into a base identical to $b'$, and $P_{pcr2}$ is the probability that the assumed base $b$ is amplified into the amplified base $b''$, with $b''$ different from $b$; when the other observed base $b'$ differs from the assumed base $b$, $P_{pcr3}$ is the probability that $b$ is amplified into a base identical to $b'$, $P_{pcr4}$ is the probability that $b$ is amplified into a base identical to $b$, and $P_{pcr5}$ is the probability that $b$ is amplified into the amplified base $b''$, with $b''$ different from both $b$ and $b'$; and where the sequencing accuracy rate $P_{seq1}$ is the probability that the other observed base $b'$ is sequenced correctly, and the sequencing error rate $P_{seq2}$ is the probability that the amplified base $b''$ is incorrectly sequenced as the other observed base $b'$.

The method described above further comprises calculating the sequencing accuracy rate $P_{seq1}$ and the sequencing error rate $P_{seq2}$ from the sequencing quality score $Q_i$ of the other observed base $b'$, wherein the sequencing accuracy rate $P_{seq1}$ can be calculated by the following formula: [formula not reproduced in the source text]; and the sequencing error rate $P_{seq2}$ can be calculated by the following formula: [formula not reproduced in the source text].
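Because both formulas are missing here, the sketch below only illustrates how such rates are commonly derived from a quality score. It assumes the standard Phred relation (error probability $10^{-Q/10}$) and, as a further assumption not taken from the source, that a sequencing error is equally likely to produce any of the three other bases. The function names are hypothetical.

```python
def phred_to_error(q):
    """Standard Phred relation: probability that the base call is wrong."""
    return 10.0 ** (-q / 10.0)


def sequencing_rates(q):
    """Return (p_seq1, p_seq2) for a base with quality score q.

    p_seq1: probability the base is sequenced correctly.
    p_seq2: probability a specific different base is misread as this base.
    ASSUMPTION: errors are spread evenly over the three other bases.
    """
    e = phred_to_error(q)
    p_seq1 = 1.0 - e
    p_seq2 = e / 3.0
    return p_seq1, p_seq2


p_seq1, p_seq2 = sequencing_rates(30)  # Q30 corresponds to e = 0.001
```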

In the method described above, the observed base set $S$ includes $d$ observed bases $s_j$, of which $n_b$ are identical to the assumed base $b$ and $(d - n_b)$ are erroneous amplified bases different from the assumed base $b$, where $d$ is the largest value of $j$ and $n_b$ is a positive integer less than or equal to $d$; because $P(s_j \mid N=b, F_R)$ equals the erroneous base fraction $F_R$ when the observed base $s_j$ differs from the assumed base $b$, the total product is:

$$\prod_{j=1}^{d} P(s_j \mid N=b, F_R) = (1 - F_R)^{n_b}\, F_R^{\,d - n_b}$$
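As a small illustration of this product for a fixed erroneous base fraction, the sketch below computes it both base by base and in the closed form $(1 - F_R)^{n_b} F_R^{d - n_b}$. The function names and the value 0.25 are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def product_per_base(observed, b, f_r):
    """Multiply P(s_j | N=b, F_R) over every observed base s_j."""
    prob = 1.0
    for s in observed:
        # A match contributes the correct fraction, a mismatch the error fraction.
        prob *= (1.0 - f_r) if s == b else f_r
    return prob


def product_closed_form(observed, b, f_r):
    """Same product written as (1 - F_R)^n_b * F_R^(d - n_b)."""
    d = len(observed)
    n_b = sum(1 for s in observed if s == b)
    return (1.0 - f_r) ** n_b * f_r ** (d - n_b)


observed = list("AAATTA")  # d = 6, n_b = 4 for b = "A"
value = product_per_base(observed, "A", 0.25)
```

Both functions agree because the product depends on the observed bases only through the counts $n_b$ and $d - n_b$.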

In the method described above, $P(S \mid N=b)$ is further obtained by a moment approximation:

$$P(S \mid N=b) = \sum_{k=0}^{n_b} \binom{n_b}{k} (-1)^k\, E\left[F_R^{\,k+d-n_b}\right]$$

where $E\left[F_R^{\,k+d-n_b}\right]$ is the $(k+d-n_b)$-th moment of the probability distribution of the erroneous amplified base fraction $F_R$, $k$ being any non-negative integer less than or equal to $n_b$; and $E\left[F_R^{\,k+d-n_b}\right]$ is approximated by $r / 2^{\,k+d-n_b}$, where $r$ is the probability, in each cycle, that a polymerase chain reaction amplification error occurs at the locus corresponding to the unknown base $N$.
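The moment approximation can be sketched as follows, assuming the likelihood is the binomial expansion of $E[(1 - F_R)^{n_b} F_R^{d - n_b}]$ with each moment of order $m$ replaced by $r / 2^m$. Treating the zeroth moment as exactly 1 is an added detail (it holds for any probability distribution) rather than something the text spells out; the function names are hypothetical.

```python
from math import comb


def moment_approx(m, r):
    """Approximate m-th moment of F_R as r / 2**m.

    The zeroth moment of any probability distribution is exactly 1.
    """
    return 1.0 if m == 0 else r / 2.0 ** m


def likelihood(d, n_b, r):
    """Approximate P(S | N=b) for d observed bases, n_b of which equal b."""
    if not 0 <= n_b <= d:
        raise ValueError("n_b must satisfy 0 <= n_b <= d")
    # Binomial expansion of (1 - F)^n_b * F^(d - n_b), term by term.
    return sum(
        comb(n_b, k) * (-1) ** k * moment_approx(k + d - n_b, r)
        for k in range(n_b + 1)
    )


all_match = likelihood(3, 3, 1e-4)     # every observed base equals b
one_mismatch = likelihood(3, 2, 1e-4)  # one observed base differs from b
```

In this sketch a hypothesis that explains every observed base keeps a likelihood close to 1, while any hypothesis requiring a PCR error is penalized by a factor on the order of `r`.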

The method described above further comprises: taking a template sequence, attaching a molecular barcode to it, and performing polymerase chain reaction on the template sequence to generate the amplicon pool; and randomly sampling and sequencing from the amplicon pool to establish the observed base set $S$, wherein the molecular barcode comprises a nucleic acid capture sequence or a unique molecular identifier (UMI) sequence.

Another object of the present invention is to provide a system for evaluating the consensus base error rate, comprising a polymerase chain reaction quality evaluation module configured to execute the method described above to calculate a consensus base error rate $P_{c(S)}$.

The system described above further comprises: a sequencing quality evaluation module, in signal connection with the polymerase chain reaction quality evaluation module and configured to read the sequencing quality score $Q_i$ of another observed base $b'$ and calculate a sequencing accuracy rate $P_{seq1}$ and a sequencing error rate $P_{seq2}$, wherein the sequencing accuracy rate $P_{seq1}$ can be calculated by the following formula: [formula not reproduced in the source text]; and the sequencing error rate $P_{seq2}$ can be calculated by the following formula: [formula not reproduced in the source text]; and a consensus base error rate $P_{c(S)}$ correction unit, arranged in the polymerase chain reaction quality evaluation module and in signal connection with the sequencing quality evaluation module, configured to correct the per-base probability according to the sequencing accuracy rate $P_{seq1}$, the sequencing error rate $P_{seq2}$, and an amplified-base proportion. The corrected quantity is the probability that the observed base $s_j$ is identical to another observed base $b'$ when the unknown base $N$ is identical to the assumed base $b$ and an amplified base $b''$ different from the assumed base $b$ makes up the amplified-base proportion of the amplicon pool; when the amplified base $b''$ differs from the assumed base $b$, the amplified-base proportion equals the amplified base fraction. After correction, the probability satisfies the following formula: [formula not reproduced in the source text]; where, when the other observed base $b'$ is identical to the assumed base $b$, $P_{pcr1}$ is the probability that the assumed base $b$ is amplified into a base identical to $b'$, and $P_{pcr2}$ is the probability that the assumed base $b$ is amplified into the amplified base $b''$, with $b''$ different from $b$; when the other observed base $b'$ differs from the assumed base $b$, $P_{pcr3}$ is the probability that $b$ is amplified into a base identical to $b'$, $P_{pcr4}$ is the probability that $b$ is amplified into a base identical to $b$, and $P_{pcr5}$ is the probability that $b$ is amplified into the amplified base $b''$, with $b''$ different from both $b$ and $b'$; and where the sequencing accuracy rate $P_{seq1}$ is the probability that the other observed base $b'$ is sequenced correctly, and the sequencing error rate $P_{seq2}$ is the probability that the amplified base $b''$ is incorrectly sequenced as the other observed base $b'$.

A further object of the present invention is to provide a device for evaluating the consensus base error rate, comprising a memory and a processor, wherein the memory stores a program and the processor is configured to execute the program to implement the consensus base error rate evaluation method described above.

When used to detect low-frequency mutations, the technical solution provided by the present invention can calculate a quality score for each consensus base that accurately reflects its error rate; with these consensus base quality scores, more accurate low-frequency mutation detection can be expected.

The present invention is described in further detail below through specific embodiments with reference to the drawings. The details given in the following embodiments are intended to make the invention easier to understand. A person of ordinary skill in the art will nevertheless readily recognize that some of the technical features may be omitted in certain circumstances or replaced by other elements, materials, or methods. In some cases, certain operations related to the invention are not shown or described in the specification, so that the core of the invention is not buried in excessive description; a detailed description of these operations is unnecessary for a person of ordinary skill in the art, who can fully understand them from the description in the specification and the general technical knowledge of the field.

Referring to FIG. 1A, the first embodiment of the present invention is a method for evaluating the consensus base error rate, comprising:
Amplified base selection (S1): selecting one or more amplified bases from an amplicon pool, wherein the amplicon pool is amplified from an unknown base $N$;
Observed base set construction (S2): sequencing the $j$-th selected amplified base as an observed base $s_j$, and assembling the observed bases $s_j$ into an observed base set $S$, where $j$ is any positive integer;
Consensus base $c(S)$ calculation (S3): calculating a consensus base $c(S)$ from the observed base set $S$;
Consensus base error rate $P_{c(S)}$ calculation (S4): calculating a consensus base error rate $P_{c(S)}$, which is the probability that the consensus base $c(S)$ is not the unknown base $N$, wherein the consensus base error rate $P_{c(S)}$ is the sum, over every assumed base $b$ that differs from the consensus base $c(S)$, of the probability that, for the randomly sampled observed base set $S$, the unknown base $N$ is identical to that assumed base $b$, the assumed base $b$ being selected from the group consisting of adenine (A), thymine (T), guanine (G), and cytosine (C).

The consensus base $c(S)$ is the observed base $s_j$ that occurs most frequently in the observed base set $S$. For example, in an observed base set $S$ of six observed bases $s_j$, namely A, A, A, T, T, and A, the consensus base $c(S)$ is A, whose frequency of 2/3 exceeds the frequency of T, 1/3. As this example shows, non-consensus bases $c(S)'$ appear among the amplified bases amplified from the unknown base $N$; if sequencing errors are disregarded, a non-consensus base $c(S)'$ results from a polymerase amplification error. The first embodiment therefore evaluates the probability that the consensus base $c(S)$ is not the assumed base $b$, that is, it calculates the consensus base error rate, as the basis for judging the reliability and validity of the consensus base $c(S)$.
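The majority rule in this example can be sketched in a few lines. The function name `consensus_base` is hypothetical, and the handling of ties (the first base encountered among equally frequent ones wins) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter


def consensus_base(observed):
    """Return the most frequent observed base and its frequency in S."""
    if not observed:
        raise ValueError("the observed base set must not be empty")
    counts = Counter(observed)
    # most_common(1) yields the (base, count) pair with the highest count;
    # among ties, the base seen first is returned (an assumption here).
    base, count = counts.most_common(1)[0]
    return base, count / len(observed)


base, freq = consensus_base(["A", "A", "A", "T", "T", "A"])
```

For the six bases in the example this returns A with a frequency of 4/6, matching the 2/3 quoted in the text.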

In the consensus base error rate $P_{c(S)}$ calculation (S4), the consensus base error rate $P_{c(S)}$ is calculated by the following formula: [formula not reproduced in the source text]; where $P(S \mid N=b)$ is the probability of randomly drawing the observed base set $S$ from the amplicon pool when the unknown base $N$ is identical to the assumed base $b$.

In the consensus base error rate $P_{c(S)}$ calculation (S4), calculating $P(S \mid N=b)$ comprises defining an erroneous base fraction $F_R$, which is the proportion of erroneous amplified bases among the amplified bases in the amplicon pool, an erroneous amplified base being any base different from the assumed base $b$, and calculating $P(S \mid N=b)$ according to the following formula:

$$P(S \mid N=b) = E\left[\prod_{j} P(s_j \mid N=b, F_R)\right]$$

where $\prod_{j} P(s_j \mid N=b, F_R)$ is the product, over every observed base $s_j$ in the observed base set $S$, of the probability $P(s_j \mid N=b, F_R)$ of that observed base when the unknown base $N$ is identical to the assumed base $b$ and the proportion of erroneous amplified bases in the amplicon pool is $F_R$, and the expectation $E[\cdot]$ is taken over the probability distribution of $F_R$.

In this embodiment, when the observed base $s_j$ is identical to the assumed base $b$, $P(s_j \mid N=b, F_R)$ equals a correct base fraction $1 - F_R$; when the observed base $s_j$ differs from the assumed base $b$, $P(s_j \mid N=b, F_R)$ equals the erroneous base fraction $F_R$; the correct base fraction $1 - F_R$ and the erroneous base fraction $F_R$ sum to 1.

In this embodiment, the observed base set $S$ includes $d$ observed bases $s_j$, of which $n_b$ are identical to the assumed base $b$ and $(d - n_b)$ are erroneous amplified bases different from the assumed base $b$, where $d$ is the largest value of $j$ and $n_b$ is a positive integer less than or equal to $d$; because $P(s_j \mid N=b, F_R)$ equals the erroneous base fraction $F_R$ when the observed base $s_j$ differs from the assumed base $b$, the total product is:

$$\prod_{j=1}^{d} P(s_j \mid N=b, F_R) = (1 - F_R)^{n_b}\, F_R^{\,d - n_b}$$

In this embodiment, the calculation of $P(S \mid N=b)$ is further expanded as follows:

$$P(S \mid N=b) = E\left[(1 - F_R)^{n_b}\, F_R^{\,d - n_b}\right] = \sum_{k=0}^{n_b} \binom{n_b}{k} (-1)^k\, E\left[F_R^{\,k+d-n_b}\right]$$

where $E\left[F_R^{\,k+d-n_b}\right]$ is a moment term of the probability distribution of the erroneous base fraction $F_R$. Because the amplicon pool is obtained through one or more amplification cycles, computing the moment terms becomes quite intensive as the number of cycles grows. In principle, when only the first amplification error is considered, the moment terms can be approximated to reduce the computational load, which gives:

$$P(S \mid N=b) \approx \sum_{k=0}^{n_b} \binom{n_b}{k} (-1)^k\, \frac{r}{2^{\,k+d-n_b}}$$

where $E\left[F_R^{\,k+d-n_b}\right]$ is the $(k+d-n_b)$-th moment of the probability distribution of the erroneous amplified base fraction $F_R$, $k$ being any non-negative integer less than or equal to $n_b$, and is approximated by $r / 2^{\,k+d-n_b}$, where $r$ is the probability, in each cycle, that a polymerase chain reaction amplification error occurs at the locus corresponding to the unknown base $N$.
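As a consistency check on the approximation, and not a result stated in the source, the alternating sum can be evaluated in closed form. The sketch assumes every moment of order $m \ge 1$ is replaced by $r/2^m$ and that at least one observed base differs from $b$ (so $d - n_b \ge 1$ and no zeroth moment appears); the binomial theorem then collapses the sum:

```latex
\begin{aligned}
P(S \mid N=b)
  &\approx \sum_{k=0}^{n_b} \binom{n_b}{k} (-1)^k \frac{r}{2^{\,k+d-n_b}} \\
  &= \frac{r}{2^{\,d-n_b}} \sum_{k=0}^{n_b} \binom{n_b}{k} \left(-\tfrac{1}{2}\right)^{k} \\
  &= \frac{r}{2^{\,d-n_b}} \left(1 - \tfrac{1}{2}\right)^{n_b}
   = \frac{r}{2^{\,d}} .
\end{aligned}
```

Under these assumptions the value matches a pool in which, with probability $r$, half of the amplicons carry the error, so that each of the $d$ sampled bases contributes a factor of $1/2$.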

Referring to FIG. 1B, the second embodiment of the present invention is a method for evaluating the consensus base error rate. Its steps and calculations are the same as in the first embodiment, except that the second embodiment accounts not only for the probability that an error arises at the unknown base $N$ during the amplification cycles, but also for the error rate of the sequencing process carried out after amplification is complete.

Continuing the earlier example, consider an observed base set $S$ with 6 observed bases $s_j$, namely A, A, A, T, T and A, whose consensus base $c(S)$ is A. A non-consensus base $c(S)'$ in the set $S$ can arise in three ways. First, an amplified base different from the assumed base $b$, for example T, may have been produced when the unknown base $N$ was amplified. Second, an amplified base identical to the assumed base $b$, for example A, may have been produced but read as T because of a sequencing error. Third, an amplified base different from both the assumed base $b$ and the non-consensus base $c(S)'$, for example C or G, may have been produced and then read as T because of a sequencing error. To cover these cases, the second embodiment integrates the PCR amplification error rate and the sequencing error rate when calculating the error rate of the consensus base $c(S)$, which serves as the basis for judging the reliability of $c(S)$.
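As a small illustration of the example above, the consensus base is simply the most frequent base in the observed set. The sketch below uses Python's standard `collections.Counter`; the function name `consensus_base` is an assumption made for illustration, and ties are resolved by first occurrence, which the source does not specify.

```python
from collections import Counter

def consensus_base(observed):
    """Return (most frequent base, its count) for a list of observed bases.

    Ties are broken by first occurrence in `observed` (an assumption;
    the source text does not say how ties are handled).
    """
    base, count = Counter(observed).most_common(1)[0]
    return base, count

S = ["A", "A", "A", "T", "T", "A"]   # the six observed bases of the example
base, n_b = consensus_base(S)
```

Here `n_b` is the number of observed bases that agree with the consensus, and `len(S) - n_b` is the number of non-consensus observations.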

Accordingly, in the second embodiment the method further comprises a correction step for the calculation of the consensus base error rate $P_{c(S)}$ (S4-1). Using a sequencing accuracy $P_{seq1}$, a sequencing error rate $P_{seq2}$ and an amplified-base fraction $F_{b''}$, the probability $P(s_j \mid N=b \cap F_R)$ is corrected to $P(s_j=b' \mid N=b \cap F_{b''})$. The corrected quantity is the probability that the observed base $s_j$ is identical to another observed base $b'$, given that the unknown base $N$ is the assumed base $b$ and that an amplified base $b''$ different from $b$ makes up a fraction $F_{b''}$ of the amplicon pool. When the amplified base $b''$ differs from the assumed base $b$, the amplified-base fraction $F_{b''}$ equals the amplification-error base fraction. After correction, the probability satisfies:

$$P(s_j=b' \mid N=b \cap F_{b''}) = \begin{cases} P_{pcr1}\,P_{seq1} + \sum_{b'' \neq b} P_{pcr2}\,P_{seq2}, & b' = b \\ P_{pcr3}\,P_{seq1} + P_{pcr4}\,P_{seq2} + \sum_{b'' \neq b,\, b'' \neq b'} P_{pcr5}\,P_{seq2}, & b' \neq b \end{cases}$$

When the other observed base $b'$ is identical to the assumed base $b$: $P_{pcr1}$ is the probability that the assumed base $b$ is amplified as a base identical to $b'$, and $P_{pcr2}$ is the probability that $b$ is amplified as an amplified base $b''$ that differs from $b$.

When the other observed base $b'$ differs from the assumed base $b$: $P_{pcr3}$ is the probability that $b$ is amplified as a base identical to $b'$; $P_{pcr4}$ is the probability that $b$ is amplified as a base identical to $b$; and $P_{pcr5}$ is the probability that $b$ is amplified as an amplified base $b''$ that differs from both $b$ and $b'$.

The sequencing accuracy $P_{seq1}$ is the probability that the other observed base $b'$ was sequenced correctly, and the sequencing error rate $P_{seq2}$ is the probability that an amplified base $b''$ was incorrectly sequenced as $b'$.

In the second embodiment, the method further comprises a step for calculating the sequencing accuracy and the sequencing error rate (S4a): the sequencing accuracy $P_{seq1}$ and the sequencing error rate $P_{seq2}$ are calculated from the sequencing quality score $Q_i$ of the other observed base $b'$. The sequencing accuracy $P_{seq1}$ can be calculated as:

$$P_{seq1} = 1 - 10^{-Q_i/10}$$

and the sequencing error rate $P_{seq2}$ can be calculated as:

$$P_{seq2} = \frac{10^{-Q_i/10}}{3}$$
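The conversion from a quality score to the two sequencing probabilities can be sketched as follows. This assumes that $Q_i$ is a standard Phred-scaled score, so that the sequencing error probability is $q_i = 10^{-Q_i/10}$, and that a sequencing error is equally likely to produce each of the three other bases; the function name `sequencing_rates` is hypothetical.

```python
def sequencing_rates(q_score):
    """Return (P_seq1, P_seq2) for a Phred-scaled quality score.

    P_seq1 -- probability that the base was sequenced correctly
    P_seq2 -- probability that it was mis-sequenced as one particular other base
    Assumes a Phred scale and uniformly distributed sequencing errors.
    """
    q = 10 ** (-q_score / 10)   # sequencing error probability q_i
    return 1 - q, q / 3

p_seq1, p_seq2 = sequencing_rates(30)   # Q30 corresponds to q_i = 0.001
```

Note that `p_seq1 + 3 * p_seq2` equals 1, since the correct call and the three possible miscalls exhaust all outcomes.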

In various embodiments, still referring to FIGS. 1A and 1B, before the amplified bases are selected the method further comprises an amplicon pool preparation step (S1a): a template sequence is taken and ligated to a molecular barcode, where the template sequence includes a locus corresponding to the unknown base $N$, and a polymerase chain reaction is performed on this template sequence to generate the amplicon pool. The molecular barcode includes, but is not limited to, a nucleic acid capture sequence or a unique molecular identifier (UMI) sequence.

The derivation underlying the calculations in the embodiments is described below. First, assume that after one ideal PCR cycle each DNA template is copied into a duplicate with the same sequence. Every nucleotide of the original single-stranded template remains unchanged, but PCR errors in later amplification cycles can still introduce an erroneous base at a given position of a duplicate, and that erroneous base is then propagated as a different base in subsequent PCR cycles.

Assume that after $R$ cycles a total of $2^R$ nucleic acid molecules have been amplified from a single double-stranded DNA template. Some of them may carry a base that differs from the base at the corresponding position of the template, that is, a false (erroneous) base. The fraction of erroneous bases depends on the PCR error rate ($r$) and on the cycle in which the error occurs ($i$). Of the $2^R$ amplicons, only a small number can be randomly sampled and sequenced. A PCR error is observed when a sampled amplicon carries an erroneous base. Observing PCR errors can therefore be viewed as sampling amplicons that carry erroneous bases from the $2^R$ amplicons, and the assessment of PCR errors depends on the fraction of erroneous bases among these amplicons ($F_R$). To evaluate PCR errors, the probability distribution of the erroneous-base fraction $F_R$ must be established.

In the $i$-th cycle, let $t_i$ and $f_i$ be the numbers of true bases and erroneous bases, respectively. Formula (I) gives the probability that, in the next cycle (cycle $i+1$), the $t_i$ true bases and the $f_i$ erroneous bases give rise to $m$ and $n$ erroneous bases, respectively. In formula (I), $r$ is the PCR error rate per base per cycle; in this embodiment, $r$ is the probability that a PCR error occurs at the locus corresponding to the unknown base $N$. The number $m$ of erroneous bases derived from the $t_i$ true bases follows a binomial distribution. Assuming that PCR errors have no preference for any base, an erroneous base gives rise to an erroneous copy in two ways: either no PCR error occurs, with probability $1-r$, or a PCR error occurs but converts the erroneous base into one of the other two erroneous bases, with probability $2r/3$. This explains the factor $1 - r/3$ in formula (I).
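Formula (I) itself is not reproduced in the text, but the description above is enough for a hedged sketch of the one-cycle transition probability: a binomial term with success probability $r$ for the true bases, multiplied by a binomial term with success probability $1 - r/3$ for the erroneous bases. The function name `transition_prob` and the product form are assumptions derived from that description.

```python
from math import comb

def transition_prob(m, n, t_i, f_i, r):
    """P(m, n | t_i, f_i): probability that the copies of t_i true bases contain
    m erroneous bases and the copies of f_i erroneous bases contain n erroneous
    bases, with r the PCR error rate per base per cycle."""
    # True base -> erroneous copy with probability r
    p_from_true = comb(t_i, m) * r ** m * (1 - r) ** (t_i - m)
    # Erroneous base -> erroneous copy with probability (1 - r) + 2r/3 = 1 - r/3
    p_stay = 1 - r / 3
    p_from_false = comb(f_i, n) * p_stay ** n * (1 - p_stay) ** (f_i - n)
    return p_from_true * p_from_false

# The probabilities over all (m, n) pairs should sum to 1.
total = sum(transition_prob(m, n, 3, 2, 0.01)
            for m in range(4) for n in range(3))
```

The remaining probability $r/3$ for an erroneous base corresponds to a PCR error that happens to restore the true base.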

Using formula (I), the probability of having $f_{i+1}$ erroneous bases in cycle $i+1$ is obtained by summing over all possible combinations of erroneous bases, as expressed in formula (II); these combinations cover the erroneous bases derived from both the true bases and the erroneous bases present in cycle $i$. In formula (II), the Kronecker delta function ensures that the $(f_{i+1} - f_i)$ newly derived erroneous bases come from the erroneous and true bases of cycle $i$. $f_i$ can be $0, 1, 2, \dots$, up to $2^i - 1$. Because the original true base necessarily keeps its initial identity through all replications, and erroneous bases already present are carried into the next cycle, $f_i$ is less than or equal to $f_{i+1}$; the upper limit of the sum over $f_i$ is therefore set to $f_{i+1}$.
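The recursion described for formula (II) can be sketched as a forward dynamic program. This is an illustrative reading of the text, not the patented formula: it assumes a single starting template with $f_0 = 0$, $2^i$ molecules before cycle $i+1$, and the binomial transition terms described for formula (I). The Kronecker delta is realised by accumulating probability at index `f_i + m + n`. The names `binom_pmf` and `false_base_distribution` are hypothetical.

```python
from math import comb

def binom_pmf(n, p):
    """Binomial probabilities P(K = k) for k = 0..n."""
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

def false_base_distribution(R, r):
    """Return dist with dist[f] = P(f_R = f) after R cycles from one true template."""
    dist = [1.0]                      # cycle 0: one molecule, no erroneous base
    for i in range(R):
        total = 2 ** i                # molecules present before cycle i + 1
        new = [0.0] * (2 * total)     # f_{i+1} ranges over 0 .. 2^(i+1) - 1
        for f_i, p_f in enumerate(dist):
            if p_f == 0.0:
                continue
            t_i = total - f_i
            from_true = binom_pmf(t_i, r)             # m erroneous copies
            from_false = binom_pmf(f_i, 1 - r / 3)    # n erroneous copies
            for m, p_m in enumerate(from_true):
                for n, p_n in enumerate(from_false):
                    new[f_i + m + n] += p_f * p_m * p_n   # delta(f_{i+1} - f_i - m - n)
        dist = new
    return dist

dist = false_base_distribution(R=4, r=0.01)   # small R keeps the example fast
```

Dividing the index `f` by $2^R$ gives the erroneous-base fraction $F_R$. The pure-Python triple loop grows quickly with $R$, which illustrates why the text later turns to a moment approximation.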

With the dynamic equation above, the probability $P(F_R)$ of the erroneous-base fraction $F_R$ can be obtained by computing $P(f_{i+1})$. For example, FIG. 1C shows the probability distribution of the erroneous-base fraction $F_{10}$ for a PCR error rate of 0.01. As shown in FIG. 1C, three small peaks lie near 0.5, 0.25 and 0.125; they correspond to the first PCR error occurring in the 1st, 2nd and 3rd cycle, which makes roughly 1/2, 1/4 and 1/8 of the bases erroneous. Besides the mathematical calculation, random PCR errors can also be simulated. In this embodiment, a simulation with a PCR error rate of 0.01 and 10 amplification cycles was run 1,000,000 times, and its results agree closely with the mathematical calculation.

Once the probability of the erroneous-base fraction $F_R$ is known, and assuming there are no sequencing errors, the probability that the consensus base is not the true base can be calculated. At a given position, let $s_i$ be the base on the $i$-th sampled amplicon, and let $S=\{s_i\}$ be the set of all such bases. The consensus base $c(S)$ is the base that occurs most frequently in the set $S=\{s_i\}$. With formula (III), the probability $P(N \neq c(S) \mid S)$ that the consensus base $c(S)$ is not the unknown base $N$ can be calculated using Bayes' theorem.

Here it is assumed that there is no prior information about the unknown base $N$, that is, $N$ is equally likely to be any of the four bases (A, T, C, G), so $P(N=b)$ is the constant $1/4$ and can be omitted from formula (III). $P(S \mid N=b)$ is the probability of obtaining the base set $S$ by random sampling from the $2^R$ amplicons, where those amplicons were amplified from an assumed base $b$. $P(S \mid N=b)$ depends on the erroneous-base fraction $F_R$ among the amplicons; after all possible values of $F_R$ are taken into account, $P(S \mid N=b)$ can be calculated with formula (IV). For a given $F_R$, the probability that a randomly sampled base is an erroneous base is simply $F_R$, as shown in formula (V).
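With a uniform prior, Bayes' theorem reduces to a ratio of likelihoods, $P(N \neq c(S) \mid S) = 1 - P(S \mid N=c(S)) / \sum_b P(S \mid N=b)$. The sketch below assumes that the likelihoods $P(S \mid N=b)$ have already been computed by some other means; the function name and the example numbers are hypothetical and are not taken from the source.

```python
def consensus_error_probability(likelihoods, consensus):
    """P(N != c(S) | S) under a uniform prior over the four bases.

    likelihoods -- dict mapping each base b to P(S | N=b)
    consensus   -- the consensus base c(S)
    The constant prior P(N=b) = 1/4 cancels in the ratio.
    """
    total = sum(likelihoods.values())
    return 1 - likelihoods[consensus] / total

# Hypothetical likelihood values, chosen only to illustrate the arithmetic.
example = {"A": 0.9, "C": 0.1, "G": 0.0, "T": 0.0}
p_err = consensus_error_probability(example, "A")
```

Because only the ratio matters, the likelihoods do not need to be normalised before the call.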

When several amplicons are sampled, the formulas above involve the quantity $\sum P(F_R)\,F_R^{\,k}$, which is the $k$-th moment of the probability distribution of the erroneous-base fraction $F_R$. In principle, these values can be obtained from formula (II). When $R$ is large, however, computing $\sum P(F_R)\,F_R^{\,k}$ becomes very expensive, so an approximation is needed. When the error rate is small, only the first PCR error needs to be considered in approximating these moments. For example, 10 PCR amplification cycles involve $2^{10}-1 = 1{,}023$ replication events; with a PCR error rate of only 0.001, the expected number of PCR errors is only about 1.

In addition, a PCR error that occurs in an early cycle produces a much higher erroneous-base fraction and therefore contributes more to the moments. When only the first PCR error is taken into account, the moments can be calculated with formula (VI), in which $i$ is the cycle in which the first PCR error occurs. When the PCR error rate $r$ is relatively small, for example $r < 2^{-R}$, the probability that the first PCR error occurs in the $i$-th cycle is approximately $2^{i-1} r$, because there are $2^{i-1}$ amplicons before the $i$-th cycle and each of them has probability $r$ of a PCR error.

If no further PCR errors occur, the final erroneous-base fraction is $1/2^i$. Further PCR errors can occur, but their contributions to the moments are terms of higher order in $r$, such as $r^2, r^3, \dots$, which are very small when $r \ll 1$. FIG. 2 shows the theoretical values of the moments of $F_R$ for different numbers of PCR cycles: at a PCR error rate of 0.001, the theoretical values of the first 5 moments of the probability distribution of $F_R$ are calculated with formula (II). FIG. 2 also shows the approximate moments based on formula (VI), which agree closely with the theoretical values. With these approximate moments, the probability that the consensus base $c(S)$ is not the unknown base $N$ can be calculated more efficiently, without using the detailed probability distribution of $F_R$.
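Formula (VI) is not reproduced in the text. As a hedged reconstruction from the two facts stated above (the first error falls in cycle $i$ with probability about $2^{i-1} r$, and it then yields an erroneous-base fraction of $1/2^i$), the first-error approximation of the $k$-th moment can be worked out as follows. The closed forms are derived here for illustration and may be arranged differently in the original formula.

```latex
\begin{align}
E\!\left[F_R^{\,k}\right]
  &\approx \sum_{i=1}^{R} 2^{\,i-1} r \left(\frac{1}{2^{i}}\right)^{k}
   = \frac{r}{2} \sum_{i=1}^{R} 2^{-i(k-1)}, \qquad k \ge 1 \\[4pt]
k = 1:\quad E\!\left[F_R\right] &\approx \frac{rR}{2} \\[4pt]
k \ge 2:\quad E\!\left[F_R^{\,k}\right]
  &\approx \frac{r}{2}\cdot\frac{1 - 2^{-R(k-1)}}{2^{\,k-1} - 1}
\end{align}
```

For $R = 10$ and $r = 0.001$ this gives $E[F_R] \approx 0.005$ and $E[F_R^2] \approx 0.0005\,(1 - 2^{-10}) \approx 4.995 \times 10^{-4}$. The $i = 1$ term of the sum alone is $r/2^{k}$, which has the same form as the approximate moment value quoted for the first embodiment.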

With these moments, the probability that the consensus base $c(S)$ is not the unknown base $N$ can be calculated with formulas (VII) and (VIII). As an example, let the observed bases be A and C, with the number of A's, $n_A$, greater than the number of C's, $n_C$, in the observed base set $S$. In formula (VIII), when the PCR error rate satisfies $r \ll 1$, the probabilities $P(S \mid N=G)$ and $P(S \mid N=T)$ for the unknown base $N$ being G or T are far smaller than the probabilities $P(S \mid N=A)$ and $P(S \mid N=C)$ for $N$ being A or C, and can therefore be neglected. Note that when more kinds of bases are observed, formulas (VII) and (VIII) must be adjusted for the number of base types; the adjustment is described later in the text.
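The A/C example above can be sketched end to end. Formulas (VII) and (VIII) are not reproduced in the text, so this is an assumption-laden illustration rather than the patented calculation: it uses the first-error moment approximation (error in cycle $i$ with probability $2^{i-1} r$, resulting fraction $1/2^i$), sets the zeroth moment to 1, lumps all erroneous bases into one fraction $F_R$, and neglects $N=G$ and $N=T$ as the text suggests. All function names are hypothetical.

```python
from math import comb

def approx_moment(k, R, r):
    """First-error approximation of E[F_R^k]; the 0th moment is exactly 1."""
    if k == 0:
        return 1.0
    return sum(2 ** (i - 1) * r * 0.5 ** (i * k) for i in range(1, R + 1))

def likelihood(n_match, n_mismatch, R, r):
    """E[(1 - F_R)^n_match * F_R^n_mismatch] via the binomial expansion."""
    return sum(comb(n_match, k) * (-1) ** k * approx_moment(k + n_mismatch, R, r)
               for k in range(n_match + 1))

def consensus_error(n_A, n_C, R, r):
    """P(N != A | S) when A is the consensus, keeping only N=A and N=C."""
    like_A = likelihood(n_A, n_C, R, r)   # C observations are the errors
    like_C = likelihood(n_C, n_A, R, r)   # A observations are the errors
    return like_C / (like_A + like_C)

p_clean = consensus_error(n_A=5, n_C=0, R=10, r=0.001)   # unanimous reads
p_mixed = consensus_error(n_A=4, n_C=2, R=10, r=0.001)   # two dissenting reads
```

A unanimous set gives a very small error probability, while a split set moves it toward 0.5, which is the behaviour expected from a consensus quality measure. The approximation is only meaningful for small $r$, as the text states.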

Besides PCR errors, sequencing errors can also produce erroneous observed bases $s_j$. This embodiment therefore also includes sequencing errors when evaluating the probability that the consensus base $c(S)$ is not the unknown base $N$. In general, the probability $q_i$ of a sequencing error can be calculated from the sequencing quality score $Q_i$ using formula (IX). Without prior information about which erroneous base a sequencing error produces, the probability of observing any particular one of the erroneous bases is $q_i/3$. With this, PCR errors and sequencing errors can be integrated to assess the quality of the consensus base $c(S)$.

To consider PCR errors and sequencing errors together, it is necessary to return to formula (IV) and spell out how the probability that the consensus base $c(S)$ is not the unknown base $N$ is computed. Specifically, $P(s_i \mid N=b \cap F_R)$ must be rewritten. Formula (X) describes the probability of observing another observed base $b'$ when the unknown base $N$ is the assumed base $b$, where $F_{b''}$ is the fraction of the final amplicon pool made up of erroneous bases of type $b''$, and the braces $\{\,\}$ denote the set of these fractions over the types of amplified base $b''$. When the other observed base $b'$ is the assumed base $b$, the observation results either from a correct PCR amplification, with probability $(1 - F_R)$, followed by correct sequencing, with probability $(1 - q_i)$, or from a PCR amplification error that produced an erroneous amplified base $b''$, with probability $PE_{bb''}$, followed by a sequencing error that read $b''$ as the assumed base $b$, with probability $SE_{b''b}$.

When the other observed base $b'$ is not the assumed base $b$, three scenarios can produce this result: (1) a PCR amplification error produced the erroneous base $b'$ (different from the assumed base $b$) and no sequencing error occurred; (2) PCR amplification was correct, but the assumed base $b$ was mis-sequenced as the other observed base $b'$; (3) a PCR amplification error produced an erroneous amplified base $b''$ that is neither the assumed base $b$ nor the other observed base $b'$, and $b''$ was subsequently mis-sequenced as $b'$.

Assuming that PCR errors and sequencing errors are independent events, the probabilities of the scenarios above can be calculated with formula (X), where $F_R$ is the total fraction of erroneous bases, that is, $F_R = \sum_{b'' \neq b} F_{b''}$. Substituting formula (X) into formulas (VII) and (VIII) gives the error probability of the consensus base $c(S)$, from which a quality score can be assigned to $c(S)$. This quality score accounts for PCR errors and also integrates sequencing errors.
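The scenarios described for formula (X) can be collected into one function. Since the formula itself is not reproduced in the text, the following is a sketch under stated assumptions: the erroneous-base fractions are passed as a dictionary (a hypothetical interface), sequencing errors are uniform over the three other bases with total probability `q`, and PCR and sequencing errors are independent.

```python
BASES = "ACGT"

def p_observe(b_obs, b_true, frac, q):
    """Probability of observing b_obs when the unknown base is b_true.

    frac -- dict mapping each erroneous base b'' (b'' != b_true) to its
            fraction F_b'' in the final amplicon pool
    q    -- sequencing error probability of this read (q_i)
    """
    F_R = sum(frac.values())                      # total erroneous fraction
    if b_obs == b_true:
        # correct PCR and correct read, or erroneous b'' mis-read as b_true
        return (1 - F_R) * (1 - q) + F_R * q / 3
    # fraction of erroneous bases that are neither b_true nor b_obs
    others = sum(f for b2, f in frac.items() if b2 != b_obs)
    return (frac.get(b_obs, 0.0) * (1 - q)        # (1) PCR error, correct read
            + (1 - F_R) * q / 3                   # (2) correct PCR, mis-read
            + others * q / 3)                     # (3) PCR error, then mis-read

frac = {"C": 0.2, "G": 0.05, "T": 0.0}            # illustrative fractions only
total = sum(p_observe(x, "A", frac, 0.001) for x in BASES)
```

The four observation probabilities sum to 1 for any valid `frac` and `q`, which is a useful consistency check on the three scenarios.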

In the equations above, the three types of erroneous observed base are lumped together. Because these three types do not occur with equal probability, the equations need further modification. For example, if the assumed base $b$ is adenine (A), observing two cytosines (C) in the final amplicon pool is more probable than observing one cytosine (C) and one guanine (G), because the former requires only one PCR error whereas the latter requires at least two.

All four bases (A, T, C, G) therefore need to be considered in these equations. Specifically, when the assumed base $b$ is adenine (A), the probability $P(F_R)$ used so far is replaced by its four-base version $P(\{F_C, F_G, F_T\})$ to correct the equations above. The key to simplifying this evaluation model is the approximation of the moments. In the four-base case, the multivariate moments in formula (XI) require a further approximation; again, provided the PCR error rate $r$ is much smaller than 1, the approximation only needs to consider the first few PCR errors.

The evaluation model can be extended further to account for the double-stranded nature of DNA. For a double-stranded template, a single PCR error produces an erroneous base on the newly synthesized strand, and one more cycle is needed before both strands carry the erroneous base. A PCR error in the first cycle therefore leaves only $1/4$, rather than $1/2$, of the final amplicon pool with the erroneous base. Note that a UMI usually labels only one strand of the DNA template. With these factors included, the multivariate moments can be approximated. Here, the level of a multivariate moment is defined by the number of zeros among its exponents; the higher the level, the smaller the moment.

The description above only sets out the mathematical derivation of the statistical model behind the consensus base error rate evaluation method provided by the present invention; it is not intended to limit the specific implementation of the method.

The third embodiment of the present invention provides a consensus base error rate evaluation system (1). Referring to FIG. 3A, the system (1) includes a polymerase chain reaction quality assessment module (11) configured to execute the method of the first embodiment to calculate a consensus base error rate $P_{c(S)}$.

In various embodiments, still referring to FIG. 3A, the system (1) further includes a scoring module (14) that is signal-connected to the polymerase chain reaction quality assessment module (11) and configured to read the value of the consensus base error rate $P_{c(S)}$ and assign a quality score $Q_c$ to the consensus base $c(S)$.

In other embodiments, still referring to FIG. 3A, the system (1) further includes a judgment module (15) that is signal-connected to the scoring module (14) and to the polymerase chain reaction quality assessment module (11). It is configured to receive the quality score $Q_c$ and compare it with a second quality score $P'_{c(S)}$ to produce a judgment indication, where $P'_{c(S)}$ is output by the polymerase chain reaction quality assessment module (11). When $P'_{c(S)}$ is greater than $Q_c$, the judgment indication is "reliable"; when $P'_{c(S)}$ is less than $Q_c$, the judgment indication is "not reliable".

For example, after obtaining the quality score $Q_c$, a user can focus on the group of bases with high quality scores and use $Q_c$ as the basis for setting a cutoff score $Q_x$. In subsequent batch processing of sequencing data, the cutoff score $Q_x$ can then be used to judge the reliability of consensus bases. The cutoff setting in this example can also be extended to a consensus sequence composed of more than one consensus base $c(S)$.

As a further example of judging the reliability of a consensus sequence, see Table 1. An unknown sequence was sequenced, and consensus sequence 1 and consensus sequence 2 were obtained in different batches. Reading from the 5' end to the 3' end, the observed base sets for consensus sequence 1 are $S_{1-1}=\{A_1, C_1, C_1\}$, $S_{1-2}=\{A_2, G_2, A_2\}$ and $S_{1-3}=\{G_3, A_3, A_3\}$, from which the system determines consensus sequence 1 to be 5'-CAA-3'. Consensus sequence 2 is determined to be 5'-CAG-3' from the observed base sets $S_{2-1}=\{A_1, C_1, C_1\}$, $S_{2-2}=\{A_2, G_2, A_2\}$ and $S_{2-3}=\{G_3, G_3, A_3\}$. After a quality score has been calculated for each consensus base, the system finds that the quality score of the consensus base A obtained from observed base set $S_{1-3}$ is below the cutoff score $Q_x$ and therefore issues the judgment indication "not reliable". The quality scores of all consensus bases from observed base sets $S_{2-1}$, $S_{2-2}$ and $S_{2-3}$ are above the cutoff score $Q_x$, so the judgment indication is "reliable". The user can rely on these judgment indications to decide on the reliability of each consensus sequence and make the most appropriate sequence call.

Table 1

| Consensus sequence | Observed base set | Observed bases | Consensus base | Quality score | Judgment indication |
|---|---|---|---|---|---|
| Consensus sequence 1 (5'-CAA-3') | $S_{1-1}$ | $A_1\,C_1\,C_1$ | C | $> Q_x$ | Reliable |
| | $S_{1-2}$ | $A_2\,G_2\,A_2$ | A | $> Q_x$ | Reliable |
| | $S_{1-3}$ | $G_3\,A_3\,A_3$ | A | $< Q_x$ | Not reliable |
| Consensus sequence 2 (5'-CAG-3') | $S_{2-1}$ | $A_1\,C_1\,C_1$ | C | $> Q_x$ | Reliable |
| | $S_{2-2}$ | $A_2\,G_2\,A_2$ | A | $> Q_x$ | Reliable |
| | $S_{2-3}$ | $G_3\,G_3\,A_3$ | G | $> Q_x$ | Reliable |
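The cutoff comparison used in Table 1 can be sketched in a few lines. The source gives no numeric scores, so the values below are purely hypothetical placeholders chosen to reproduce the pattern of the table; the function name `judge` and the treatment of a score exactly equal to the cutoff (counted as not reliable) are also assumptions.

```python
def judge(scores, cutoff):
    """Label each consensus-base quality score against the cutoff score Q_x."""
    return ["reliable" if s > cutoff else "not reliable" for s in scores]

Q_x = 20                          # hypothetical cutoff score
seq1_scores = [35, 32, 12]        # hypothetical scores for C, A, A of 5'-CAA-3'
seq2_scores = [35, 32, 30]        # hypothetical scores for C, A, G of 5'-CAG-3'

seq1_flags = judge(seq1_scores, Q_x)
seq2_flags = judge(seq2_scores, Q_x)
sequence2_ok = all(f == "reliable" for f in seq2_flags)
```

A consensus sequence can then be treated as reliable only when every one of its consensus bases passes the cutoff, as with consensus sequence 2 in Table 1.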

The fourth embodiment of the present invention provides a consensus base error rate evaluation system (1). Referring to FIG. 3B, its basic components and operating principle are the same as in the third embodiment, except that the system (1) further includes the following.

A sequencing quality assessment module (12), signal-connected to the polymerase chain reaction quality assessment module (11) and configured to read the sequencing quality score $Q_i$ of another observed base $b'$ and calculate a sequencing accuracy $P_{seq1}$ and a sequencing error rate $P_{seq2}$. The sequencing accuracy $P_{seq1}$ can be calculated as:

$$P_{seq1} = 1 - 10^{-Q_i/10}$$

and the sequencing error rate $P_{seq2}$ can be calculated as:

$$P_{seq2} = \frac{10^{-Q_i/10}}{3}$$

A consensus base error rate $P_{c(S)}$ correction unit (11a), arranged in the polymerase chain reaction quality assessment module (11) and signal-connected to the sequencing quality assessment module (12). Using the sequencing accuracy $P_{seq1}$, the sequencing error rate $P_{seq2}$ and an amplified-base fraction $F_{b''}$, it corrects $P(s_j \mid N=b \cap F_R)$ to $P(s_j=b' \mid N=b \cap F_{b''})$. The corrected quantity is the probability that the observed base $s_j$ is identical to another observed base $b'$, given that the unknown base $N$ is the assumed base $b$ and that an amplified base $b''$ different from $b$ makes up a fraction $F_{b''}$ of the amplicon pool. When the amplified base $b''$ differs from the assumed base $b$, the amplified-base fraction $F_{b''}$ equals the amplification-error base fraction. After correction, the probability satisfies:

$$P(s_j=b' \mid N=b \cap F_{b''}) = \begin{cases} P_{pcr1}\,P_{seq1} + \sum_{b'' \neq b} P_{pcr2}\,P_{seq2}, & b' = b \\ P_{pcr3}\,P_{seq1} + P_{pcr4}\,P_{seq2} + \sum_{b'' \neq b,\, b'' \neq b'} P_{pcr5}\,P_{seq2}, & b' \neq b \end{cases}$$

When the other observed base $b'$ is identical to the assumed base $b$: $P_{pcr1}$ is the probability that the assumed base $b$ is amplified as a base identical to $b'$, and $P_{pcr2}$ is the probability that $b$ is amplified as an amplified base $b''$ that differs from $b$.

When the other observed base $b'$ differs from the assumed base $b$: $P_{pcr3}$ is the probability that $b$ is amplified as a base identical to $b'$; $P_{pcr4}$ is the probability that $b$ is amplified as a base identical to $b$; and $P_{pcr5}$ is the probability that $b$ is amplified as an amplified base $b''$ that differs from both $b$ and $b'$.

The sequencing accuracy $P_{seq1}$ is the probability that the other observed base $b'$ was sequenced correctly, and the sequencing error rate $P_{seq2}$ is the probability that an amplified base $b''$ was incorrectly sequenced as $b'$.

The fifth embodiment of the present invention provides a consensus base error rate evaluation device (2); see FIG. 4. The device includes a memory (21) and a processor (22). The memory (21) stores a program P. The processor (22) is signal-connected to the memory (21) and is configured to execute the program P so as to implement the consensus base error rate evaluation method of the first or second embodiment, or the system of the third or fourth embodiment.

All or part of the functions of the method provided by the present invention can be implemented in hardware or as a software program. When all or part of the functions are implemented as a software program, the program can be stored in a memory, such as a read-only memory, a random-access memory, a magnetic disk, an optical disc or a hard disk, and executed by a processor to realize those functions. For example, the program may be stored in the memory of a sequencing device and executed by the processor of that device. Alternatively, the program may be stored on a server, another computer, a magnetic disk, an optical disc, a flash memory or a portable hard disk, and then transferred to the memory of the sequencing device by downloading, transmission or copying, or used to overwrite or update the system of the sequencing device with a different version. When the processor executes the program in the memory, all or part of the functions of the method are realized.

Example 1: Calculating the quality of the consensus base c(S) based on PCR and sequencing error probabilities

In this example, the method provided by the present invention is used to calculate the probability that the consensus base c(S) is not the true base. Take an observed base set $S=\{A_1, C_2, A_3\}$, for which the consensus base c(S) is A. Formula (XII) gives, according to formula (X), the probability of observing each base when the assumed base b and the erroneous base composition $\{F_{b''\neq b}\}$ are known. In formula (XII), $u_{ki}$ and $v_{ki}$ are defined for observed bases that are identical to, and different from, the assumed base b, respectively; the indices k and i denote the order and the i-th observed base, respectively. Based on formula (XII), the probability of observing the base set S given an assumed base b can be calculated with formula (XIII).

This calculation can be expressed as a weighted sum of several moments. The weights are jointly determined by whether each observed base matches the assumed base b, the order of the moment, and the sequencing quality of the observed base. Because the corresponding F is simply $F_R$, which represents the total proportion of the three types of erroneous bases, a factor of 3 appears before some of the moments. With these weights, the probability that the consensus base c(S) is not the assumed base b can be calculated according to formula (XIV). Example 1 demonstrates the consensus base error rate calculation for two types of observed bases; when three or more types of observed bases are present, the calculation can be modified following similar logic.
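The weighted-moment idea can be illustrated with the simpler model stated in the claims, in which sequencing errors are ignored, each observed base matching the assumed base b contributes a factor $(1-F_R)$, and each mismatching base contributes $F_R$. The sketch below is not the sequencing-corrected formulas (XII) to (XIV), which are not reproduced in this text. It further assumes a uniform prior over the four bases, and it takes the moments of $F_R$ from a list of sampled fractions rather than from any closed-form approximation. All function names and the sample values are hypothetical.

```python
from math import comb


def empirical_moment(f_values):
    """Return a function m -> mean of f**m over sampled error fractions."""
    def moment(m):
        return sum(f ** m for f in f_values) / len(f_values)
    return moment


def likelihood_from_moments(n_b, d, moment):
    """P(S | N=b) = E[(1-F)^n_b * F^(d-n_b)], expanded binomially as
    sum_k C(n_b, k) * (-1)^k * M_{k+d-n_b}, where M_m = E[F^m]."""
    return sum(comb(n_b, k) * (-1) ** k * moment(k + d - n_b)
               for k in range(n_b + 1))


def consensus_error(observed, consensus, moment, bases="ACGT"):
    """Posterior probability that the true base is not `consensus`,
    assuming a uniform prior over `bases` (an assumption of this sketch)."""
    d = len(observed)
    like = {b: likelihood_from_moments(observed.count(b), d, moment)
            for b in bases}
    total = sum(like.values())
    if total == 0:
        raise ValueError("all likelihoods are zero under these moments")
    return sum(v for b, v in like.items() if b != consensus) / total


if __name__ == "__main__":
    # Hypothetical sampled error fractions: most pools are error-free.
    f_values = [0.0] * 95 + [0.001] * 3 + [0.01, 0.25]
    m = empirical_moment(f_values)
    print("P(S|N=A):", likelihood_from_moments(2, 3, m))
    print("consensus error for S=ACA:", consensus_error("ACA", "A", m))
```

For S = {A, C, A} and b = A the expansion reduces to $M_1 - 2M_2 + M_3$, which makes explicit how the match count and the moment order set the weights.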

Example 2: Estimating the PCR error rate

In Example 2, adenine (A) is taken as the assumed base b. When one type of erroneous base is present, such as C, G or T, the factor (r/3) in the first moments $M_{x00}$ of formula (XI) can be replaced by the corresponding substitution rate, such as $p_{AC}$, $p_{AG}$ or $p_{AT}$. See formula (XV), where $m_{x00}$ is defined as the moment without the substitution rate and is expressed as $p_{bb'}$. Similarly, the factor $(r/3)^2$ in the second moments can be replaced by the product of the corresponding substitution rates; for example, when the erroneous bases are C and G, $p_{AC} \cdot p_{AG}$ is substituted in the $M_{xy0}$ terms. In this way, the quality score of the consensus base c(S) can be calculated simply by replacing the univariate moments in the original expression with multivariate moments.

When the assumed base b is known, the PCR substitution rates can be estimated by the maximum likelihood method. Example 2 focuses on loci whose true base is adenine (A). In the absence of sequencing errors, a base other than adenine at the corresponding position of such a locus indicates that a PCR error has occurred. Cases in which a locus carries two or three types of erroneous bases are ignored, because they are far less frequent when the PCR error rate is low. Given the bases observed at the loci, the probability of the observations can be calculated with formula (XVII), assuming that the loci are mutually independent. In these formulas, $S_i$ denotes the bases observed at locus i, $\{S_i\}$ denotes the set of observed bases, and $\{p_{bb''}\}$ denotes the PCR error model provided by the present invention. Formula (XVIII) gives the probability that all observed bases are adenine (A), and formula (XIX) gives the probability that some of the observed bases are erroneous, for example cytosine (C).

In Example 2, the substitution rate $p_{AC}$ is estimated by the maximum likelihood approach: the estimate is the value of $p_{AC}$ at which the derivative of the log-likelihood is zero. In formula (XX), $n_{AA}$ and $n_{AC}$ are the numbers of sampled loci showing only adenine (A), and showing both adenine (A) and cytosine (C), respectively. According to formula (XX), $p_{AC}$ is proportional to the number of loci at which adenine (A) was converted to cytosine (C); the denominator involves the moments of the loci without erroneous bases, which reflect the PCR process.

Example 3: Estimating the background error rate

After the error probability of the consensus base c(S) obtained from a cluster of UMI-labeled nucleic acid fragments has been calculated, a cutoff score can be set on that probability to focus on consensus bases c(S) with high quality scores, which can be presumed to carry only a very low proportion of PCR and/or sequencing errors. It is therefore reasonable to infer that the remaining erroneous bases arise from background errors, such as random mutations introduced during DNA fragmentation. By applying UMI sequencing to a DNA template of known sequence and using the consensus base error rate obtained as in Example 1, the background error rate of that UMI sequencing run can be estimated.
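The cutoff-and-count procedure described above can be sketched as follows. This is an illustration only: the mapping from error probability to quality score is assumed to be Phred-style ($-10\log_{10} p$), which the text does not state, and the background error rate is estimated simply as the mismatch fraction among consensus bases that pass the cutoff. The function names and the input tuple layout are hypothetical.

```python
import math


def quality_score(p_error, floor=1e-300):
    """Phred-style score -10*log10(p); the floor avoids log10(0).
    The Phred-style mapping is an assumption of this sketch."""
    return -10.0 * math.log10(max(p_error, floor))


def background_error_rate(consensus_calls, cutoff):
    """Estimate the background error rate from high-quality consensus bases.

    consensus_calls: iterable of (consensus_base, reference_base, p_error),
    where reference_base comes from the known DNA template.
    Returns None when no consensus base reaches the cutoff.
    """
    kept = [(c, ref) for c, ref, p in consensus_calls
            if quality_score(p) >= cutoff]
    if not kept:
        return None
    mismatches = sum(1 for c, ref in kept if c != ref)
    return mismatches / len(kept)


if __name__ == "__main__":
    # Hypothetical consensus calls against a known template.
    calls = [("A", "A", 1e-6), ("C", "A", 1e-6),
             ("G", "G", 1e-2), ("T", "T", 1e-5)]
    print(background_error_rate(calls, cutoff=40))
```

A stricter cutoff leaves fewer consensus bases but makes it more plausible that the surviving mismatches are background errors rather than PCR or sequencing errors.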

Example 4

In Example 4, 100 random sequences, each 100 base pairs long, were used as reference sequences for "normal cells", and mutation detection was performed on a total of $10^7$ loci across 1000 normal cells. UMI molecular barcodes were then ligated to the reference sequences, and the UMI-labeled double-stranded sequences were amplified by 20 cycles of PCR with an amplification error rate r of 0.001.

For each UMI, 3 or 5 amplicons were randomly sampled for sequencing. The sequenced amplicons of each UMI were collapsed into a consensus sequence, and the quality of each consensus base was calculated with the evaluation method described above. In Example 4, erroneous consensus bases were identified with this evaluation method and the frequency of each quality score was calculated; the observed frequencies were then compared with the expected frequencies.
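A simulation of this kind can be sketched for a single locus as follows. It is a simplified illustration, not the simulation used in Example 4: it tracks one strand per molecule, applies no sequencing error, substitutes uniformly among the three other bases, and defaults to 10 cycles (as in FIG. 1C) instead of 20 so that the per-molecule loop stays small. All function names are hypothetical.

```python
import random
from collections import Counter


def simulate_amplicon_pool(true_base, cycles, r, rng, bases="ACGT"):
    """Amplify one template base through `cycles` PCR doublings.

    Each copy inherits its template's base, except that with probability r
    it is replaced by one of the other three bases (uniform choice).
    """
    pool = [true_base]
    for _ in range(cycles):
        copies = []
        for base in pool:
            if rng.random() < r:
                copies.append(rng.choice([x for x in bases if x != base]))
            else:
                copies.append(base)
        pool.extend(copies)  # pool size doubles every cycle
    return pool


def consensus_of_sample(pool, n_reads, rng):
    """Sample n_reads amplicons without replacement and majority-vote.
    Ties are broken by first occurrence in the sample."""
    sample = rng.sample(pool, n_reads)
    base, _ = Counter(sample).most_common(1)[0]
    return base, sample


if __name__ == "__main__":
    rng = random.Random(42)
    pool = simulate_amplicon_pool("A", cycles=10, r=0.01, rng=rng)
    frac = sum(b != "A" for b in pool) / len(pool)
    print("erroneous base fraction:", frac)
    print("consensus from 3 reads:", consensus_of_sample(pool, 3, rng))
```

Repeating this over many loci and binning the consensus bases by their computed quality score gives the observed error frequency per score, which can then be compared with the frequency the score predicts.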

FIG. 5A shows the fit between the observed and expected frequencies of the consensus quality scores for UMI consensus sequences built from 3 bases; the two agree closely. FIG. 5B shows that, for UMI consensus sequences built from 5 bases, the observed and expected frequencies of the consensus quality scores also agree closely, except that some quality scores fluctuate more widely. This may be because the error rate is so low that too few loci have those quality scores and/or very few or no erroneous consensus bases occur at them. The model fit in Example 4 indicates that the evaluation method provided by the present invention yields a realistic error probability for each consensus base, which can serve as the basis for calculating quality scores of consensus bases and of consensus sequences.

The present invention provides a method for evaluating the consensus base error rate. It establishes a model of the consensus base error probability that accounts for both PCR errors and sequencing errors, and uses this model to accurately evaluate the error rate of consensus bases produced by unique molecular identifier (UMI) sequencing.

Second, the method provided by the present invention can assign an accurate quality score to every consensus base in a consensus sequence, thereby removing the influence of low-frequency mutations on sequencing accuracy and effectively improving sequencing precision, which has considerable potential for advancing precision medicine.

Third, once the error probability of the consensus bases has been evaluated, the method provided by the present invention can further estimate the background error rate of each UMI sequencing run. This estimate can serve as a reference for improving the protocol design, reagent quality and primer planning of UMI sequencing, and thus helps improve the quality of nucleic acid sequencing.

The above is a further description of the present invention in connection with specific embodiments, and the implementation of the present invention should not be regarded as limited to these descriptions and exemplary embodiments. A person of ordinary skill in the technical field to which the present invention belongs can make simple modifications or substitutions without departing from the concept of the present invention.

1: Consensus base error rate evaluation system
11: Polymerase chain reaction quality evaluation module
11a: Consensus base error rate $P_{c(S)}$ correction unit
12: Sequencing quality evaluation module
14: Scoring module
15: Judgment module
2: Consensus base error rate evaluation device
21: Processor
22: Memory
S1: Amplicon pool selection process
S1a: Amplified base establishment process
S2: Observed base set establishment process
S3: Consensus base c(S) calculation process
S4: Consensus base error rate $P_{c(S)}$ calculation process
S4a: Sequencing accuracy rate and sequencing error rate calculation process
S4-1: Consensus base error rate $P_{c(S)}$ calculation and correction process

FIG. 1A is a flow chart illustrating the method provided by the first embodiment;
FIG. 1B is a flow chart illustrating the method provided by the second embodiment;
FIG. 1C is a probability distribution plot showing the distribution of the erroneous base fraction $F_{10}$ after ten amplification cycles at a PCR error rate of 0.01;
FIG. 2 is a probability distribution plot showing the fit between the theoretical and approximate error rates at different amplification cycles;
FIG. 3A is a block diagram illustrating the consensus base error rate evaluation system provided by the third embodiment;
FIG. 3B is a block diagram illustrating the consensus base error rate evaluation system provided by the fourth embodiment;
FIG. 4 is a block diagram illustrating the consensus base error rate evaluation device provided by the fifth embodiment; and
FIGS. 5A and 5B are probability distribution plots showing the fit between the observed and expected frequencies of consensus base quality in Example 4.

S1: Amplicon pool selection process

S1a: Amplified base establishment process

S2: Observed base set establishment process

S3: Consensus base c(S) calculation process

S4: Consensus base error rate $P_{c(S)}$ calculation process

Claims (8)

A method for evaluating a consensus base error rate, comprising: selecting one or more amplified bases from an amplicon pool, wherein the amplicon pool is amplified from an unknown base N; sequencing the j-th selected amplified base as an observed base $s_j$, and assembling the observed bases $s_j$ into an observed base set S, wherein j is any positive integer; calculating a consensus base c(S) from the observed base set S; and calculating a consensus base error rate $P_{c(S)}$, which is the probability that the consensus base c(S) is not the unknown base N, wherein the consensus base error rate $P_{c(S)}$ is the sum of the probabilities, given the observed base set S, that the unknown base N is identical to an assumed base b while the assumed base b differs from the consensus base c(S), and is obtained by the following formula:
Figure 112124699-A0305-02-0035-1
wherein $P(S \mid N=b)$ is the probability of randomly selecting the observed base set S from the amplicon pool when the unknown base N is identical to the assumed base b, and is calculated by: defining an erroneous base fraction $F_R$, which is the proportion of amplified erroneous bases among the amplified bases in the amplicon pool, wherein an amplified erroneous base is a base different from the assumed base b; and calculating $P(S \mid N=b)$ according to the following formula:
Figure 112124699-A0305-02-0035-2
wherein $P(S \mid N=b \cap \mathbf{F}_R=F_R)$ is the product, over every observed base $s_j$ in the observed base set S, of the probability $P(s_j \mid N=b \cap \mathbf{F}_R=F_R)$ of that observed base when the unknown base N is identical to the assumed base b and the proportion of amplified erroneous bases in the amplicon pool is $F_R$; and wherein the assumed base b is selected from the group consisting of adenine (A), thymine (T), guanine (G) and cytosine (C), and the assumed base b differs from the consensus base c(S).
The method of claim 1, wherein, when the observed base $s_j$ is identical to the assumed base b, $P(s_j \mid N=b \cap \mathbf{F}_R=F_R)$ equals a correct base fraction $F_R'$, and when the observed base $s_j$ differs from the assumed base b, $P(s_j \mid N=b \cap \mathbf{F}_R=F_R)$ equals the erroneous base fraction $F_R$, wherein the sum of the correct base fraction $F_R'$ and the erroneous base fraction $F_R$ is 1.

The method of claim 2, further comprising: correcting, according to a sequencing accuracy rate $P_{seq1}$, a sequencing error rate $P_{seq2}$ and an amplified base ratio $F_{b''\neq b}$, the $P(s_j \mid N=b \cap \mathbf{F}_R=F_R)$ to $P(s_j=b' \mid N=b \cap \{\mathbf{F}_{b''\neq b}\}=\{F_{b''\neq b}\})$, which is the probability that the observed base $s_j$ is identical to another observed base b' when the unknown base N is identical to the assumed base b and the proportion, in the amplicon pool, of an amplified base b'' different from the assumed base b is $F_{b''\neq b}$, wherein, when the amplified base b'' differs from the assumed base b, the amplified base ratio $F_{b''\neq b}$ equals the amplified base fraction $F_R$, and the corrected $P(s_j=b' \mid N=b \cap \{\mathbf{F}_{b''\neq b}\}=\{F_{b''\neq b}\})$ satisfies the following formula:
Figure 112124699-A0305-02-0036-3
wherein, when the other observed base b' is identical to the assumed base b, $P_{pcr1}$ is the probability that the assumed base b is amplified into a base identical to the other observed base b', and $P_{pcr2}$ is the probability that the assumed base b is amplified into the amplified base b'', the amplified base b'' being different from the assumed base b; wherein, when the other observed base b' differs from the assumed base b, $P_{pcr3}$ is the probability that the assumed base b is amplified into a base identical to the other observed base b', $P_{pcr4}$ is the probability that the assumed base b is amplified into a base identical to the assumed base b, and $P_{pcr5}$ is the probability that the assumed base b is amplified into the amplified base b'', the amplified base b'' being different from both the assumed base b and the other observed base b'; and wherein the sequencing accuracy rate $P_{seq1}$ is the probability that the other observed base b' is sequenced correctly, and the sequencing error rate $P_{seq2}$ is the probability that the amplified base b'' is incorrectly sequenced as the other observed base b'.
The method of claim 3, further comprising calculating the sequencing accuracy rate $P_{seq1}$ and the sequencing error rate $P_{seq2}$ from the sequencing quality score $Q_i$ of the other observed base b', wherein the sequencing accuracy rate $P_{seq1}$ can be calculated by the following formula:
Figure 112124699-A0305-02-0037-4
and the sequencing error rate $P_{seq2}$ can be calculated by the following formula:
Figure 112124699-A0305-02-0037-5
The method of any one of claims 1 to 4, wherein the observed base set S includes d observed bases $s_j$, the d observed bases $s_j$ comprising $n_b$ bases identical to the assumed base b and $(d-n_b)$ amplified erroneous bases different from the assumed base b, wherein d is the maximum positive integer value of j and $n_b$ is a positive integer less than or equal to d; and wherein, when the observed base $s_j$ differs from the assumed base b, $P(s_j \mid N=b \cap \mathbf{F}_R=F_R)$ equals the erroneous base fraction $F_R$, and the product of the $P(s_j \mid N=b \cap \mathbf{F}_R=F_R)$ is calculated as follows:
Figure 112124699-A0305-02-0037-6
The method of claim 5, wherein $P(S \mid N=b)$ is further obtained by moment approximation:
Figure 112124699-A0305-02-0037-7
wherein
Figure 112124699-A0305-02-0038-8
is the $(k+d-n_b)$-th moment of the probability distribution of the amplified erroneous base fraction $F_R$, wherein k is any non-negative integer less than or equal to $n_b$; and wherein
Figure 112124699-A0305-02-0038-9
is approximated by $r/2^{k+d-n_b}$, wherein r is the probability, in each cycle, that a polymerase chain reaction amplification error occurs at the locus corresponding to the unknown base N.
A system for evaluating a consensus base error rate, comprising a polymerase chain reaction quality evaluation module configured to execute the method of any one of claims 1 to 4 to calculate a consensus base error rate $P_{c(S)}$.

The evaluation system of claim 7, further comprising a sequencing quality evaluation module, signal-connected to the polymerase chain reaction quality evaluation module and configured to read the sequencing quality score $Q_i$ of another observed base b' and to calculate a sequencing accuracy rate $P_{seq1}$ and a sequencing error rate $P_{seq2}$, wherein the sequencing accuracy rate $P_{seq1}$ can be calculated by the following formula:
Figure 112124699-A0305-02-0038-10
and the sequencing error rate $P_{seq2}$ can be calculated by the following formula:
Figure 112124699-A0305-02-0038-11
; and a consensus base error rate $P_{c(S)}$ correction unit, disposed in the polymerase chain reaction quality evaluation module and signal-connected to the sequencing quality evaluation module, to correct, according to the sequencing accuracy rate $P_{seq1}$, the sequencing error rate $P_{seq2}$ and an amplified base ratio $F_{b''\neq b}$, the $P(s_j \mid N=b \cap \mathbf{F}_R=F_R)$ to $P(s_j=b' \mid N=b \cap \{\mathbf{F}_{b''\neq b}\}=\{F_{b''\neq b}\})$, which is the probability that the observed base $s_j$ is identical to another observed base b' when the unknown base N is identical to the assumed base b and the proportion, in the amplicon pool, of an amplified base b'' different from the assumed base b is $F_{b''\neq b}$, wherein, when the amplified base b'' differs from the assumed base b, the amplified base ratio $F_{b''\neq b}$ equals the amplified base fraction $F_R$, and the corrected $P(s_j=b' \mid N=b \cap \{\mathbf{F}_{b''\neq b}\}=\{F_{b''\neq b}\})$ satisfies the following formula:
Figure 112124699-A0305-02-0039-12
wherein, when the other observed base b' is identical to the assumed base b, $P_{pcr1}$ is the probability that the assumed base b is amplified into a base identical to the other observed base b', and $P_{pcr2}$ is the probability that the assumed base b is amplified into the amplified base b'', the amplified base b'' being different from the assumed base b; wherein, when the other observed base b' differs from the assumed base b, $P_{pcr3}$ is the probability that the assumed base b is amplified into a base identical to the other observed base b', $P_{pcr4}$ is the probability that the assumed base b is amplified into a base identical to the assumed base b, and $P_{pcr5}$ is the probability that the assumed base b is amplified into the amplified base b'', the amplified base b'' being different from both the assumed base b and the other observed base b'; and wherein the sequencing accuracy rate $P_{seq1}$ is the probability that the other observed base b' is sequenced correctly, and the sequencing error rate $P_{seq2}$ is the probability that the amplified base b'' is incorrectly sequenced as the other observed base b'.
TW112124699A 2022-07-06 2023-07-03 Method for evaluating consensus base error rate and a system thereof TWI852661B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
TW111125402 2022-07-06

Publications (2)

Publication Number Publication Date
TW202403774A TW202403774A (en) 2024-01-16
TWI852661B true TWI852661B (en) 2024-08-11

Family

ID=90457515

Family Applications (1)

Application Number Title Priority Date Filing Date
TW112124699A TWI852661B (en) 2022-07-06 2023-07-03 Method for evaluating consensus base error rate and a system thereof

Country Status (1)

Country Link
TW (1) TWI852661B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180211001A1 (en) * 2016-04-29 2018-07-26 Microsoft Technology Licensing, Llc Trace reconstruction from noisy polynucleotide sequencer reads

Also Published As

Publication number Publication date
TW202403774A (en) 2024-01-16
