TW201619456A - Method for detecting chromosomal structural abnormalities and device therefor - Google Patents
- Publication number
- TW201619456A TW103140128A
- Authority
- TW
- Taiwan
Description
The present invention relates to the technical fields of genomics and bioinformatics, and in particular to a method and a device for detecting chromosomal structural abnormalities.
Common methods of chromosome examination currently include the following. Karyotype analysis: G-band karyotyping, for example, judges structural abnormalities from the distribution of 400 to 600 bands, so it can usually detect only chromosome-level abnormalities. In the best case it detects deletions and duplications larger than 5 Mbp, and it cannot detect smaller fragments (<5 Mbp). The method also requires culturing living cells, so the cells must remain viable.
Fluorescence in situ hybridization (FISH): this method can detect deletions, duplications, and balanced translocations of smaller fragments, but the chromosome fragments to be examined must be determined in advance so that the corresponding probes can be prepared, so it is limited by probe design. Because FISH cannot examine unknown regions, it is commonly used to verify detection results.
Microarray: this method includes two probe approaches, one designed around single nucleotide polymorphisms (SNPs) and one designed around CNV, and it therefore has limitations similar to those of FISH.
With the continuing development of whole-genome sequencing technology, sequencing costs keep falling, which makes widespread use of whole-genome sequencing possible. It is therefore necessary to study means of discovering chromosomal structural abnormalities from whole-genome sequencing results.
According to one aspect of the present invention, a method for detecting chromosomal structural abnormalities is provided, comprising the following steps: obtaining whole-genome sequencing results of a target individual, the results comprising multiple read pairs, each read pair consisting of two read sequences located at the two ends of the sequenced chromosome fragment, where the two reads of each pair come from the positive strand and the negative strand of the corresponding fragment respectively, or both come from the same strand, positive or negative, of the corresponding fragment; aligning the sequencing results with a reference sequence to obtain an abnormal match set, the abnormal match set comprising first-type read pairs, in which the two read sequences match different chromosomes of the reference sequence; clustering the read sequences in the abnormal match set into clusters according to their matched positions, each cluster containing the read sequences from one end of a group of read pairs, with the read sequences from the other end located in another cluster; and filtering the clusters, which includes calculating the compactness of each cluster and filtering out every cluster whose compactness does not satisfy a preset requirement R-va, together with its paired cluster, to obtain filtered result clusters containing first-type read pairs, which are used to determine whether a translocation-type chromosomal structural abnormality has occurred.
According to another aspect of the present invention, a device for detecting chromosomal structural abnormalities is provided, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; and a processor, in data connection with the data input unit, the data output unit, and the storage unit, for executing the executable program stored in the storage unit, where execution of the program carries out the above method for detecting chromosomal structural abnormalities.
According to yet another aspect of the present invention, a computer-readable storage medium is provided for storing a program to be executed by a computer. A person of ordinary skill in the art will understand that, when the program is executed, all or some of the steps of the above method for detecting chromosomal structural abnormalities can be completed by instructing the relevant hardware. The storage medium may include a read-only memory, a random access memory, a magnetic disk, an optical disk, and the like.
The method according to the present invention aligns whole-genome sequencing results with a reference sequence to obtain read pairs that match different chromosomes, which makes it possible to screen for translocation-type chromosomal structural abnormalities. Clustering and filtering further improve the validity and reliability of the results, so that analytically meaningful results can be obtained.
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Figure 1 is a schematic diagram of a pair of Reads obtained by paired-end sequencing according to an embodiment of the present invention;
Figure 2 is a schematic diagram of abnormally matched first-type Reads according to an embodiment of the present invention;
Figure 3 is a schematic diagram of abnormally matched second-type Reads according to an embodiment of the present invention;
Figure 4 is a schematic diagram of abnormally matched third-type Reads according to an embodiment of the present invention;
Figure 5 is a schematic diagram of a pair of clusters located on different chromosomes according to an embodiment of the present invention;
Figure 6 is a schematic diagram of the RPK of "FA" in Experimental Example 1 according to an embodiment of the present invention;
Figure 7 is a schematic diagram of the RPK of "SON" in Experimental Example 1 according to an embodiment of the present invention.
According to an embodiment of the present invention, a method for detecting chromosomal structural abnormalities is provided, comprising the following steps:
Step 1. Obtain the whole-genome sequencing results of the target individual.
The sequencing results comprise read pairs, referred to here as Reads. Each pair of Reads consists of two read sequences located at the two ends of the sequenced chromosome fragment. The two read sequences of each pair come from the positive strand and the negative strand of the corresponding fragment respectively, or both come from the same strand, positive or negative, of the corresponding fragment.
The sequenced chromosome fragments are usually obtained by fragmenting a chromosome sample from the target individual, and a library is prepared according to the chosen sequencing method. Available sequencing methods, by platform, include but are not limited to CG (Complete Genomics), Illumina/Solexa, ABI/SOLiD, and Roche 454; a single-end or paired-end sequencing library is prepared according to the chosen platform. In one specific embodiment of the invention, paired-end sequencing is performed, and the two read sequences Read1 and Read2 of each pair of Reads come from the positive strand Sp and the negative strand Sm of the corresponding chromosome fragment respectively, as shown in Figure 1. The length L-r1 of Read1 and the length L-r2 of Read2 may be the same or different. Of course, if a single-read sequencing method can obtain the complete sequence of the whole chromosome fragment, it is also feasible to cut sequences of suitable length from the two ends of the complete sequence to form a pair of Reads; in that case both read sequences of each pair come from the same strand, positive or negative, of the corresponding fragment. This embodiment does not limit the specific sequencing method used.
In the present invention, the size of the library used for sequencing is denoted L-lib. A library with L-lib of 100 to 1000 bp is generally called a small-fragment library, and a library with L-lib of 2 Kbp, 5 to 6 Kbp, 10 Kbp, 20 Kbp, or 40 Kbp is called a large-fragment library. The invention places no requirement on L-lib, but in general, provided the quality of library construction is assured, a longer library helps in obtaining valid results, so an L-lib of at least 300 bp is preferred. Typically a large-fragment library, for example 5 Kbp, or a small-fragment library, for example 500 bp, can be used. To give the sequencing results good abundance, the sequencing depth may be chosen greater than 2× for a large-fragment library and greater than 5× for a small-fragment library; to avoid wasting data, the preferred depth is 2× for a large-fragment library and 5× for a small-fragment library. Note that most of the specific figures in the present invention are statistical in nature. Unless otherwise stated, any value given in exact form therefore represents a range, namely the interval within plus or minus 10% of that value; this is not repeated below.
L-r1 and L-r2 are preferably at least 25 bp, because below 25 bp the unique alignment rate falls, which makes the subsequent alignment results more complex to obtain. L-r1 and L-r2 also need not be very large, to avoid wasting data, so 50 bp is preferred. There is no upper limit on L-r1 and L-r2, and the values may change as sequencing technology develops; with current sequencing technology, for example, L-r1 and L-r2 generally do not exceed 150 bp.
Step 2. Align the sequencing results with the reference sequence.
The reference sequence is a known sequence and may be any previously obtained reference template for the biological category to which the target individual belongs. For example, if the target individual is human, HG19, provided by the National Center for Biotechnology Information (NCBI), may be chosen as the reference sequence. A resource library containing more reference sequences may also be configured in advance; before alignment, a closer reference sequence is then selected according to factors such as the sex, ethnicity, and region of the target individual, which helps in obtaining more accurate detection results. During alignment, depending on the alignment parameters, a pair of Reads is allowed at most n mismatched bases, where n is preferably 1 or 2. If more than n bases in the Reads are mismatched, the pair is regarded as unalignable to the reference sequence; alternatively, if the mismatched bases are all located in one Read of the pair, that Read alone is regarded as unalignable to the reference sequence. Various alignment software can be used, for example SOAP (Short Oligonucleotide Analysis Package), bwa, and samtools; this embodiment does not limit the choice.
Based on how the Reads align, the following categories are obtained:
(1) Normal match set *.pair: this contains Reads in which the two read sequences Read1 and Read2 match the same chromosome of the reference sequence, the strand relationship of the matched positions is consistent with the strand relationship within the Reads, and the chromosome fragment length L-pr calculated from the matched positions deviates from L-lib by less than a preset threshold V-lib. V-lib is preferably 5% × L-lib to 15% × L-lib, more preferably 10% × L-lib. This threshold is set empirically from the standard deviation of the library size. Empirically, the standard deviation is about 15 bp for small-fragment libraries and about 50 bp for large-fragment libraries, and a deviation of L-pr from L-lib within 3 standard deviations can be considered suitable. For a 500 bp library, for example, a suitable range for L-pr is 455 bp to 545 bp.
From *.pair, the distribution of the number of Reads over matched positions can be obtained. For example, one can count the number of Reads per unit length, RPU, where the unit length is set according to L-lib, for example 1.5 to 4 times L-lib. If L-lib is 500 bp, the unit length may be set to 1 Kbp, in which case RPU is written RPK. How RPU changes relative to its mean, for example whether the change exceeds a preset threshold V-rm, can help in judging whether a structural abnormality has occurred and improves the accuracy of the analysis. V-rm is preferably 10% to 30%, more preferably 20%. The mean RPU can be obtained by counting or by estimation, for example as sequencing depth × (unit length / L-lib). If RPU is not needed, *.pair need not be obtained.
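The RPU estimate and the V-rm check described above can be illustrated with a small Python sketch. The function names `expected_rpu` and `deviates` are hypothetical, and treating the "change" as a relative deviation from the mean is an assumption; the text only states that the change is compared with V-rm.

```python
def expected_rpu(depth: float, unit_len: int, lib_size: int) -> float:
    """Estimated mean Reads per unit length: depth x (unit length / L-lib)."""
    return depth * (unit_len / lib_size)


def deviates(rpu: float, mean_rpu: float, v_rm: float = 0.20) -> bool:
    """True when the relative change from the mean exceeds V-rm.

    Assumption: the change is measured as |rpu - mean| / mean.
    """
    return abs(rpu - mean_rpu) / mean_rpu > v_rm


# Small-fragment library: L-lib = 500 bp, depth 5x, unit length 1 Kbp (RPK).
mean_rpk = expected_rpu(depth=5, unit_len=1000, lib_size=500)
print(mean_rpk)               # 10.0
print(deviates(7, mean_rpk))  # True: 30% below the estimated mean
print(deviates(9, mean_rpk))  # False: 10% below the estimated mean
```

With the preferred V-rm of 20%, a 1 Kbp window holding 7 Reads against an estimated mean of 10 would be flagged, while one holding 9 would not.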
(2) Abnormal match set *.sin: this contains the following three types of Reads.
In first-type Reads, the two read sequences match different chromosomes of the reference sequence. Such Reads are associated with translocation-type structural abnormalities, for example balanced and unbalanced translocations. Figure 2 shows a balanced translocation: in one pair of Reads, Read1 matches chromosome chra and Read2 matches chromosome chrb, while in another pair the situation is exactly reversed. The dashed line connecting Read1 and Read2 in the figure indicates their head-to-tail positional relationship within the chromosome fragment (likewise below). pa and pb denote the positions of possible breakpoints, where a "breakpoint" is a boundary point at which a chromosomal structural abnormality occurs.
In second-type Reads, the two read sequences match the same chromosome of the reference sequence, but L-pr is negative. Such Reads are associated with tandem duplication-type structural abnormalities. As shown in Figure 3, Read1 and Read2 of a pair of Reads both match chromosome chra, but the head-to-tail relationship of the matched positions is opposite to that of Read1 and Read2 within the chromosome fragment. pa1 and pa2 denote the start and end positions of a possible duplicated fragment, L-sv denotes the length of the duplicated fragment, and the dashed line in the middle of chra in the figure indicates omitted length (likewise below).
In third-type Reads, the two read sequences match the same chromosome of the reference sequence, but L-pr is greater than L-lib and the deviation exceeds the preset threshold V-lib. Such Reads are associated with deletion-type structural abnormalities. As shown in Figure 4, Read1 and Read2 of a pair of Reads both match chromosome chra, and the head-to-tail relationship of the matched positions is the same as that of Read1 and Read2 within the chromosome fragment, but the distance exceeds the suitable range. pa1 and pa2 denote the start and end positions of a possible deleted fragment, and L-sv denotes the length of the deleted fragment.
Because the different types of Reads in the abnormal match set represent different kinds of possible chromosomal structural abnormalities, it is not necessary to obtain all of the above types; what is obtained depends on the detection requirement. For example, if only translocation-type structural abnormalities need to be detected, only first-type Reads need be taken from the alignment results. Likewise, the abnormal match set is not limited to the three types above: any Reads, or any single read sequence of a pair of Reads, that can be matched to the reference sequence but does not belong to the normal match set may be counted in the abnormal match set. A person of ordinary skill in the art can associate the different forms of abnormal match with the corresponding chromosomal structural abnormalities that may occur. In addition, to allow for interference such as noise, whether the positive and negative strands match or not may be disregarded within the abnormal match set.
(3) Unmatched set *.unmap: this contains Read sequences that cannot be matched to the reference sequence. These may be paired (neither Read of the pair matches) or single-ended (the other Read of the pair does match).
The single-ended Read sequences in *.unmap can be used, after the result clusters are obtained, for breakpoint assembly to narrow down the breakpoint range. If breakpoint assembly is not needed, *.unmap need not be obtained.
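The three categories above can be summarized as a decision procedure. The following Python sketch is illustrative only: the `Mate` record, the `classify_pair` function, and the label strings are hypothetical, and L-pr is assumed to be the signed difference between the fragment-end coordinate implied by Read2 and the fragment-start coordinate implied by Read1, so that a reversed order gives a negative value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mate:
    chrom: Optional[str]  # None when the Read could not be aligned
    pos: int = 0          # fragment-end coordinate implied by this Read (assumed)
    strand: str = "+"     # "+" or "-"


def classify_pair(r1: Mate, r2: Mate, l_lib: int, v_lib_frac: float = 0.10) -> str:
    """Assign a pair of Reads to *.unmap, *.sin (by type), or *.pair."""
    if r1.chrom is None or r2.chrom is None:
        return "unmap"
    if r1.chrom != r2.chrom:
        return "sin:type1"            # different chromosomes: translocation
    l_pr = r2.pos - r1.pos            # signed fragment length (assumption)
    if l_pr < 0:
        return "sin:type2"            # reversed order: tandem duplication
    v_lib = v_lib_frac * l_lib        # preferred V-lib is 10% of L-lib
    if l_pr - l_lib > v_lib:
        return "sin:type3"            # too far apart: deletion
    # Normal pairs here assume paired-end data (Read1 on "+", Read2 on "-").
    if abs(l_pr - l_lib) < v_lib and r1.strand != r2.strand:
        return "pair"
    return "sin:other"                # mappable, but not a normal match


print(classify_pair(Mate("chr1", 1000, "+"), Mate("chr1", 1500, "-"), l_lib=500))  # pair
print(classify_pair(Mate("chr1", 1000, "+"), Mate("chr7", 8000, "-"), l_lib=500))  # sin:type1
print(classify_pair(Mate("chr1", 5000, "+"), Mate("chr1", 1000, "-"), l_lib=500))  # sin:type2
print(classify_pair(Mate("chr1", 1000, "+"), Mate("chr1", 9000, "-"), l_lib=500))  # sin:type3
```

The strand check is applied only when deciding the normal match set, in line with the remark that strand agreement may be disregarded within the abnormal match set.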
Step 3. Cluster the read sequences in *.sin into clusters according to their matched positions.
Any clustering algorithm may be used; this embodiment does not limit the choice. One simple approach is to divide clusters by a set minimum inter-cluster distance V-cl. The read sequences, sorted by position, are scanned starting from the first Read. If the distance between the second Read and the first is less than V-cl, both are placed in the same cluster and the scan continues from the second Read. When the distance between the n-th Read and the (n-1)-th Read is greater than V-cl, a second cluster starts at the n-th Read. This process repeats until all Reads have been traversed. During clustering the positive and negative strands need not be considered separately; clustering by the position at which each Read matches the chromosome is sufficient.
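The simple gap-based clustering described above can be sketched in Python as follows. The function name `cluster_by_gap` is hypothetical, and the sketch assumes the input positions all lie on one chromosome. Two choices are assumptions not fixed by the text: a gap exactly equal to V-cl keeps the Read in the current cluster, and clusters with fewer than `min_reads` members are dropped.

```python
def cluster_by_gap(positions, v_cl=10_000, min_reads=2):
    """Split sorted matched positions into clusters wherever the gap exceeds V-cl."""
    clusters, current = [], []
    for pos in sorted(positions):
        # A gap larger than V-cl to the previous Read starts a new cluster.
        if current and pos - current[-1] > v_cl:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)
    # Drop clusters that are too small to analyze (isolated Reads).
    return [c for c in clusters if len(c) >= min_reads]


reads = [100, 200, 50_000, 50_500, 200_000]
print(cluster_by_gap(reads))  # [[100, 200], [50000, 50500]]
```

In the example, the isolated Read at position 200000 is more than V-cl away from its neighbor and is discarded rather than reported as a one-member cluster.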
After clustering, each cluster contains the read sequences from one end of a group of Reads, and the read sequences from the other end lie in another cluster, so the two clusters can be called a pair of clusters. Figure 5 shows a pair of clusters, cluster1 and cluster2, located on different chromosomes; paired clusters may of course also lie on the same chromosome. For the analysis after clustering to be meaningful, each cluster preferably contains at least two Read sequences. A single Read whose distance to both its preceding and following Read is greater than V-cl can be discarded as abnormal data.
V-cl should be no less than L-lib. If it is set too low, there will be too many candidate clusters with too few Read sequences each, which hampers later screening and filtering and may increase false positives. If it is set too high, breakpoints may be harder to determine because the breakpoint range widens. V-cl is therefore preferably 10 Kbp. Depending on the clustering algorithm used, V-cl can have different specific meanings, for example the distance between the centers of gravity of two adjacent clusters, or the distance between the two closest Read sequences of two adjacent clusters.
Step 4. Filter the clusters obtained by clustering.
Filtering aims to remove as much interference as possible, such as sample contamination, sequencing errors, alignment errors, and noise, so that the results reflect real chromosomal structural abnormalities as closely as possible. Filter conditions can therefore be set according to actual needs and the types of interference likely to occur. This embodiment preferably provides the following filtering methods; in practice, one or several of them may be used alone or in combination:
(1) Cluster compactness: calculate the compactness of each cluster and filter out every cluster whose compactness does not satisfy a preset requirement R-va, together with its paired cluster. Any suitable mathematical method may be used to calculate compactness. For example, compactness can be expressed as a variance, calculated from the position of each Read in the cluster relative to the center or center of gravity of the cluster; the smaller the variance, the higher the compactness. Preferably, when calculating compactness, the read sequences lying within a length range of 5% to 25%, preferably 20%, at the two ends of the cluster are discarded, to reduce the influence of peripheral data on the result. Preferably, R-va is set as a fixed threshold, for example requiring the variance to be below a fixed value, or as an elimination proportion, for example requiring the variance to rank within a preset lowest interval among all clusters, such as the lowest 2% to 10%, preferably 5%.
簇的緊緻程度反映了Read分佈的穩定性,表明Read是不是集中在一個較小的區間內,一般而言,真實的結構變異會淹沒在眾多的“環境噪音”之中,但“環境噪音”對整個全基因組的影響基本是均勻的,所以在全序列中呈現基本平均分佈的趨勢(當然,也可能會受到例如GC(鳥嘌呤Guanine和胞嘧啶Cytosine)含量等的影響),而在真實的結構變異發生的地方,簇內的Read通常會呈現類似正態分佈的趨勢,因此緊緻程度,例如方差,能很好地反映簇間的差異情況。 The degree of compactness of the cluster reflects the stability of the Read distribution, indicating whether Read is concentrated in a small interval. In general, the actual structural variation will be submerged in numerous "environmental noise", but "environmental noise." "The effect on the whole genome is basically uniform, so there is a tendency to show a basic average distribution in the whole sequence (of course, it may also be affected by, for example, GC (guanine Guanine and cytosine Cytosine) content), but in reality Where the structural variation occurs, the Read in the cluster usually exhibits a trend similar to a normal distribution, so the degree of compactness, such as variance, can well reflect the differences between clusters.
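The compactness filter described above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function names, the representation of a cluster as a list of read positions, and the reading of the trimming rule as "drop reads in the outer `trim` fraction of the cluster span at each end" are all assumptions. For brevity the sketch ranks single clusters; removing the partner of each rejected cluster is left out.

```python
from statistics import pvariance


def trimmed_variance(positions, trim=0.20):
    """Variance of read positions about their mean, after dropping reads in
    the outer `trim` fraction of the cluster span at each end (assumed rule)."""
    lo, hi = min(positions), max(positions)
    span = hi - lo
    kept = [p for p in positions if lo + trim * span <= p <= hi - trim * span]
    # Fall back to all reads if trimming leaves too few to measure spread.
    if len(kept) < 2:
        kept = list(positions)
    return pvariance(kept)


def filter_by_compactness(clusters, keep_fraction=0.05, trim=0.20):
    """Return the names of clusters whose variance ranks in the lowest
    `keep_fraction` of all clusters (the elimination-ratio form of R-va).

    `clusters` maps a hypothetical cluster name to its read positions.
    """
    ranked = sorted(clusters, key=lambda name: trimmed_variance(clusters[name], trim))
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:n_keep]
```

A fixed-threshold variant would simply compare `trimmed_variance(...)` with a constant instead of ranking.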
(II) By linear correlation of paired clusters: compute the linear correlation of the two clusters in each pair, and filter out the pairs whose linear correlation does not satisfy the preset requirement R-li. Any suitable mathematical method may be used to compute the linear correlation of a pair of clusters, for example computing the correlation coefficient of the two clusters; the higher the coefficient, the stronger the linear correlation. Preferably, R-li may be set as a fixed threshold, for example requiring the correlation coefficient to be above a fixed value, or as an elimination ratio, for example requiring the correlation coefficient to rank within a preset highest interval among all clusters; for example, R-li may require the correlation coefficient to rank within the highest 2% to 10% of all clusters, preferably 5%.
Linear correlation places more emphasis on the consistency of the Reads distribution within the paired clusters, that is, whether the distributions at the two ends of the Reads follow essentially the same trend. It therefore better reflects the distribution inside the paired clusters.
In a preferred embodiment, filtering the candidate clusters by cluster compactness (for example variance) in combination with linear correlation gives good results.
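One way to read the linear-correlation criterion is as a Pearson correlation between the two mapped positions of each pair of Reads in a pair of clusters. The sketch below assumes exactly that; the text only says "correlation coefficient", so the choice of Pearson's r, the function names, and the input layout (one `(left_position, right_position)` tuple per pair of Reads) are assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # no spread at one end: treat as uncorrelated
    return sxy / sqrt(sxx * syy)


def pair_correlation(read_pairs):
    """Correlation between the left-end and right-end positions of the
    pairs of Reads that make up one pair of clusters."""
    lefts = [left for left, _ in read_pairs]
    rights = [right for _, right in read_pairs]
    return pearson(lefts, rights)
```

Depending on the orientation of the rearrangement, a strongly negative coefficient may be as informative as a strongly positive one, so a practical filter might rank on the absolute value instead.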
(III) By a control set of normal samples: compare the paired clusters against a preset control set containing multiple normal samples, and filter out the paired clusters for which the number of normal samples hit reaches the preset threshold V-con. A normal sample is the set of result clusters obtained by subjecting another normal individual of the same species as the target individual to the analysis described above ("alignment, clustering, filtering", and so on). To ease comparison, all the Reads in a cluster can be merged into one, so that a pair of clusters yields one merged pair of values (similar to one pair of Reads), and the merged value pairs are used for the comparison. By assembling a control set containing a large number of normal samples, the frequency with which a result cluster appears in normal individuals can be obtained. If a result cluster appears frequently, it was probably caused by sample properties, the experimental procedure, the sequencing process, or environmental noise, and does not mean that the sample itself truly carries that structural variation. Such a result cluster is a false positive shared by different samples analyzed with the same method and should be removed. Filtering clusters against a control set therefore further lowers the false-positive rate and helps to obtain true structural-variation results. V-con can be determined according to how the normal samples were established and their characteristics; for example, the ratio of V-con to the number of normal samples in the control set may be 3% to 10%, preferably 5% to 6%. For example, if the control set contains 90 normal samples, 5 hits may be regarded as reaching the threshold.
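The control-set filter can be sketched as below. The text does not define how clusters are merged or what counts as a "hit", so the following choices are assumptions made purely for illustration: a cluster is merged to the mean of its read positions, a pair of clusters is a tuple `(chrom_a, pos_a, chrom_b, pos_b)`, and a normal sample is hit when it contains a pair on the same chromosomes whose two positions each lie within a hypothetical tolerance `tol`.

```python
def merge_cluster(positions):
    """Merge all Reads of a cluster into one value (assumed: mean position)."""
    return sum(positions) / len(positions)


def hits_sample(candidate, sample, tol):
    """True if `sample` (a list of merged cluster pairs) contains a pair that
    matches `candidate` on both chromosomes and within `tol` at both ends."""
    ca, pa, cb, pb = candidate
    return any(
        ca == sa and cb == sb and abs(pa - qa) <= tol and abs(pb - qb) <= tol
        for sa, qa, sb, qb in sample
    )


def passes_control_filter(candidate, control_set, v_con=5, tol=500):
    """Keep the candidate only if fewer than `v_con` normal samples are hit."""
    n_hits = sum(hits_sample(candidate, sample, tol) for sample in control_set)
    return n_hits < v_con
```

With a control set of 90 normal samples and `v_con=5`, as in the example in the text, a candidate seen in five or more normal samples would be discarded.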
(IV) By other auxiliary parameters: auxiliary parameters include any parameters that help to further confirm or distinguish the type of structural abnormality, or to understand its details, for example the number of mismatches produced during alignment, the number of Reads supporting a cluster, the RPU value of the relevant region obtained from *.pair, and whether the cluster lies in an N region. Auxiliary parameters can be used in two ways. One is as filtering conditions: filtering requirements tied to the auxiliary parameters are set, and clusters that do not meet them are filtered out directly. The other is as a reference for judgment: the auxiliary parameters are provided along with the result clusters, and the judgment is made by manual analysis. The content of this section can therefore be applied in step 4 (for filtering) or after the next step, step 5 (to assist manual analysis); this embodiment does not limit the specific way in which auxiliary parameters are used. Some auxiliary parameters and their relation to the analysis of results are listed below. In actual use they may be set as filtering conditions as described, or serve as aids to manual analysis; different auxiliary parameters may be used in combination or individually.
(1) Number of mismatches: the average number of mismatches of the Reads in paired clusters generally does not exceed 1 or 2, that is, each pair of Reads is allowed 1 or 2 mismatches, preferably no more than 1. If the matching requirement at alignment was already set this way, this parameter need not be considered again. If the alignment setting was looser, for example allowing 2 mismatches, the result clusters may be filtered or judged again by this parameter, for example allowing only 1 mismatch on average.
(2) Number of Reads supporting a cluster: that is, the number of Reads contained in the paired clusters. In principle, the larger this parameter the better. The criterion can generally be set to be roughly equal to the normalized sequencing depth, or slightly below it (for example after rounding), where the normalized sequencing depth is: sequencing depth × (range over which L-lib affects the breakpoint / L-lib) × (mean span of the two ends of the paired clusters / L-lib). The "range over which L-lib affects the breakpoint" is usually larger than the "sum of the spans of the two ends of the paired clusters", and generally fluctuates around a mean of 2 times L-lib, for example between 1 and 4 times L-lib; in a specific setting it may be relaxed or tightened as appropriate.
(3) RPU value of the relevant region obtained from *.pair: different types of structural abnormality usually affect RPU differently. For example, with a balanced translocation the RPU on either side of the breakpoint does not change noticeably, whereas with a deletion or duplication the RPU of the region between the breakpoints is noticeably lower or higher. The RPU of the relevant region can therefore be used to further verify, or help to judge, the occurrence of a chromosomal structural abnormality. For example: for a cluster containing first-type Reads, if it is judged to be a balanced translocation from the relationships among the Reads in the cluster (see step 5, section (I) below), the RPU on both sides of the breakpoint should deviate from the mean by no more than V-rm; if it is judged to be an unbalanced translocation (see step 5, section (I) below), the RPU on the side of the breakpoint facing away from the result cluster should be below the mean, with a deviation exceeding V-rm. For a cluster containing second-type Reads, the RPU of the region between the breakpoints should be above the mean, with a deviation exceeding V-rm. For a cluster containing third-type Reads, the RPU of the region between the breakpoints should be below the mean, with a deviation exceeding V-rm.
When RPU is used as an aid to manual analysis, the RPU of the relevant region may be presented as a graph, a table, or in another easily readable form, or the RPU variation over the full range may be presented as a graph, table, or the like, so that the operator can grasp the overall picture.
(4) Whether the cluster lies in an N region: experience shows that Reads alignment near N regions (which include the centromere and telomere regions) is more complex than in other regions. If the cluster obtained does not lie in an N region, a judgment can usually be made on the information already obtained. If it does lie in an N region, more careful verification may be needed, for example using the filtering conditions and auxiliary parameters in combination, or combining other external data, such as the phenotype of the target individual and/or the results of further precise sequencing of the breakpoint (for example Sanger sequencing), to reach a final judgment.
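Two of the auxiliary parameters above are simple arithmetic and can be illustrated with a short sketch. The function names are hypothetical, and taking the influence range as exactly 2 × L-lib is an assumption (the text gives 2 × L-lib only as a typical mean, with 1 to 4 × L-lib as the usual range).

```python
def normalized_depth(depth, l_lib, mean_span, influence_factor=2.0):
    """Normalized sequencing depth:
    depth * (influence range / L-lib) * (mean cluster-end span / L-lib),
    with the influence range assumed to be influence_factor * L-lib."""
    influence_range = influence_factor * l_lib
    return depth * (influence_range / l_lib) * (mean_span / l_lib)


def rpu_deviates(region_rpu, mean_rpu, v_rm=0.20):
    """True if the RPU of a region differs from the mean by more than V-rm,
    expressed as a fraction of the mean."""
    return abs(region_rpu - mean_rpu) / mean_rpu > v_rm
```

As a worked example using figures that appear later in this document (sequencing depth 2.2, L-lib 500 bp, cluster-end spans 618 and 580, so a mean span of 599), `normalized_depth(2.2, 500, 599)` gives roughly 5.27, which is close to the 5 supporting Reads reported for that sample. A region whose RPU is half the mean would be flagged by `rpu_deviates(0.5, 1.0)` at V-rm = 20%.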
Step 5. Perform data analysis on the filtered result clusters.
The existence of a result cluster after filtering already indicates that a chromosomal structural abnormality of the corresponding type may have occurred, so if the aim is only to discover possible structural abnormalities, this step is not required. To obtain more detailed information about the structural abnormality, the result clusters can be analyzed further. Depending on the type of result cluster, the following analyses may be used:
(I) Chromosomal translocation structural abnormality (first-type Reads)
Search the result clusters containing first-type Reads. If two adjacent reads occupy opposite positions within the Reads to which they respectively belong (one being a left-end Read and the other a right-end Read), take the range between the positions to which these two reads are matched as the range of the breakpoint. This situation is usually associated with a balanced translocation, in which the Reads of one cluster are distributed on both sides of the breakpoint.
If no such Reads exist, take the position of the innermost Read and extend a preset length inward from that position as the range of the breakpoint. The innermost Read is defined as follows: if the cluster contains only left-end Reads, the rightmost Read is the innermost; if the cluster contains only right-end Reads, the leftmost Read is the innermost. This situation is usually associated with an unbalanced translocation, in which the Reads of one cluster lie on one side of the breakpoint. The span of the breakpoint range extended from the innermost Read can be determined from L-lib, L-r1/L-r2, the sequencing depth, and so on; for example, it may be 0.5 to 2 times L-lib, and is generally no more than 2 times L-lib.
Referring to FIG. 2, a balanced translocation is shown. Suppose a pair of result clusters is obtained (only two reads are drawn in each cluster; the rest are omitted) and distributed as in FIG. 2: one result cluster lies near position pa on chromosome chra, and its paired result cluster lies near position pb on chromosome chrb. In the cluster on chra, Read1 is the left-end Read of its chromosome fragment, while the adjacent Read2 is the right-end Read of its chromosome fragment, so the breakpoint pa of chra can be taken to lie between Read1 and Read2. The analysis on chrb is similar.
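The two translocation cases above can be sketched for a single cluster as follows. This is a minimal sketch under stated assumptions: each read is reduced to one coordinate plus an end label (`"L"` for a left-end Read, `"R"` for a right-end Read), "inward" is taken to mean rightward for left-end Reads and leftward for right-end Reads, and the function name and the default extension of 1 × L-lib are hypothetical.

```python
def translocation_breakpoint(reads, l_lib, extend_factor=1.0):
    """Estimate the breakpoint range for one cluster of first-type Reads.

    `reads` is a list of (position, end) tuples, end being "L" or "R".
    Returns (kind, (start, stop)).
    """
    if not reads:
        raise ValueError("cluster is empty")
    ordered = sorted(reads)
    # Balanced case: two adjacent reads with opposite end labels bracket
    # the breakpoint.
    for (p1, e1), (p2, e2) in zip(ordered, ordered[1:]):
        if e1 != e2:
            return "balanced", (p1, p2)
    # Unbalanced case: all reads share one label; extend from the innermost.
    ext = int(extend_factor * l_lib)
    if ordered[0][1] == "L":
        inner = ordered[-1][0]  # rightmost read is innermost
        return "unbalanced", (inner, inner + ext)
    inner = ordered[0][0]  # leftmost read is innermost
    return "unbalanced", (inner - ext, inner)
```

For example, `translocation_breakpoint([(100, "L"), (180, "L"), (260, "R")], 500)` places the breakpoint between 180 and 260, while a cluster made up only of left-end Reads yields a range starting at its rightmost read.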
Based on the above analysis, for a possible translocation the following result data may be output: the numbers of the two chromosomes on which the translocation may have occurred (the chromosomes on which the result clusters respectively lie); the position ranges of the two ends of the paired result clusters (the ranges of the boundaries of the cluster ends on the two chromosomes, from which the spans of the two ends can be obtained); the breakpoint ranges obtained by the analysis; and so on. Related parameters produced during filtering and other auxiliary parameters may be output as well, for example the compactness of each of the pair of result clusters, their mutual linear correlation, the number of Reads supporting the pair of result clusters, and graphs or tables showing the RPU variation on both sides of the breakpoints.
(II) Chromosomal tandem duplication structural abnormality (second-type Reads)
Search the result clusters containing second-type Reads. In the paired clusters, take the range between the two matched positions that are farthest apart as the range of the duplication, and extend a preset length, for example 0.5 to 2 times L-lib, outward from each of these two positions as the range of the breakpoints (the start and end points of the duplicated fragment).
Referring to FIG. 3, a tandem duplication is shown. Both ends of the paired result clusters (only one read is drawn in each cluster; the rest are omitted) lie within the range between the start and end points of the duplicated fragment. The start and end points of the duplicated fragment can therefore be taken to lie within ranges extending outward from the outermost Reads at the two ends of the clusters (these two Reads do not necessarily belong to the same pair of Reads).
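The outward extension for a tandem duplication can be sketched as below. The function name and the dictionary layout are hypothetical, and each cluster is assumed to be given as a collection of matched positions on the same chromosome.

```python
def duplication_ranges(left_cluster, right_cluster, l_lib, extend_factor=1.0):
    """Estimate the duplicated range and the two breakpoint ranges from a
    pair of clusters of second-type Reads (positions on one chromosome)."""
    positions = list(left_cluster) + list(right_cluster)
    start, stop = min(positions), max(positions)  # farthest-apart positions
    ext = int(extend_factor * l_lib)
    return {
        "duplicated": (start, stop),
        "left_breakpoint": (start - ext, start),   # extended outward (left)
        "right_breakpoint": (stop, stop + ext),    # extended outward (right)
    }
```

Applied to the cluster boundaries reported in Experimental Example 2 of this document (73557040-73557288 and 73670432-73670682, with L-lib = 500 bp and a 1 × L-lib extension), this yields breakpoint ranges of 73556540-73557040 and 73670682-73671182, matching the ranges listed there.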
Compared with a translocation, the types of result data output for a duplication are largely the same, except that the chromosome numbers at the two ends of the cluster are identical, and data giving the estimated length of the duplicated fragment may also be output.
(III) Chromosomal deletion structural abnormality (third-type Reads)
Search the result clusters containing third-type Reads. In the paired clusters, take the range between the two matched positions that are closest together as the range of the deletion, and extend a preset length, for example 0.5 to 2 times L-lib, inward from each of these two positions as the range of the breakpoints (the start and end points of the deleted fragment).
Referring to FIG. 4, a fragment deletion is shown. Both ends of the paired result clusters (only one read is drawn in each cluster; the rest are omitted) lie outside the start and end points of the deleted fragment. The start and end points of the deleted fragment can therefore be taken to lie within ranges extending inward from the two closest Reads at the two ends of the clusters (these two Reads do not necessarily belong to the same pair of Reads).
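The deletion case mirrors the duplication case but works inward from the two closest positions. In this sketch the function name is hypothetical, `left_cluster` is assumed to be the cluster at the lower coordinates, and clamping the breakpoint ranges so that they do not run past the opposite cluster is an added assumption, not something stated in the text. The coordinates in the example are invented for illustration.

```python
def deletion_ranges(left_cluster, right_cluster, l_lib, extend_factor=1.0):
    """Estimate the deleted range and the two breakpoint ranges from a pair
    of clusters of third-type Reads (positions on one chromosome)."""
    inner_left = max(left_cluster)    # closest positions of the two clusters
    inner_right = min(right_cluster)
    if inner_left >= inner_right:
        raise ValueError("left_cluster must lie entirely before right_cluster")
    ext = int(extend_factor * l_lib)
    return {
        "deleted": (inner_left, inner_right),
        # Extend inward, but never beyond the opposite cluster (assumption).
        "left_breakpoint": (inner_left, min(inner_left + ext, inner_right)),
        "right_breakpoint": (max(inner_right - ext, inner_left), inner_right),
    }


example = deletion_ranges([1000, 1200], [5000, 5300], l_lib=500)
```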
Compared with a duplication, the types of result data output for a deletion are largely the same, except that the output data giving the estimated length of the fragment between the breakpoints represents the length of the deleted fragment.
Step 6. Breakpoint assembly.
To narrow the breakpoint range further, the data in *.unmap can be used for breakpoint assembly. For example: obtain the single-end Reads (Reads of which only one end matches the reference sequence; they may be placed in *.sin during alignment) within a set range (for example 0.5 to 2 times L-lib) around the determined breakpoint range; extract from *.unmap the Reads paired with them as patch sequences; cut every patch sequence into N segments, N preferably being 2; realign the subsequences obtained by cutting the patch sequences against the reference sequence; and assemble the breakpoint region according to the results that match normally.
In actual use, N can be set sensibly according to the length of L-r1/L-r2. Because the unique-alignment rate drops sharply once the sequence length falls below 25 bp, N may be chosen so that the length of the truncated subsequences is not below, or not appreciably below, 25 bp.
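The choice of N and the cutting of patch sequences can be sketched as follows. The function names are hypothetical, and the rule "use the preferred N unless that would make segments shorter than 25 bp" is one possible reading of the guidance above, not a rule stated in the text.

```python
def choose_n(read_len, min_len=25, preferred=2):
    """Pick the number of segments: the preferred N, reduced if needed so
    that each segment is at least `min_len` bases long."""
    n_max = max(1, read_len // min_len)
    return min(preferred, n_max)


def split_read(seq, n):
    """Cut `seq` into `n` contiguous segments of near-equal length."""
    base, extra = divmod(len(seq), n)
    pieces, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)  # spread the remainder
        pieces.append(seq[start:start + size])
        start += size
    return pieces
```

For 50 bp reads, `choose_n(50)` returns 2 and each patch sequence is cut into two 25 bp subsequences, which are then realigned to the reference; for 40 bp reads the sketch would leave the read uncut rather than produce 20 bp segments.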
Breakpoint assembly can effectively narrow the breakpoint range. On that basis, probes can be prepared according to the position range of the breakpoint, and other precise sequencing methods, such as Sanger sequencing, can be used to obtain the exact breakpoint position for further study of the breakpoint. If the breakpoint range does not need to be narrowed, this step can be omitted.
Those of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may include read-only memory, random-access memory, a magnetic disk, an optical disc, and the like.
According to another aspect of the invention, a device for detecting chromosomal structural abnormalities is also provided, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; and a processor, in data connection with the data input unit, the data output unit, and the storage unit, for executing the executable program stored in the storage unit, execution of which includes carrying out all or some of the steps of the methods in the above embodiments.
The results of running the detection method according to the invention are described in detail below for specific target individuals. The parameters used in the detection procedure below are set as follows:
1. L-lib is 500 bp, with PE50 sequencing (paired-end sequencing; L-r1 and L-r2 are essentially 50 bp).
2. NCBI HG19 is chosen as the reference sequence, and the SOAP software is used to align the sequencing results.
3. V-lib is ±45 bp; V-rm for RPK is 20%; V-cl is 10 kbp (the distance between clusters is defined as the distance between the two nearest Reads); the minimum number of Reads in a cluster is 2; R-va is set so that the variance ranks within the lowest 5% of all clusters (when computing the variance, the reads lying within a length range of 20% at the two ends of the cluster are ignored); R-li is set so that the correlation coefficient ranks within the lowest 5% of all clusters; the control set includes 90 normal samples; and V-con is 5.
Experimental Example 1
This example is a family study of cri-du-chat syndrome. The two target individuals belong to one family; "FA" denotes the father and "SON" denotes the son.
1. Low-coverage whole-genome sequencing was performed on each of the two target individuals; the sequencing depth was 2.2 for "FA" and 3.1 for "SON".
2. The SOAP alignment software was then used to align the sequencing results of the two target individuals against the reference sequence HG19, yielding two files, FA.sin and SON.sin.
3. The two files FA.sin and SON.sin were clustered, filtered, and analyzed; the resulting clusters and related parameters were output as follows:
"FA":
Chromosomes on which the paired result clusters lie: chr12, chr5
Position ranges of the two ends of the paired result clusters: 14779615-14780233, 23314785-23314205
Spans of the two ends of the paired result clusters: 618, 580
Number of Reads supporting the pair of result clusters: 5
Compactness (variance) of the left and right ends: 90.59, 87.01
Located in an N region: no
Breakpoint ranges: chr12:14779968~14780233, chr5:23314205~23314455
Variation of RPK in the relevant chromosome regions: as shown in FIG. 6, where the horizontal axis is the position on the chromosome in units of 10 kbp, the vertical axis is RPK, the curve is drawn from the data of FA.pair, and pa and pb mark the breakpoint positions. The figure shows no noticeable change in the RPK of "FA".
"SON":
Chromosomes on which the paired result clusters lie: chr12, chr5
Position ranges of the two ends of the paired result clusters: 14779618-14779968, 23314455-23314830
Spans of the two ends of the paired result clusters: 350, 375
Number of Reads supporting the pair of result clusters: 6
Compactness (variance) of the left and right ends: 22.43, 18.44
Located in an N region: no
Breakpoint ranges: chr12: greater than 14779968; chr5: less than 23314455
Variation of RPK in the relevant chromosome regions: as shown in FIG. 7, where the horizontal axis is the position on the chromosome in units of 10 kbp, the vertical axis is RPK, the curve is drawn from the data of SON.pair, and pa and pb mark the breakpoint positions. The figure shows a noticeable change in the RPK of "SON". The computed RPK values show that the RPK of the short arm of chromosome 5 of SON is only 0.5 times the mean, while that of the short arm of chromosome 12 is 0.5 times higher than the mean.
The analysis results show clearly that "FA" carries a balanced translocation and "SON" an unbalanced translocation, and that the breakpoint range derived from the "FA" result is already within 300 bp. To study the breakpoint position further, we then extracted the corresponding sequence from the reference sequence HG19, designed primers, and performed qPCR verification and Sanger sequencing, finally obtaining the exact breakpoint positions: Chr12:14780019, Chr5:23314435.
Experimental Example 2
This example is a study of congenital heart disease. The target individual is a patient with congenital heart disease, denoted "XX".
1. Low-coverage whole-genome sequencing was performed on the target individual, with a sequencing depth of 2.7.
2. The SOAP alignment software was then used to align the sequencing results against the reference sequence HG19, yielding XX.sin.
3. XX.sin was clustered, filtered, and analyzed; the resulting clusters and related parameters were output as follows:
"XX":
Chromosomes on which the paired result clusters lie: chr14, chr14
Position ranges of the two ends of the paired result clusters: 73557040-73557288, 73670432-73670682
Estimated length of the duplicated fragment: 113392
Spans of the two ends of the paired result clusters: 248, 250
Number of Reads supporting the pair of result clusters: 4
Compactness (variance) of the left and right ends: 100.63, 100.59
Located in an N region: no
Breakpoint ranges: chr14:73556540-73557040, chr14:73670682-73671182 (range size estimated at 1 times L-lib, i.e. 500 bp)
The analysis results show clearly that chromosome 14 of "XX" carries a duplication about 113 kbp long, and that the duplication occurs in tandem. To study the breakpoint position further, we then extracted the corresponding sequence from the reference sequence HG19, designed primers, and performed qPCR verification and Sanger sequencing. The qPCR amplification ratio was greater than 1, indicating a duplication, and Sanger sequencing gave the exact breakpoint positions Chr14:73557008 and Chr14:73670820, confirming that chromosome 14 of "XX" carries a duplication of 113812 bp, with the duplicated fragment inserted in tandem at the end of the original fragment.
The above are merely preferred embodiments of the invention. It should be understood that these embodiments serve only to explain the invention and are not intended to limit it. Those of ordinary skill in the art may vary the specific embodiments above in accordance with the idea of the invention.
Claims (14)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW103140128A TW201619456A (en) | 2014-11-19 | 2014-11-19 | Method for detecting chromosomal structural abnormalities and device therefor |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| TW201619456A true TW201619456A (en) | 2016-06-01 |
Family
ID=56754917
Country Status (1)
| Country | Link |
|---|---|
| TW (1) | TW201619456A (en) |
-
2014
- 2014-11-19 TW TW103140128A patent/TW201619456A/en unknown