WO2019132010A1 - Method, apparatus and program for estimating base type in base sequence - Google Patents
Method, apparatus and program for estimating base type in base sequence Download PDFInfo
- Publication number
- WO2019132010A1 WO2019132010A1 PCT/JP2018/048502 JP2018048502W WO2019132010A1 WO 2019132010 A1 WO2019132010 A1 WO 2019132010A1 JP 2018048502 W JP2018048502 W JP 2018048502W WO 2019132010 A1 WO2019132010 A1 WO 2019132010A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sequence
- lead
- base
- nucleic acid
- consensus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Images
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/20—Sequence assembly
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6806—Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
Definitions
- the present invention relates to a technique for estimating a base type in a base sequence.
- Cancer cells are said to be generated by mutation of base sequences in oncogenes and tumor suppressor genes (cancer related genes) in normal cells.
- cancer related genes cancer related genes
- development of a method to distinguish normal cells from cancer cells has been promoted by applying genetic abnormalities of these cancer-related genes.
- Non-Patent Document 1 and Non-Patent Document 2 disclose that a trace amount of abnormal DNA derived from cancer cells present in blood or stool could be detected.
- molecular targeting drugs that target abnormal proteins translated by mutations in cancer-related genes, it is required to determine the base sequence with high accuracy.
- non-patent document 3 discloses smCounter as a technique for enhancing detection accuracy by statistical processing of base sequence information obtained from a sequence analysis device, higher detection sensitivity and accuracy such as liquid biopsy can be obtained. In the case of lack of precision, it is insufficient.
- An object of the present invention is to provide a method, an apparatus, and a computer program for carrying out the method for estimating a base type in a base sequence.
- the present inventor determines the reliability score for each molecular barcode from the large amount of base sequence data (read information) of the next-generation sequencer obtained from the DNA fragment prepared by adding the molecular barcode to the template DNA, We found that parametric analysis can detect mutations with higher sensitivity than conventional methods.
- the present inventor calculates an index for efficiently removing an error generated by nucleic acid amplification or sequencing, and the true base type We have developed a statistical method that determines with high accuracy.
- the technique for enhancing detection accuracy of the present invention can be expected, for example, to detect variant nucleic acids at an earlier disease stage.
- a computer is used to provide a first estimation method for estimating the presence of a specific base type present at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture.
- the first estimation method is a) obtaining a lead sequence derived from each nucleic acid molecule, b) mapping the read sequence obtained in step a onto a reference sequence to obtain a mapping result; c) obtaining a clustering result of the lead sequence based on the mapping result obtained in step b, d) obtaining consensus lead information consisting of a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result, e) From the sequence information of each consensus lead contained in the set of consensus reads obtained from each nucleic acid molecule constituting the clonal nucleic acid mixture through steps a to d, the base type present at the position corresponding to the specific position on the reference sequence And the process of selecting their reliability score, f) obtaining a ratio of the number of consensus reads containing each specific base species at the particular position selected in step e to the total number of consensus leads including the particular position used in step e, g) Estimate the occurrence rate of appearance of a specific base due to an analysis error
- a second estimation is performed using a computer to estimate the proportion of a specific base species present at a specific position in a base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture in the clonal nucleic acid mixture.
- a method is provided.
- the second estimation method is a) obtaining a lead sequence derived from each nucleic acid molecule, b) mapping the read sequence obtained in step a onto a reference sequence to obtain a mapping result; c) obtaining a clustering result of the lead sequence based on the mapping result obtained in step b, d) obtaining consensus lead information comprising a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result, e) From the sequence information of each consensus lead contained in the set of consensus reads obtained from each nucleic acid molecule constituting the clonal nucleic acid mixture through steps a to d, the base type present at the position corresponding to the specific position on the reference sequence And the process of selecting their reliability score, f) obtaining a ratio of the number of consensus reads containing each base species specific to the particular position selected in step e to the total number of consensus leads including the particular position used in step e, g) Estimate the occurrence rate of a distinctive base due to an analysis error at a
- step h a step of presuming presence in clonal nucleic acid mixture at a significant level, and i) containing the base species at the specific position obtained in step f, when it is judged that the base species selected in step e is present in step h
- step h Calculating the ratio of the specific base species at the specific position selected in step e at the threshold or significance level set in step g from the number of consensus leads, and adopting it as the ratio present in the clonal nucleic acid mixture, including.
- a third estimation method for estimating a base sequence of a specific region of each nucleic acid molecule constituting a clonal nucleic acid mixture using a computer.
- the third estimation method is a) obtaining a lead sequence derived from each nucleic acid molecule, b) mapping the read sequence obtained in step a onto a reference sequence to obtain a mapping result; c) obtaining a clustering result of the lead sequence based on the mapping result obtained in step b, d) obtaining consensus lead information consisting of a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result, e) From the sequence information of each consensus lead contained in the set of consensus reads obtained from each nucleic acid molecule constituting the clonal nucleic acid mixture through steps a to d, the base type present at the position corresponding to the specific position on the reference sequence And the process of selecting their reliability score, f) obtaining a ratio of the number of consensus
- a computer readable program for causing a computer to execute any of the above first to third estimation methods.
- an estimation device for estimating the presence of a specific base species at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture.
- the estimation device is A lead sequence acquisition unit for acquiring a lead sequence derived from each nucleic acid molecule; A mapping information acquisition unit that supplies the acquired lead array to a mapping device and acquires the mapping result on the reference array by the mapping device; A clustering result acquisition unit for acquiring a clustering result of the mapped lead sequence; A consensus lead information acquisition unit comprising a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result; Equipped with
- an estimation device for estimating the presence of a specific base species at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture.
- the estimation device is A lead sequence acquisition unit for acquiring a lead sequence derived from each nucleic acid molecule; A mapping information acquisition unit that supplies the acquired lead array to a mapping device and acquires the mapping result on the reference array by the mapping device; A first clustering result acquisition unit for acquiring a clustering result of the mapped lead sequence; A second clustering result acquisition unit that executes a local assembly using a lead sequence included in each clustering and obtains a clustering result of the assembly sequence; A consensus lead information acquisition unit comprising a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result of the second clustering result acquisition unit; Equipped with
- the present invention it is possible to detect infrequent mutations occurring on a part of nucleic acids derived from a body sample of a subject, and to estimate the base type in the base sequence.
- FIG. 6 shows the evaluation results of sensitivity for detecting eight types of mutations in a nucleic acid mixture in Example 1.
- FIG. 8 shows the abundance ratio comparison of mutant molecules estimated from the read ratio in the nucleic acid mixture in Example 2.
- the "clonal nucleic acid mixture” in the present disclosure refers to a mixture of multiple nucleic acids generated by repeating biological replication from one nucleic acid molecule.
- the sample from which such clonal nucleic acid is obtained is not particularly limited by species, but includes, for example, human biological samples, tissue, blood, plasma, serum, urine, saliva, mucus drainage, sputum, stool and Tears are illustrated.
- the FFPE formalin fixed paraffin embedded
- the FFPE sample-derived DNA is often damaged such as short fragments, nicks and gaps, and deamination (urasylation) of cytosine. Since it directly contributes to poor analysis, it is desirable to repair the DNA using a commercially available reagent such as NEBNext FFPE DNA Repair Mix (manufactured by NEB).
- the "read sequence” in the present disclosure is base sequence data output from a sequencer.
- the “infrequently occurring base species at a specific position” in the present disclosure is not particularly limited, and includes, for example, “infrequent mutations”.
- the term "low frequency mutation” is intended to mean (a sudden) mutation of a base in a nucleic acid generated in vivo and satisfying the following two conditions a and b. a) In a nucleic acid molecule, it appears at a frequency of 1 ⁇ 10 -3 / base or less (ie, a frequency of 1 or less per 1,000 bases). b) In a sample containing nucleic acid molecules, the proportion of nucleic acid molecules having different sequences of bases at specific positions is 10% or less of the total number of nucleic acid molecules in the sample.
- low frequency mutation as an example as “base type at a specific position which exists at low frequency” as an example, but the present disclosure is not limited to detection of low frequency mutations. Absent.
- the (sudden) mutation in the present disclosure may be any of substitution, insertion and deletion. Also, the (sudden) mutation may be known or novel.
- FIG. 1 is a diagram showing a configuration of a mutation detection system (an example of a estimation system) that implements a method of estimating a base sequence according to an embodiment of the present invention.
- the mutation detection system 100 detects mutations in DNA sequences.
- the mutation detection system 100 analyzes the fragmented DNA with the sequencer 50, and inputs the lead sequence data obtained as a result of the analysis into the mutation detection apparatus 10 (an example of an estimation apparatus).
- the mutation detection device 10 acquires lead sequence data from the sequencer 50 and analyzes it to detect mutations in the DNA sequence.
- FIG. 2 is a diagram showing the procedure of a novel mutation detection method in the DNA sequence of the present embodiment. The procedure of the method for detecting mutations in DNA sequences will be described with reference to FIG.
- genomic DNA and RNA are extracted from a biological sample, fragmented so as to contain DNA (for example, 1000 base or less) of a length suitable for a sequencer, and fragmented clonal nucleic acid is generated (S1). If the biological sample is already fragmented DNA (cell-free DNA, FFPE sample, etc.) or short RNA (non-coding RNA, etc.), the fragmentation step may not be performed.
- the present invention is particularly suitably used for analysis of nucleic acids which may contain low frequency mutations.
- the clonal nucleic acid to be analyzed is DNA or RNA, preferably DNA, more preferably genomic DNA.
- the sequence length that can be analyzed may be as short as several hundred bp. Therefore, for example, when the clonal nucleic acid is genomic DNA, it is necessary to fragment it into a length that can be analyzed by a sequencer.
- fragmentation There is no limitation on the method of fragmentation, and fragmentation may be performed by a known method.
- the present invention is not limited in any way, for example, DNA Shearing System M220 / ME220 (manufactured by Covaris), which physically cleaves using ultrasound, QIAseq FX Library Kit (Qiagen), which cleaves using an enzyme.
- Commercially available instruments and reagents such as those manufactured by Preferably, physical cutting using ultrasound is used.
- a nucleic acid called "molecular barcode” is added to each fragmented clonal nucleic acid (S2).
- the template nucleic acid thus obtained is copied later, but the molecular barcode can be added to identify which clonal nucleic acid the copy is derived from.
- the molecular barcode is used.
- the molecular barcode given to the clonal nucleic acid may be one or more unique.
- the molecular barcode has various names, and may be called “unique molecular identifier (UMI)", “unique molecular tag (UMT)", or simply "molecular tag”. There is no particular limitation on the method of adding the molecular barcode to the clonal nucleic acid, but it can be prepared by binding with an enzyme such as ligase.
- a clonal nucleic acid to which a molecular barcode is added may be referred to as a template nucleic acid or a nucleic acid to be analyzed.
- nucleic acid amplification procedure it is easy to perform the below-mentioned nucleic acid amplification procedure by adding a sequence that can be aligned with a primer for PCR (polymerase chain reaction) amplification in addition to the molecular barcode to the template nucleic acid.
- a primer for PCR polymerase chain reaction
- ThruPLEX registered trademark
- Tag-Seq Kit manufactured by Takara Bio Inc.
- nucleic acid region to be analyzed when the nucleic acid region to be analyzed is determined in advance, addition of the molecular barcode to the clonal nucleic acid and amplification can be performed simultaneously by using a nucleic acid in which the PCR primer capable of amplifying the region and the molecular barcode are linked. it can.
- template nucleic acid including the target region may be concentrated by the so-called probe capture method or the like. The concentration operation may be performed after obtaining a copy.
- the molecular bar coat may further contain a sequence called "stem” or the like.
- the stem sequence is a sequence that marks the start of the molecular barcode. Although the sequence length and sequence pattern are not particularly limited, for example, a sequence of 8 to 10 bases in length is used.
- the stem sequence is placed at a position between the clonal nucleic acid and the molecular barcode.
- the resulting template nucleic acid is amplified to create a library (S3).
- S3 a library
- a nucleic acid molecule containing a low frequency mutation also increases the amount of nucleic acid, that is, copies are obtained because it requires a certain amount of nucleic acid to perform sequence analysis. There is a need.
- the method for producing the copy of the template nucleic acid is not particularly limited, a nucleic acid amplification method may be mentioned, and it is preferable to carry out by a method based on PCR from the viewpoint of ease of operation and the like.
- the sample to be decoded by the sequencer obtained in this manner is called a library.
- Mutations in the DNA sequence are analyzed using the nucleotide sequence data obtained by sequencing this library.
- amplification of the template nucleic acid is performed by a PCR-based method, but is not limited thereto.
- the method for producing the library is not particularly limited.
- the region to be analyzed may be concentrated by PCR method or probe capture method, but there is no limitation on the concentration method in practice.
- the library is input to the sequencer 50, and the base sequence is analyzed (S4). That is, the lead sequence can be obtained by nucleotide sequence sequencing by a known sequencing method.
- the platform for the sequencer is not particularly limited, but next generation sequencers are desirable when the amount of information to be analyzed is enormous, such as genomic DNA.
- HiSeq manufactured by illumina Inc.
- MiSeq ® Ion Proton TM and Ion PGM TM (manufactured by Thermo Fisher Scientific Inc.)
- PacBio ® RS II and PacBio registered TM
- Sequel TM manufactured by Pacific Biosciences, Inc.
- MinION MinION
- GridION X5 PromethION
- SmidgION SmidgION
- the file format of the base sequence data (read sequence) obtained from the sequencer 50 is FASTQ format, but may be other format.
- the FASTQ format is a format known in the art and is as shown in FIG. 3 (A).
- the meaning of each column of data is as shown in FIG.
- sequence data a Phred quality score is output as an index indicating the accuracy for base call (base designation) of each base.
- the error rate by the sequence per base is indicated by the error frequency (P error ).
- Phred quality score is calculated automatically by the next generation sequencer. Although the conversion formula to error frequency differs depending on each platform, there is no limited factor in this embodiment.
- lead filtering also referred to as lead cleaning processing is performed to remove low quality lead sequences, that is, sequence data itself containing bases with low basecall accuracy (Phred quality score).
- Lead filtering also referred to as lead cleaning
- Low quality lead sequences that is, sequence data itself containing bases with low basecall accuracy (Phred quality score).
- trimming processing may be performed to remove only the base sequence of the corresponding part.
- the trimming process is sometimes called a masking process.
- a known technique can be used for the treatment, and examples thereof include known software such as Trimmomatic and sickle (https://github.com/najoshi/sickle).
- FIGS. 4 (a) and 4 (b) are diagrams showing a more specific procedure of the mutation analysis process (S5) of FIG.
- FIG. 4 (a) shows the procedure by data analysis in consideration of PCR errors and sequence errors as described in detail later, and FIG. 4 (b) considers lead alignment. It shows the procedure by the data analysis.
- FIG. 4 (b) differs from FIG. 4 (a) only in that it includes the steps of "local assembly” (S17).
- each step of the mutation analysis process (S5) will be specifically described with reference to both FIG. 4 (a) and FIG. 4 (b).
- mapping processing is performed to classify the lead sequence by comparison with a reference sequence.
- the reference sequence is a reference sequence, and any sequence can be used.
- sequences registered in NCBI GenBank, DDBJ, UCSC Genome Browser, etc. can be used.
- the analysis target is human genomic DNA
- Genome Reference Consortium human build 38 (GRCh38) or UCSC human genome 19 (hg19) can be used as reference sequences (these versions are updated as needed, and as necessary) choose the appropriate version).
- the region to be the reference sequence can be appropriately selected according to the purpose. It may be the entire length of the sequence included in the record, or only the desired region.
- the reference sequence is compared with a consensus read described later, and a base different from the reference sequence is in the consensus read, the base is a candidate for mutation.
- the genome sequence can be generally obtained in FASTA format as a reference sequence.
- FASTA FASTA format
- the read sequence is mapped to the reference sequence acquired above.
- BWA Wellcome Trust Sanger Institute
- Bowtie Johns Hopkins University
- DNASTAR registered trademark
- SeqMan NGen made by DNASTAR
- Hisat 2 Johns Hopkins University
- NOVACRAFT Novoalign
- the mapping result is output in a known file format called a SAM / BAM format file as shown in FIG.
- BAM format files are binary files of SAM format files.
- the file format for outputting the mapping result is not limited to the BAM format.
- FIG. 5A is a SAM / BAM format file to which the result of mapping to the reference sequence (S12) is output, and FIG. 5B is added with the result of molecular bar code clustering (S13) It is a SAM / BAM format file.
- SAM / BAM format file The meaning of each column of the SAM / BAM format data is as shown in FIG.
- N (a base where the base type can not be determined) is regarded as “N” (a base having a low Phred quality score) and “completely matched base” when compared with the reference sequence. It could be used for the subsequent analysis by assuming it as
- mapping results in one group different molecular barcode read sequences may be mixed. In that case, by grouping based on molecular barcode sequences, it is possible to reorganize into correct groups.
- the unaligned read sequence contributes to analysis failure and is not used for the subsequent analysis. Therefore, there has been a problem that while obtaining a large amount of sequence data, the number of read sequences available for analysis decreases.
- the unmapped (unmapped) lead sequences are left for judgment and used for analysis in comparison with the generated family in the grouping step based on molecular barcode sequences. Determine if you can. In this way, the number of lead sequences that can be used for analysis is increased (the number of non-analyzable lead sequences is reduced).
- UMI Molecular barcode clustering
- mapping information eg, lead sequence, UMI sequence, mapping position relative to reference sequence
- each lead sequence is obtained.
- UMI sequencing step clustering of molecular barcodes
- FIG. 6A shows an example of a state after the lead arrangement is grouped by the mapping start point and end point on the reference arrangement.
- FIG. 6A exemplifies three groups (Position family) classified based on the mapping position on the reference sequence.
- FIG. 6 (B) shows an example of the state where it is further divided by the sequence of UMI (molecular barcode).
- FIG. 6 (B) illustrates two groups (UMI family) classified according to the sequence of UMI (molecular barcode).
- Step a Classify the lead sequence group independently of the UMI sequence based on the position of each lead sequence consisting of the mapping position to the reference sequence.
- the classified groups are managed as Position family. That is, a read sequence group having the same mapping start point and end point on the reference sequence described in the SAM / BAM format file is managed as a Position family.
- Step b Count the number of lead sequences in each Position family. A flag is set for Position family whose aggregate value is less than the threshold.
- Step c Classify the lead sequences based on the similarity of UMI sequences within each Position family. A group classified based on the similarity of UMI sequences is called UMI family. In addition, when two or more types of UMI exist on the same library, all UMI sequences are combined.
- L-UMI and R-UMI are combined.
- the read sequences in the Position family are further classified based on the similarity of UMI sequences. By setting a similarity threshold and allowing a certain degree of mismatch, mismatched UMIs within the threshold have the same sequence.
- a read sequence having a UMI sequence having a smaller number of occurrences than the total number of read sequences is flagged on the SAM / BAM file.
- Step d Based on the number of lead sequences in the Position family, reclassification of the final UMI family is performed with limitation to the Position family located in the vicinity.
- the Position family which has set the flag and which has a small number of reads can be integrated into the nearby Position family. If it is determined that integration is possible, the flagged position family with a small number of leads is integrated into the neighboring position family.
- determination of sequence similarity as to whether UMI sequences are identical uses editing distance such as Hamming distance.
- determination of sequence similarity is not limited to this, and, for example, similarity of Zscore-based vectors assuming (similarity) network analysis based on motif appearance frequency, phylogenetic distance and evolution distance used in molecular phylogenetic analysis An indicator such as is also selectable.
- the criteria for determining the degree of similarity may be changed depending on the length of the UMI sequence and the copy number of the template nucleic acid.
- Step a Bases with low Phred quality scores in the sequence data are masked with N (a symbol indicating an arbitrary base) and excluded from comparison targets of sequence similarity.
- Step b Adjust the edit distance threshold to allow for a mismatch between UMI sequences. That is, the read sequence group in the Position family is further classified based on the degree of similarity (such as Hamming distance) of the UMI sequence. By setting a similarity threshold and allowing a certain degree of mismatch, mismatched UMIs satisfying the threshold condition have the same sequence.
- WO 2016/176091 suggests a UMI substitution phenomenon that replaces a sequence different from the UMI sequence initially given in the PCR amplification and sequencing process (eg UMI Jumping, PCR Jumping [UMI -Tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy, Tom Smith et al. Http://dx.doi.org/ 10.1101 / gr.209601.116.], Index swapping [Characterization and remediation of ample index swaps by non-redundant dual indexing on massively parallel sequencing platforms, Maura Costello et al.
- the present disclosure is not limited in any way, for example, when the molecular barcode consists of 6 bases, a mismatch of 2 bases or less in one or both molecular barcodes, or a lead sequence of the molecular barcode region at either end If the six bases match, it is regarded as a lead sequence to be subjected to clustering. This enables clustering that takes into account the possibility of errors present in the UMI sequences themselves.
- the error present in the UMI sequence itself refers to a mismatched base, a base call bad base, a mapping error and the like.
- the existing analysis methods only mismatched bases were considered.
- the analysis accuracy can be improved by considering the base call defect base, the mapping error and the like.
- the presence rate of errors present in the molecular barcode sequence itself is 50% or less, preferably 40% or less, more preferably 35% or less, particularly preferably the molecular barcode base length. Is less than or equal to 30%, clustering is possible in consideration of the possibility of the error.
- the error present in the UMI sequence itself may be present in only one molecular barcode or in both molecular barcodes according to the invention.
- clustering is possible in consideration of the possibility of the error.
- the classification result of UMI is added to the data of the SAM / BAM format file as described above. That is, as shown in the SAM / BAM format (output format) of FIG. 5 (B), the name of the lead sequence in the first column is [chromosome number]-[mapping start point]-[mapping end point]-[alignment pattern] -[UMI sequence after classification]-[UMI sequence before classification] Furthermore, the information obtained by adding necessary information to BC: Z: sequence: sequence, XD: i: count, XP: i: status is output as option information of the SAM / BAM format file (see Table 1 below). The form of appending is not limited to this form.
- mutation detection is performed based on a SAM / BAM format file to which UMI classification results are added.
- the step of this local assembly is a step included in the mutation analysis process shown in FIG. 4 (b).
- processing is performed to determine a sequence region not smaller than the reference sequence but longer than the sample-specific read sequence length. By comparing this sequence with the reference sequence, it is possible to confirm base substitution, insertion or deletion having a sequence size equal to or greater than the length of the lead sequence.
- the local assembly of this embodiment is described in Rimmer et al. It is coded with reference to Platypus's algorithm published in (Nature Genetics, 2014), and a two-colored de Bruijn Graph is constructed from a reference sequence and a lead sequence to construct an assembly sequence. The options available for local assembly and their examples are shown in Table 2 below.
- de Bruijn graph the base sequence is first cut into base sequences of length k (k-mer) to make it a Node (apex). Furthermore, each Node (vertex) is connected by Edge (edge). A k-mer Node-Edge is generated with a plurality of read sequences or reference sequences, and a graph in which nodes having the same k-mer and the same Node are combined becomes a de Bruijn graph (see FIG. 7).
- the origin of the base sequence derived from the reference sequence or the lead sequence, presence or absence of the molecular barcode and its sequence information
- the reliability score the Phred quality score of the sequence
- de Bruijn graph targets the partial length of the analyzed gene and can change the sequence length (which can be changed by the window size option in Table 2). Create using the read array mapped to the same region as the reference sequence corresponding to the specified target region. Further, the read sequences to be used include the read sequences themselves mapped to the same region and paired reads (that is, unmapped reads not mapped to the reference sequence or reads mapped to the outside of the target region). Also, in this embodiment, the color of the Edge of the graph is set to 0 for the reference array side, 1 for the lead array side, and 0.5 for the portion where both exist is derived from the Edge (edge). It is possible to distinguish.
- Dijkstra's algorithm which is an algorithm based on the best priority search for solving the single-source shortest path problem when the weight of an edge is a non-negative number, in path search on a graph.
- Adopt priority queue
- individual search paths become assemble sequences, and further, Edges specific to the reference sequence or edge specific to the read sequence can be discriminated by the difference in color of Edges on the paths, so that mutation candidates can be identified.
- the route search method and algorithm are not limited to the above.
- Consensus lead (S14) As described above, especially in "3) Molecular Barcode (UMI) clustering (S13)", based on UMI sequences, taking into account PCR errors and sequence errors, the same UMI sequences (same UMI family) A common read sequence, ie, a consensus read, is output from the read sequence having.
- the consensus lead may be simply referred to as a consensus, a consensus sequence, or a consensus lead sequence.
- FIG. 8A is a view for explaining the processing steps of the consensus read process for obtaining a consensus lead. (See “(a) Data Analysis Taking Account of PCR Error and Sequence Error” below).
- FIG. 9A is a view for explaining the processing steps for obtaining a consensus lead. (See “(b) Data analysis considering lead alignment (Large Indel detection)” below). Furthermore, there is another method in “(a) Data analysis in consideration of PCR error and sequence error” (see “(a ′) Another method of data analysis in consideration of PCR error and sequence error”).
- a consensus read and a reliability score (UQ) for each base in the sequence are used. Do.
- the reliability score (UQ) is calculated for each base in each UMI family, and the reliability score (UQ) is expressed by the following equation.
- the reliability score (UQ) for each base in each UMI family is not particularly limited, for example, in the Bayesian approach, it can be estimated as a posterior distribution. Specifically, in each UMI family, the likelihood for each target base (each base) at the corresponding place is estimated.
- the target bases are a subset of the whole set ⁇ , and all insertions or deletions detected in all four types of base pairs (A, T, G, C) and all UMI families mapped onto the same reference sequence It is composed of (other obs. Alleles). In the example of FIG.
- the statistical model of the well-known software SmCounter is a specification with low detection sensitivity to insertion or deletion bases.
- the detection sensitivity also increases for base insertion and base deletion in addition to base substitution.
- each base is a likelihood function
- P (each base) is a prior distribution showing the appearance rate of the target base.
- the prior distribution can adopt a uniform distribution such as 1 / ⁇ . However, if there is any information about ⁇ or P (each base) from prior data etc., it is possible to change.
- the likelihood function expresses a sequence error (S error ) and a PCR amplification error.
- the first term indicates the frequency of sequence errors (S error ) and the second term indicates the frequency of occurrence of PCR amplification errors.
- the sequence error (error from A to C) is S error frequency in the case of three read sequences where the target base is C. It occurs in On the other hand, in the case of four read sequences in which the target base is A, the sequence is correctly performed with the frequency of (1-S error ), and the likelihood under this condition is determined.
- n y indicates the number (redundancy) of the read sequences of each constituent base (Y) of the set ⁇ .
- Y A
- n y is four.
- T and G which could not actually be detected in the set ⁇
- n y will be zero, and the calculation of the first term will not be performed. That is, only the likelihood (second term) due to PCR amplification error is calculated.
- the sequence error (S error ) is calculated based on P error calculated from the above-mentioned Phred quality score.
- SNV single nucleotide substitution
- the second term of the likelihood function is a code based on the frequency of PCR amplification errors (misC p ). misC p is calculated from PCR error (x).
- the PCR error (x) is determined by the product of the probability that one mutation occurs in the PCR cycle of x cycles and the probability that the PCR product of x cycles is included in the PCR product after the final cycle (n).
- the probability of the appearance of mutations in the former follows the Poisson distribution.
- the parameters of Poisson distribution can be obtained as expected values from the error appearance rate ( ⁇ ) of the PCR enzyme and the amplification efficiency ( ⁇ ) of the PCR enzyme.
- the latter probability is calculated as the expected value (maximum likelihood) of the binomial distribution. Therefore, PCR error (x) can be obtained by the following equation.
- PCR error (x) it is important to estimate the number of PCR cycles in which a mutation appears, but in this embodiment, this estimation is performed based on the number of lead sequences of each base in the UMI family. .
- the ratio is 3/7 (43.8%).
- the proportion of the number of lead sequences is approximately shown by the following formula of the proportion of PCR products derived from PCR amplification errors contained in all PCR products.
- Sm (1 + ⁇ m ) n x / So (1 + ⁇ w ) n
- Sm is an initial nucleic acid copy number of PCR amplification error and is 1.
- So is the first initial nucleic acid copy number, which is 1.
- ⁇ m is the PCR amplification efficiency for the nucleic acid product of PCR error.
- ⁇ w is the PCR amplification efficiency for nucleic acid products without PCR errors.
- PCR amplification error and sequence error are considered and treated as mutually exclusive events, but the effect of interaction (both PCR amplification error and sequence error Cases that arise may apply) and other factors may also be added to the equation.
- Other factors include mapping errors (in the case where the read sequence is mapped to different places and different places) and base substitution by oxidation reaction.
- the reliability score (UQ) is calculated for each base in each UMI family.
- FIG. 9A is a diagram showing processing steps for obtaining a consensus lead.
- the maximum posterior probability P is applied to a route having 10 base or more of Insert and / or deletion from the shortest path (assembly sequence, haplotype candidate, H) of each local assembly and mutation list derived from each. Determine ⁇ each haplotype
- the maximum posterior probability P in the present embodiment corresponds to the reliability score (UQ) described in (a) above.
- the first half shows the consideration portion of the mapping (alignment) distance of the lead, and the second half shows the consideration portion of the error of the sequence.
- ⁇ , ⁇ Probability of the distance between the leads to the insert size (normal distribution ⁇ , ⁇ )” in the first half is all with the same UMI sequence (same UMI Family, UMI j )
- the read sequence (k ⁇ ReadPair) is realigned to the sequence of the haplotype to represent the probability for the distance between the reads (dist k ).
- the population in this embodiment is a normal distribution defined by the mean value ( ⁇ ) and the variance ( ⁇ 2 ) of the distances between the leads on the reference sequence of all the leads, and the distribution of Insert size (normal distribution) I assume.
- the lead sequence used in the present embodiment can be selected by the mapping information to the reference sequence. For example, in this embodiment, a read having I / D / S / H (unmapped read not mapped to the reference sequence) in Cigar (see FIG. 5C) of the read sequence which can be obtained from the mapping result of the SAM / BAM format Extract). Furthermore, the leads having the same UMI are also extracted (even when the above conditions are not met).
- leads having mutations at the corresponding positions are selected based on Cigar and NM tags. Furthermore, the same UMI reads are also extracted regardless of mutations. All mutations in the corresponding part are represented by a set ⁇ , and an element Ai composed of A, T, G, C and all mutations (obs Allele) in the corresponding part belongs.
- a specific base type present at a specific position is arbitrarily selected from a set of consensus leads.
- "arbitrary” means “all cases” and “all cases”.
- the “specific position” means that the base sequence of the reference sequence and the consensus lead is different.
- the “specific base type” means a base of a consensus lead different from the base of the reference sequence.
- the position and base type to be confirmed in the clonal nucleic acid may be set (selected) arbitrarily.
- the site and base type for the mutation may be selected in advance.
- a position where a base type different from the reference sequence is present may be extracted by comparing the reference sequence with the consensus lead.
- the base type may also be a base type corresponding to the insertion site or a base type equivalent to deletion (in this case, treatment with no such equivalent base type).
- the subsequent estimation step may not be performed.
- a parametric error model using a reliability score estimates the occurrence rate of appearance of a specific base by analysis error at a specific position at a set threshold (or significance level).
- the basic model of base type estimation builds a parametric error model by Poisson distribution, and performs base type estimation at arbitrary places. Depending on the embodiment, it is possible to change the error model. Besides Poisson distribution, for example, tests with multi-valued variables (prior distribution is Dirichlet distribution), tests with beta binomial distribution with binary variables, supervised learning models, etc. can be used as parametric error models of the present invention.
- Base type estimation includes the following steps a to d.
- Step a) Select a target location (chr3: 38, 930 in FIG. 8 (B)).
- Step b) Extract all UMI families mapped to the corresponding part (in FIG. 8 (B), extract j UMI families).
- Process c) Acquire the score of the consensus and UQ in the corresponding part of each UMI family.
- Step d) Select only the consensus lead and UQ score different from the reference sequence (A) having the base type to be confirmed.
- Step e) An error model (Poisson distribution) is constructed from the information obtained through steps a to d (pattern of consensus lead having selected base species and UQ score).
- the parameter ⁇ of the Poisson distribution can preferably be expressed as follows.
- the parameter ⁇ of Poisson distribution can be calculated according to the following equation.
- ⁇ 1000 * (1 ⁇ argminP ⁇ C
- UMI ⁇ indicates the minimum value among five UQ scores UQ (c) (P ⁇ C
- Mutation detection S16 Mutations are detected using the error model (Poisson distribution) constructed as described above. That is, when m UMIs are obtained from the error model, it is possible to estimate the probability that n consensus leads can be obtained by the errors.
- the above-mentioned threshold is also called a significance level, and is generally set to 0.05 (5%), but if necessary, 0.01 (1%) or 0.001 (0.1%) Set to). Then, the probability estimated from the error model is compared with a predetermined threshold (such as 1% level or 5% level). If the estimated probability is lower than a predetermined threshold value, the hypothesis due to an error will be rejected, so it is determined to be the selected base type. This makes it possible to estimate the presence of a base type in which only one of 1000 molecules is present. Thus, the present disclosure can be suitably used for detection of low frequency mutations.
- the error model can be changed as appropriate.
- the ratio that the specific base species at the specific base position estimated as described above is present in the clonal nucleic acid mixture can be calculated by the following formula.
- the ratio is also called mutation frequency.
- Ratio (mutation frequency) (number of UMIs of consensus reads with mutations mapped to mutation sites) / (total UMI number of consensus reads mapped to mutations sites)
- the calculated ratio can be regarded as the percentage (low abundance) of low frequency mutations present in the clonal nucleic acid mixture.
- the specific base type at the specific base position can be present in the clonal nucleic acid mixture for all bases in the specific region to be analyzed, or the base type at the specific base position is unknown
- the aggregated sequence information can be the sequence of low frequency variants present in the clonal nucleic acid mixture.
- the specific base type at the specific base position can be estimated to be present in the clonal nucleic acid mixture, or the base type at the specific base position can be estimated unknown, the result obtained by verifying It applies to the read to create secondary sequence information, and further integrates all secondary sequence information of a specific region obtained by repeating the above-mentioned process while changing the base position, and each diffusion constituting the clonal nucleic acid mixture It can be employed as the base sequence of a specific region of the molecule.
- mutation filtering can be further performed.
- P.I. Flaherty et al., Nucleic Acids Res. Vol. 40, No. 1, page e2, 2012 (DOI: 10.1093 / nar / gkr861)
- the mutations detected in normal samples are excluded from the mutations detected in tumor samples You may Thereby, mutations specific to the tumor sample can be extracted (filtered).
- Example 1 Detection of Infrequent Mutations in Fragmented Nucleic Acid Mixture
- 4 types of EGFR, KRAS, NRAS and PIK3CA are used.
- Multiplex I cfDNA Reference Standard Set (horizon) as a human-derived clonal nucleic acid having, at different mixing ratios, a copy of a wild-type gene and a copy of eight mutant genes in which rare mutations occur in the same gene in a human gene group. Company company) was used.
- the Multiplex I cfDNA Reference Standard Set is a standard product of clonal nucleic acid artificially fragmented into about 160 bp by resembling cell-free nucleic acid (cell-free DNA) in plasma.
- ThruPLEX Tag-seq 6S (12) Kit manufactured by Takara Bio Inc.
- a Stem-Loop adapter was added to both ends of 50 ng of clonal nucleic acid by a ligation method to obtain a template nucleic acid.
- PCR amplification was performed with a universal primer having an adapter sequence to prepare an amplified nucleic acid product (adapter + template nucleic acid) for Illumina's sequencer.
- UMI is added to each template nucleic acid before amplification because UMI of random sequence consisting of 6 bases is contained in the Stem-Loop adapter (total 2 UMI is included) ).
- sequences at both ends 100 bp for each sample were performed to obtain base sequence data (read sequence data).
- the read sequence data was mapped to the human reference sequence (hg19) using BWA-MEM (version: 0.7.15). From the resulting SAM / BAM format file, mapping information for the reference sequence and UMI sequence information were obtained for each read sequence.
- Molecular barcode clustering was performed on a specific target area based on the mapping position information and UMI sequence information according to the method described above.
- the conditions for classification are as follows. Condition 1: Two or less mismatched bases are identical UMI. Condition 2: Even if the condition 1 is not satisfied, if at least one UMI sequence is identical, the same UMI is used.
- Consensus lead generation was performed according to the method described above. Thereafter, the bases of 178, 935, 997-178, 936, 121 bp corresponding to the exon 10 of the PIK3CA gene located on the 3rd chromosome are calculated from the Phred quality score given to the read sequence of the Illumina sequencer of 1) The error rate Perror and the error rate 1-UQ (base) calculated from 2) UQ were calculated, and two types of error rates were compared. The results are shown in FIG. In FIG. 10, using the SAM / BAM file after molecular bar code clustering, the above two types of error rates are calculated for the base with the largest number of UMI in each location, and the Phred scale ( ⁇ 10 log 10 (error It plotted with the rate).
- the error rate ( ⁇ in FIG. 10) calculated from the Phred quality score is between 30 and 40 on the Phred scale, but the error rate based on UQ (FIG. 10) ⁇ ) indicates a value over 50, and the error rate was reduced to about 1/10 times over the entire area.
- LoFreq is a mutation detection software reported in a paper (Wilm A et al., Nucleic Acids Res. 2012 40 (22): 11189-11201), and Bernoulli trial is performed based on the sequence error rate calculated from Phred quality score. It is software that is implemented to detect mutations.
- FIG. 11 after acquiring consensus leads using Connor (version: 0.5.1 https://github.com/umich-brcf-bioinf/Connor) for SAM / BAM files mapped by BWA Mutation detection was performed.
- SmCounter is software that performs error correction based on molecular barcodes by the Bayesian approach to detect mutations (Non-Patent Document 3).
- FIG. 11 shows the result of mutation detection using the SmCounter, with the SAM / BAM format file subjected to molecular barcode clustering (the disclosed method) as an input file.
- the analysis code of the present invention detects all eight mutant genes with a mixing ratio of 1.0% indicated by hatching, and the detection sensitivity is It was 100% (8 types / 8 types). Furthermore, while the mixing ratio is 0.1%, five kinds of mutations are shown: hatched SNV of EGFR gene L858R and Deletion of ⁇ 746-750, SN12 of NRAS gene G12D and A59T, and SNK of PIK3CA E545K. The detection sensitivity was 62.5% (5 types / 8 types). The p-values of the detected mutations were all on the order of E-06 (10 -6 ), which were lower than the threshold (1% level).
- the detection method of the present embodiment showed a detection sensitivity superior to that of the conventional method, and furthermore, the mutation could be detected even in a very low mutation which could not be detected by the conventional method.
- Example 2 Quantification of mutation frequency in a nucleic acid mixture
- a copy of a wild-type gene further Tru-Q7 (1.3% Tier) Reference Standard (Part No. :) with copies of 35 mutant genes in which rare mutations occur in the same gene at various mixing ratios of 1.0% -30%. HD734) (manufactured by Horizon) was used as a template nucleic acid.
- Example 1 a library for Illumina sequencer was prepared using ThruPLEX Tag-seq 6S (12) Kit, and sequence reads were obtained by HiSeq.
- UMI classification can be realized with high accuracy by clustering of molecular barcodes according to the method of the present disclosure.
- low-frequency mutations could be detected with high accuracy even with degraded (fragmented) low-quality genomic DNA.
- FIG. 13 is a block diagram showing a specific internal configuration of the mutation detection device 10 for performing the mutation detection method described above.
- the mutation detection device 10 is configured by an information processing device such as a personal computer.
- the mutation detection device 10 includes a control unit 11 that controls the overall operation, a display unit 13 that performs screen display, an operation unit 15 that a user performs operations, a data storage unit 17 that stores data and programs, and an external device. And an interface unit 18 which communicates with the network, and a RAM (Random Access Memory) 16 which is a main memory for the control unit 11 to perform processing.
- a RAM Random Access Memory
- the display unit 13 is configured of, for example, a liquid crystal display or an organic EL display.
- the operation unit 15 includes a keyboard, a mouse, a touch panel, and the like.
- the interface unit 18 can connect various devices (printer, communication device, input device, etc.) conforming to an interface such as USB, HDMI (registered trademark), SCSI, etc., and data between the connected device and the mutation detection device 10 And enable communication of control commands.
- the mutation detection apparatus 10 acquires the read sequence data from the next-generation sequencer 50 via the interface unit 18.
- the control unit 11 controls the entire operation of the mutation detection device 10, and is configured by a CPU or an MPU that realizes a predetermined function by executing a program.
- the program executed by the control unit 11 can be provided via a communication line, or a recording medium such as a CD, a DVD, a memory card or the like.
- the control unit 11 may be configured by a dedicated hardware circuit (FPGA, ASIC, etc.) designed to realize a predetermined function.
- the program uses, for example, a computer language such as JAVA (registered trademark), C, C ++, C #, OBJECTIVE-C, SWIFT, or a script language such as PERL, PYTHON, BIO-RUBY, or any suitable language.
- JAVA registered trademark
- C C
- C ++ C #
- OBJECTIVE-C SWIFT
- SWIFT or a script language
- PERL PYTHON
- BIO-RUBY BIO-RUBY
- the software code may be stored as a sequence of instructions on a computer readable medium for storage and / or transmission, such as RAM 16. Alternatively, it can be transmitted using a carrier signal that is encoded and adapted for transmission over wired, optical, and / or wireless communication lines that conform to various protocols, including the Internet.
- the data storage unit 17 is a device that stores data and programs, and can be configured by, for example, a hard disk (HDD), an SSD (SOLID STATE DRIVE), a semiconductor memory device, or an optical disk.
- the data storage unit 21 stores a control program executed by the control unit 11, read sequence data, and the like.
- FIGS. 14A and 14B are flowcharts showing processing executed by the control unit 11 of the mutation detection device 10 of FIG. Among them, the flowchart of FIG. 14A shows a process related to data analysis in consideration of PCR errors and sequence errors. The flowchart of FIG. 14B shows a process related to data analysis in consideration of lead alignment.
- the mutation detection device 10 acquires the read sequence data from the sequencer 50 via the interface unit 18 (S51).
- the control unit 11 of the mutation detection device 10 performs mapping processing of each read sequence data to a reference sequence (S52).
- the control unit 11 classifies the read sequence data based on the position on the reference sequence and the UMI according to the method described above (S53). Thereafter, the control unit 11 performs a consensus read process to create a consensus read for each UMI family (S54).
- the control unit 11 constructs an error model (S55). Specifically, the control unit 11 calculates a reliability score UQ, obtains a parameter ⁇ of an error model (Poisson distribution) from the pattern of consensus leads and the reliability score UQ, and constructs an error model.
- S55 an error model
- the control unit 11 calculates a reliability score UQ, obtains a parameter ⁇ of an error model (Poisson distribution) from the pattern of consensus leads and the reliability score UQ, and constructs an error model.
- the control unit 11 detects a mutation with reference to the error model (S56). Specifically, the control unit 11 selects the position and base type for which the mutation is to be confirmed, and estimates the occurrence probability of the mutation with reference to the error model. Then, when the estimated value is smaller than the threshold value, the control unit 11 determines that no mutation has occurred, and determines that the selected base type is correct. In the flow chart shown in FIG. 14 (a), small sized mutations can be detected.
- control unit 11 performs mutation filtering (S57) and excludes a mutation that satisfies a predetermined condition from the mutation detection result.
- the mutation detection device 10 acquires the read sequence data from the sequencer 50 via the interface unit 18 (S51).
- the control unit 11 of the mutation detection device 10 performs mapping processing of each read sequence data to a reference sequence (S52).
- the control unit 11 classifies the read sequence data based on the position on the reference sequence and the UMI according to the method described above (S53).
- control unit 11 sets the reference sequence to “clustering of molecular barcodes (S53)” and (following) “consensus read processing (S54)”.
- a local assembly process (S58) is performed to determine a sequence region longer than the sample-specific read sequence length.
- control unit 11 performs a consensus read process to create a consensus read for each UMI family (S54).
- control unit 11 constructs an error model (S55). Specifically, the control unit 11 calculates a reliability score UQ, obtains a parameter ⁇ of an error model (Poisson distribution) from the pattern of consensus leads and the reliability score UQ, and constructs an error model.
- the control unit 11 detects a mutation with reference to an error model (S56). Specifically, the control unit 11 selects the position and base type for which the mutation is to be confirmed, and estimates the occurrence probability of the mutation with reference to the error model.
- the control unit 11 determines that no mutation has occurred, and determines that the selected base type is correct. Furthermore, if necessary, the control unit 11 performs mutation filtering (S57) and excludes a mutation that satisfies a predetermined condition from the mutation detection result.
- the mutation detection apparatus 10 can detect a mutation in the DNA sequence based on the data of the read sequence from the sequencer 50.
- a method of using a computer to estimate the presence of a specific base type present at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture a) obtaining a lead sequence derived from each nucleic acid molecule, b) mapping the read sequence obtained in step a onto a reference sequence to obtain a mapping result; c) obtaining a clustering result of the lead sequence based on the mapping result obtained in step b, d) obtaining consensus lead information consisting of a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result, e) From the sequence information of each consensus lead contained in the set of consensus reads obtained from each nucleic acid molecule constituting the clonal nucleic acid mixture through steps a to d, the base type present at the position corresponding to the specific position on the reference sequence And the process of selecting their reliability score, f) obtaining a ratio of the number of consensus reads containing each specific base species at the
- step h a step of presuming presence in clonal nucleic acid mixture at a significant level, and i) containing the base species at the specific position obtained in step f, when it is judged that the base species selected in step e is present in step h
- step h Calculating the ratio of the specific base species at the specific position selected in step e at the threshold or significance level set in step g from the number of consensus leads, and adopting it as the ratio present in the clonal nucleic acid mixture, Method including.
- step b when the reference sequence adopted in step b and the base sequence on the consensus lead are the same, the base of the secondary information prepared in step i as it is without passing through steps e to h You may adopt as a seed.
- the specific base position selected in step e may be a position different from the reference sequence adopted in step b and the base type on the consensus read .
- the nucleic acid molecule may have a molecular barcode at at least one end.
- each nucleic acid molecule may have a different molecular barcode.
- each nucleic acid molecule may have one or more different molecular barcodes.
- the clustering in step c may be performed using the arrangement of the molecular barcode region in the lead sequence as an index.
- the molecular barcode may be a random sequence of base sequences.
- the mismatch in the lead sequence of the molecular barcode region may be within the allowable range of the sequence similarity between the lead sequences of the molecular barcode region.
- the reliability score may be obtained by the Bayesian approach from the information on the clustered lead sequences.
- the lead sequence may be a lead sequence of the amplification product of the nucleic acid molecule.
- an amplification product may be obtained by polymerase chain reaction.
- the likelihood function of the Bayesian approach may be a function taking into account sequencer errors and amplification errors.
- the parametric error model of step g may be based on a Poisson distribution.
- the specific base type at the specific position may be a target base of low frequency mutation.
- step c may proceed directly to step d.
- (21) A method comprising the step of determining a specific base species present at a specific position in the base sequence of the diffusion molecule constituting the clonal nucleic acid mixture based on the results obtained by the methods of (19) and (20). May be
- An apparatus for estimating the presence of a specific base type at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture A lead sequence acquisition unit for acquiring a lead sequence derived from each nucleic acid molecule; A mapping information acquisition unit that supplies the acquired lead array to a mapping device and acquires the mapping result on the reference array by the mapping device; A clustering result acquisition unit for acquiring a clustering result of the mapped lead sequence; A consensus lead information acquisition unit comprising a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result; An estimation device equipped with
- An apparatus for estimating the presence of a specific base type at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture A lead sequence acquisition unit for acquiring a lead sequence derived from each nucleic acid molecule; A mapping information acquisition unit that supplies the acquired lead array to a mapping device and acquires the mapping result on the reference array by the mapping device; A first clustering result acquisition unit for acquiring a clustering result of the mapped lead sequence; A second clustering result acquisition unit that executes a local assembly using a lead sequence included in each clustering and obtains a clustering result of the assembly sequence; A consensus lead information acquisition unit comprising a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result of the second clustering result acquisition unit; An estimation device equipped with
Landscapes
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biophysics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Apparatus Associated With Microorganisms And Enzymes (AREA)
Abstract
Description
本発明は、塩基配列における塩基種を推定する技術に関する。 The present invention relates to a technique for estimating a base type in a base sequence.
がん細胞は、正常な細胞中のがん遺伝子及びがん抑制遺伝子(がん関連遺伝子)に塩基配列の突然変異が生じることにより発生すると言われている。近年、これらのがん関連遺伝子の遺伝子異常を応用して、正常な細胞とがん細胞を見分ける手法の開発が進められている。例えば、非特許文献1および非特許文献2では、血中や便中に存在する微量ながん細胞由来の異常なDNAを検出できたことが開示されている。さらに、がん関連遺伝子の変異によって翻訳された異常タンパク質を標的とする分子標的薬の登場により、塩基配列を高精度に決定することが求められる。
Cancer cells are said to be generated by mutation of base sequences in oncogenes and tumor suppressor genes (cancer related genes) in normal cells. In recent years, development of a method to distinguish normal cells from cancer cells has been promoted by applying genetic abnormalities of these cancer-related genes. For example, Non-Patent
次世代シーケンサーによる腫瘍検体の変異の包括的なプロファイリングの結果、一つの腫瘍のなかには、正常な細胞とは異なる配列のゲノムを持つ、がん細胞由来の複数のゲノムクローンが存在することが報告され、この現象は腫瘍内不均一性と呼ばれている。この現象により、腫瘍自体がポリクローナルな混合状態となっていることから、低頻度な変異も含めて網羅的に検出する必要性がある。 As a result of comprehensive profiling of mutations in tumor specimens by next-generation sequencers, it has been reported that in one tumor, there are multiple genomic clones derived from cancer cells, which have genomes of sequences different from normal cells. This phenomenon is called intratumoral heterogeneity. Since this phenomenon causes the tumor itself to be in a polyclonal mixed state, there is a need for exhaustive detection including low frequency mutations.
現在の次世代シーケンス技術では、核酸増幅時及びシーケンシング時にある頻度でエラーが生じる。したがって、シーケンス解析を実施したゲノムDNAの塩基配列には技術的なエラーに起因する偽陽性の変異(false positive variant)が含まれるため、1~5%以下の低頻度な変異の検出は難しい。しかし近年、分子バーコード技術を応用することによって、低頻度な変異検出時のエラーが抑制されることが報告されている。 In current next generation sequencing technology, errors occur at a certain frequency during nucleic acid amplification and sequencing. Therefore, detection of low frequency mutations of 1 to 5% or less is difficult because the base sequences of genomic DNA subjected to sequence analysis include false positive mutations caused by technical errors. However, in recent years, it has been reported that the application of molecular barcode technology suppresses errors in low frequency mutation detection.
例えば、非特許文献3には、配列解析装置より得られた塩基配列情報の統計処理により検出精度を高める技術としてsmCounterが開示されているが、リキッドバイオプシーなど、より高い検出感度と精度を求められる場合には精度が足らず、不十分である。
For example, although
本発明は、塩基配列における塩基種を推定する方法、装置、及びその方法を実施するためのコンピュータープログラムを提供することを目的とする。 An object of the present invention is to provide a method, an apparatus, and a computer program for carrying out the method for estimating a base type in a base sequence.
本発明者は、テンプレートDNAに分子バーコードを付加させて調製したDNA断片から得られた次世代シーケンサーの多量の塩基配列データ(リード情報)から、分子バーコード毎に信頼性スコアを決定し、パラメトリック解析することで、従来法よりも高い感度で変異を検出できることを見出した。 The present inventor determines the reliability score for each molecular barcode from the large amount of base sequence data (read information) of the next-generation sequencer obtained from the DNA fragment prepared by adding the molecular barcode to the template DNA, We found that parametric analysis can detect mutations with higher sensitivity than conventional methods.
また、本発明者は、シーケンスデータから低頻度に存在する、特定位置の塩基種を推定するために、核酸増幅又はシーケンシングで発生するエラーを効率的に取り除く指標を算出し、真の塩基種を高確度に判定する統計手法を開発した。 Also, in order to estimate the base type at a specific position, which exists infrequently from the sequence data, the present inventor calculates an index for efficiently removing an error generated by nucleic acid amplification or sequencing, and the true base type We have developed a statistical method that determines with high accuracy.
本発明の検出精度を高める技術は、例えば、より疾患早期ステージでの変異体核酸の検出を期待できる。 The technique for enhancing detection accuracy of the present invention can be expected, for example, to detect variant nucleic acids at an earlier disease stage.
本開示の第一の態様において、コンピュータを用いて、クローナル核酸混合物を構成する核酸分子の塩基配列における特定位置に存在する特定塩基種の存在を推定する第一の推定方法が提供される。
第一の推定方法は、
a)各核酸分子に由来するリード配列を取得する工程、
b)工程aで取得したリード配列を参照配列上にマッピングしマッピング結果を取得する工程、
c)工程bで得られたマッピング結果を基にリード配列のクラスタリング結果を取得する工程、
d)クラスタリング結果を基にコンセンサスリードとコンセンサスリード中の各塩基に対応する信頼性スコアからなるコンセンサスリード情報を取得する工程、
e)クローナル核酸混合物を構成する各核酸分子から工程a~dを経て得られるコンセンサスリードの集合に含まれる各コンセンサスリードの配列情報より、参照配列上の特定位置に対応する位置に存在する塩基種とその信頼性スコアを選抜する工程、
f)工程eで使用した特定位置を含むコンセンサスリードの総数に対する、工程eで選抜した特定位置に特定の各塩基種を含有するコンセンサスリード数の比率を取得する工程、
g)工程dで取得した信頼性スコアを用いたパラメトリックなエラーモデルにより、設定された閾値または有意水準での工程eで選抜した特定位置において解析エラーによって特定の塩基が出現する出現率を推定する工程、
h)工程fで得られた比率が工程gにより得られた出現率より有意に高い場合に、工程eで選抜した特定位置に特定の塩基種を有する核酸分子が、工程gで設定された閾値または有意水準でクローナル核酸混合物中に存在すると推定する工程、
を含む。
In a first aspect of the present disclosure, a computer is used to provide a first estimation method for estimating the presence of a specific base type present at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture.
The first estimation method is
a) obtaining a lead sequence derived from each nucleic acid molecule,
b) mapping the read sequence obtained in step a onto a reference sequence to obtain a mapping result;
c) obtaining a clustering result of the lead sequence based on the mapping result obtained in step b,
d) obtaining consensus lead information consisting of a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result,
e) From the sequence information of each consensus lead contained in the set of consensus reads obtained from each nucleic acid molecule constituting the clonal nucleic acid mixture through steps a to d, the base type present at the position corresponding to the specific position on the reference sequence And the process of selecting their reliability score,
f) obtaining a ratio of the number of consensus reads containing each specific base species at the particular position selected in step e to the total number of consensus leads including the particular position used in step e,
g) Estimate the occurrence rate of appearance of a specific base due to an analysis error at a specified position selected in step e at a set threshold or significance level by a parametric error model using the reliability score obtained in step d Process,
h) When the ratio obtained in step f is significantly higher than the appearance rate obtained in step g, the threshold value set in step g is a nucleic acid molecule having a specific base type at a specific position selected in step e. Or estimating at a significant level as present in the clonal nucleic acid mixture,
including.
本開示の第二の態様において、コンピュータを用いて、クローナル核酸混合物を構成する核酸分子の塩基配列における特定位置に存在する特定塩基種がクローナル核酸混合物中で存在する割合を推定する第二の推定方法が提供される。
第二の推定方法は、
a)各核酸分子に由来するリード配列を取得する工程、
b)工程aで取得したリード配列を参照配列上にマッピングしマッピング結果を取得する工程、
c)工程bで得られたマッピング結果を基にリード配列のクラスタリング結果を取得する工程、
d)クラスタリング結果を基にコンセンサスリードとコンセンサスリード中の各塩基に対応する信頼性スコアからなるコンセンサスリード情報を取得する工程、
e)クローナル核酸混合物を構成する各核酸分子から工程a~dを経て得られるコンセンサスリードの集合に含まれる各コンセンサスリードの配列情報より、参照配列上の特定位置に対応する位置に存在する塩基種とその信頼性スコアを選抜する工程、
f)工程eで使用した特定位置を含むコンセンサスリードの総数に対する、工程eで選抜した特定位置に特有の各塩基種を含有するコンセンサスリード数の比率を取得する工程、
g)工程dで取得した信頼性スコアを用いたパラメトリックなエラーモデルにより、設定された閾値または有意水準での工程eで選抜した特定位置において解析エラーによって特有の塩基が出現する出現率を推定する工程、
h)工程fで得られた比率が工程gにより得られた出現率より有意に高い場合に、工程eで選抜した特定位置に特定の塩基種を有する核酸分子が、工程gで設定された閾値または有意水準でクローナル核酸混合物中に存在すると推定する工程、及び
i)工程hにおいて工程eで選抜した塩基種が存在すると判断された場合、工程fで得られた特定位置において塩基種を含有するコンセンサスリードの数から、工程gで設定された閾値または有意水準における工程eで選抜した特定位置における特定塩基種の比率を算出し、クローナル核酸混合物中で存在する割合として採用する工程、
を含む。
In a second embodiment of the present disclosure, a second estimation is performed using a computer to estimate the proportion of a specific base species present at a specific position in a base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture in the clonal nucleic acid mixture. A method is provided.
The second estimation method is
a) obtaining a lead sequence derived from each nucleic acid molecule,
b) mapping the read sequence obtained in step a onto a reference sequence to obtain a mapping result;
c) obtaining a clustering result of the lead sequence based on the mapping result obtained in step b,
d) obtaining consensus lead information comprising a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result,
e) From the sequence information of each consensus lead contained in the set of consensus reads obtained from each nucleic acid molecule constituting the clonal nucleic acid mixture through steps a to d, the base type present at the position corresponding to the specific position on the reference sequence And the process of selecting their reliability score,
f) obtaining a ratio of the number of consensus reads containing each base species specific to the particular position selected in step e to the total number of consensus leads including the particular position used in step e,
g) Estimate the occurrence rate of a distinctive base due to an analysis error at a specific position selected in step e at a set threshold or significance level, using a parametric error model using the reliability score obtained in step d Process,
h) When the ratio obtained in step f is significantly higher than the appearance rate obtained in step g, the threshold value set in step g is a nucleic acid molecule having a specific base type at a specific position selected in step e. Or a step of presuming presence in clonal nucleic acid mixture at a significant level, and i) containing the base species at the specific position obtained in step f, when it is judged that the base species selected in step e is present in step h Calculating the ratio of the specific base species at the specific position selected in step e at the threshold or significance level set in step g from the number of consensus leads, and adopting it as the ratio present in the clonal nucleic acid mixture,
including.
本開示の第三の態様において、コンピュータを用いて、クローナル核酸混合物を構成する各核酸分子の特定領域の塩基配列を推定する第三の推定方法が提供される。
第三の推定方法は、
a)各核酸分子に由来するリード配列を取得する工程、
b)工程aで取得したリード配列を参照配列上にマッピングしマッピング結果を取得する工程、
c)工程bで得られたマッピング結果を基にリード配列のクラスタリング結果を取得する工程、
d)クラスタリング結果を基にコンセンサスリードとコンセンサスリード中の各塩基に対応する信頼性スコアからなるコンセンサスリード情報を取得する工程、
e)クローナル核酸混合物を構成する各核酸分子から工程a~dを経て得られるコンセンサスリードの集合に含まれる各コンセンサスリードの配列情報より、参照配列上の特定位置に対応する位置に存在する塩基種とその信頼性スコアを選抜する工程、
f)工程eで使用した特定位置を含むコンセンサスリードの総数に対する、工程eで選抜した特定位置に特定の各塩基種を含有するコンセンサスリード数の比率を取得する工程、
g)工程eで選抜した特定位置の塩基種に対応する工程dで取得した信頼性スコアを用いたパラメトリックなエラーモデルにより、設定された閾値または有意水準での工程eで確認した特定位置における塩基種の解析エラーによる出現率を推定する工程、
h)工程fで得られた比率が工程gにより得られた出現率より有意に高い場合に、工程eで選抜した特定位置に特定の塩基種を有する核酸分子が、工程gで設定された閾値または有意水準でクローナル核酸混合物中に存在する、あるいは該出現率より低い場合は特定位置の塩基種は不明、と判断する工程、
i)工程hで得られた結果を該塩基の各コンセンサスリードに適用し二次配列情報を作成する工程、及び
j)工程eにおける塩基位置を変更し、工程e~iを繰り返すことにより得られた特定領域の全二次配列情報を統合し、クローナル核酸混合物を構成する各核酸分子の特定領域の塩基配列として採用する工程、
を含む。
In a third aspect of the present disclosure, there is provided a third estimation method for estimating a base sequence of a specific region of each nucleic acid molecule constituting a clonal nucleic acid mixture using a computer.
The third estimation method is
a) obtaining a lead sequence derived from each nucleic acid molecule,
b) mapping the read sequence obtained in step a onto a reference sequence to obtain a mapping result;
c) obtaining a clustering result of the lead sequence based on the mapping result obtained in step b,
d) obtaining consensus lead information consisting of a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result,
e) From the sequence information of each consensus lead contained in the set of consensus reads obtained from each nucleic acid molecule constituting the clonal nucleic acid mixture through steps a to d, the base type present at the position corresponding to the specific position on the reference sequence And the process of selecting their reliability score,
f) obtaining a ratio of the number of consensus reads containing each specific base species at the particular position selected in step e to the total number of consensus leads including the particular position used in step e,
g) The base at the specific position confirmed in step e at a set threshold or significance level by the parametric error model using the reliability score obtained in step d corresponding to the base type at the specific position selected in step e Estimating the rate of occurrence due to analysis errors of the species,
h) When the ratio obtained in step f is significantly higher than the appearance rate obtained in step g, the threshold value set in step g is a nucleic acid molecule having a specific base type at a specific position selected in step e. Or judging that the base species at the specific position is unknown if it is present in the clonal nucleic acid mixture at a significant level or lower than the occurrence rate,
obtained by applying i) the result obtained in step h to each consensus lead of the base to create secondary sequence information, and j) changing the base position in step e and repeating steps e to i. Integrating all secondary sequence information of the specific region and adopting it as the base sequence of the specific region of each nucleic acid molecule constituting the clonal nucleic acid mixture,
including.
本開示の第四の態様において、上記第一ないし第三の推定方法のいずれかをコンピュータで実行させるためのコンピュータ読み取り可能なプログラムが提供される。 In a fourth aspect of the present disclosure, there is provided a computer readable program for causing a computer to execute any of the above first to third estimation methods.
本開示の第五の態様において、クローナル核酸混合物を構成する核酸分子の塩基配列における特定位置の特定塩基種の存在を推定するための推定装置が提供される。
推定装置は、
各核酸分子由来のリード配列を取得するリード配列取得部と、
取得したリード配列をマッピング装置に供給し、前記マッピング装置による参照配列上へのマッピング結果を取得するマッピング情報取得部と、
マッピングしたリード配列のクラスタリング結果を得るクラスタリング結果取得部と、
クラスタリング結果を基にコンセンサスリードとコンセンサスリード中の各塩基に対応する信頼性スコアからなるコンセンサスリード情報取得部と、
を備える。
In a fifth aspect of the present disclosure, there is provided an estimation device for estimating the presence of a specific base species at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture.
The estimation device is
A lead sequence acquisition unit for acquiring a lead sequence derived from each nucleic acid molecule;
A mapping information acquisition unit that supplies the acquired lead array to a mapping device and acquires the mapping result on the reference array by the mapping device;
A clustering result acquisition unit for acquiring a clustering result of the mapped lead sequence;
A consensus lead information acquisition unit comprising a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result;
Equipped with
本開示の第六の態様において、クローナル核酸混合物を構成する核酸分子の塩基配列における特定位置の特定塩基種の存在を推定するための推定装置が提供される。
推定装置は、
各核酸分子由来のリード配列を取得するリード配列取得部と、
取得したリード配列をマッピング装置に供給し、前記マッピング装置による参照配列上へのマッピング結果を取得するマッピング情報取得部と、
マッピングしたリード配列のクラスタリング結果を得る第1のクラスタリング結果取得部と、
各クラスタリングに含まれるリード配列を用いて、ローカルアセンブリを実行し、アセンブル配列のクラスタリング結果を得る第2のクラスタリング結果取得部と、
第2のクラスタリング結果取得部のクラスタリング結果を基にコンセンサスリードとコンセンサスリード中の各塩基に対応する信頼性スコアからなるコンセンサスリード情報取得部と、
を備える。
In a sixth aspect of the present disclosure, there is provided an estimation device for estimating the presence of a specific base species at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture.
The estimation device is
A lead sequence acquisition unit for acquiring a lead sequence derived from each nucleic acid molecule;
A mapping information acquisition unit that supplies the acquired lead array to a mapping device and acquires the mapping result on the reference array by the mapping device;
A first clustering result acquisition unit for acquiring a clustering result of the mapped lead sequence;
A second clustering result acquisition unit that executes a local assembly using a lead sequence included in each clustering and obtains a clustering result of the assembly sequence;
A consensus lead information acquisition unit comprising a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result of the second clustering result acquisition unit;
Equipped with
本発明によれば、被験体の身体サンプル由来の一部の核酸上に発生する低頻度な変異を検出でき、塩基配列における塩基種を推定できる。 According to the present invention, it is possible to detect infrequent mutations occurring on a part of nucleic acids derived from a body sample of a subject, and to estimate the base type in the base sequence.
以下、添付の図面を参照しながら本開示の実施の形態を説明する。 Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings.
<用語の定義>
本開示における「クローナル核酸混合物」とは、核酸一分子から生物学的な複製を繰り返すことにより発生した複数の核酸の混合物を指す。このようなクローナル核酸が得られる試料としては、特に生物種による限定も無いが、例えばヒトの生体試料が挙げられ、組織、血液、血漿、血清、尿、唾液、粘膜排出物、痰、便および涙が例示される。また組織のFFPE(ホルマリン固定パラフィン包埋)試料の他、in vitroでの培養による細胞集団であってもよい。FFPE試料由来DNAは、短く断片化されている、ニックやギャップが入っている、シトシンの脱アミノ化(ウラシル化)が生じているなどダメージを受けていることが多い。そのままでは解析不良の一因となるため、NEBNext FFPE DNA Repair Mix(NEB社製)など市販の試薬を用いて該DNAを修復することが望ましい。
<Definition of terms>
The "clonal nucleic acid mixture" in the present disclosure refers to a mixture of multiple nucleic acids generated by repeating biological replication from one nucleic acid molecule. The sample from which such clonal nucleic acid is obtained is not particularly limited by species, but includes, for example, human biological samples, tissue, blood, plasma, serum, urine, saliva, mucus drainage, sputum, stool and Tears are illustrated. In addition to the FFPE (formalin fixed paraffin embedded) sample of the tissue, it may be a cell population by culture in vitro. The FFPE sample-derived DNA is often damaged such as short fragments, nicks and gaps, and deamination (urasylation) of cytosine. Since it directly contributes to poor analysis, it is desirable to repair the DNA using a commercially available reagent such as NEBNext FFPE DNA Repair Mix (manufactured by NEB).
本開示における「リード配列」とはシーケンサーより出力された塩基配列データである。 The "read sequence" in the present disclosure is base sequence data output from a sequencer.
本開示における「低頻度に存在する、特定位置の塩基種」とは、特に限定するものではないが、例えば「低頻度な変異」が挙げられる。ここで、「低頻度な変異」とは、生体内で生じた核酸中の塩基の(突然)変異であって、以下の二つの条件a、bを満たす変異を意図する。
a)核酸分子において、1×10-3/塩基以下の頻度(すなわち、1,000塩基に一つ以下の頻度)で出現する。
b)核酸分子を含む試料において、特定の位置の塩基に異なる配列の塩基を有する核酸分子の割合が試料中の全核酸分子数の10%以下となる。
The “infrequently occurring base species at a specific position” in the present disclosure is not particularly limited, and includes, for example, “infrequent mutations”. Here, the term "low frequency mutation" is intended to mean (a sudden) mutation of a base in a nucleic acid generated in vivo and satisfying the following two conditions a and b.
a) In a nucleic acid molecule, it appears at a frequency of 1 × 10 -3 / base or less (ie, a frequency of 1 or less per 1,000 bases).
b) In a sample containing nucleic acid molecules, the proportion of nucleic acid molecules having different sequences of bases at specific positions is 10% or less of the total number of nucleic acid molecules in the sample.
以下、「低頻度に存在する、特定位置の塩基種」として「低頻度な変異」を例に挙げて本開示を説明するが、本開示は低頻度な変異の検出に何ら限定されるものではない。 Hereinafter, the present disclosure will be described by taking “low frequency mutation” as an example as “base type at a specific position which exists at low frequency” as an example, but the present disclosure is not limited to detection of low frequency mutations. Absent.
本開示における(突然)変異は、置換、挿入及び欠失のいずれでも良い。また、該(突然)変異は、既知であってもよく、新規であってもよい。 The (sudden) mutation in the present disclosure may be any of substitution, insertion and deletion. Also, the (sudden) mutation may be known or novel.
(実施の形態)
1.変異検出のための構成
図1は、本発明の実施の形態に係る塩基配列の推定方法を実施する変異検出システム(推定システムの一例)の構成を示した図である。変異検出システム100はDNA配列における変異を検出する。変異検出システム100は、断片化されたDNAをシーケンサー50で解析し、その解析の結果得られたリード配列データを変異検出装置10(推定装置の一例)に入力する。変異検出装置10はシーケンサー50からリード配列データを取得し、解析してDNA配列における変異を検出する。
Embodiment
1. Configuration for Mutation Detection FIG. 1 is a diagram showing a configuration of a mutation detection system (an example of a estimation system) that implements a method of estimating a base sequence according to an embodiment of the present invention. The
2.変異検出方法
図2は、本実施の形態のDNA配列における新規な変異検出方法の手順を示した図である。図2を参照して、DNA配列における変異検出方法の手順を説明する。
2. Mutation Detection Method FIG. 2 is a diagram showing the procedure of a novel mutation detection method in the DNA sequence of the present embodiment. The procedure of the method for detecting mutations in DNA sequences will be described with reference to FIG.
まず、生体試料からゲノムDNAやRNA(クローナル核酸)を抽出し、シーケンサーに適した長さのDNA(例えば1000base以下)が含まれるように断片化し、断片化クローナル核酸を生成する(S1)。生体試料が、すでに断片化した状態のDNA(cell-free DNA、FFPEサンプルなど)や短いRNA(non-coding RNAなど)の場合、断片化工程は行わなくともよい。 First, genomic DNA and RNA (clonal nucleic acid) are extracted from a biological sample, fragmented so as to contain DNA (for example, 1000 base or less) of a length suitable for a sequencer, and fragmented clonal nucleic acid is generated (S1). If the biological sample is already fragmented DNA (cell-free DNA, FFPE sample, etc.) or short RNA (non-coding RNA, etc.), the fragmentation step may not be performed.
本発明は特に、低頻度な変異を含む可能性のある核酸の解析に好適に使用される。解析したいクローナル核酸は、DNAまたはRNAであり、好ましくはDNAであり、さらに好ましくはゲノムDNAである。 The present invention is particularly suitably used for analysis of nucleic acids which may contain low frequency mutations. The clonal nucleic acid to be analyzed is DNA or RNA, preferably DNA, more preferably genomic DNA.
次世代シーケンサーの機種によっては解析可能な配列長が数百bpと短いものもある。このため、例えば、クローナル核酸がゲノムDNAの場合、シーケンサーで解析可能な長さに断片化する必要がある。断片化の方法に限定は無く、公知の方法で断片化すればよい。本発明を何ら限定するものではないが、例えば、超音波を用いた物理的な切断を行うDNA Shearing System M220/ME220(Covaris社製)、酵素を用いた切断を行うQIAseq FX DNA Library Kit(Qiagen社製)など市販の機器や試薬が使用できる。好適には、超音波を用いた物理的な切断が用いられる。 Depending on the model of the next-generation sequencer, the sequence length that can be analyzed may be as short as several hundred bp. Therefore, for example, when the clonal nucleic acid is genomic DNA, it is necessary to fragment it into a length that can be analyzed by a sequencer. There is no limitation on the method of fragmentation, and fragmentation may be performed by a known method. Although the present invention is not limited in any way, for example, DNA Shearing System M220 / ME220 (manufactured by Covaris), which physically cleaves using ultrasound, QIAseq FX Library Kit (Qiagen), which cleaves using an enzyme. Commercially available instruments and reagents such as those manufactured by Preferably, physical cutting using ultrasound is used.
次に各断片化クローナル核酸に「分子バーコード」と呼ばれる核酸を付加する(S2)。これにより得られたテンプレート核酸はのちにコピーされるが、分子バーコードを付加することにより、コピーがどのクローナル核酸由来かを識別することが可能となる。通常、クローナル核酸を構成する各核酸分子のコピーを作製し、当該コピーのリード配列を得る場合、リード配列の塩基情報のみから本来の核酸分子単位の由来を特定することは不可能である。これに対処するため、該分子バーコードを用いる。本開示において、クローナル核酸に付与される分子バーコードは、一つ以上でユニークであれば良い。また、その配列長や配列パターンは、特に限定されないが、1000種類のクローナル核酸を識別するには最低5塩基長(45=1024通り)が必要であり、通常6~10塩基長の配列で使用される。なお分子バーコードは様々な呼称があり、“unique molecular identifier(UMI)”や“unique molecular tag(UMT)”、または単に“分子タグ(molecular tag)”と呼ばれることもある。クローナル核酸への分子バーコードの付加方法に特に限定は無いが、リガーゼ等の酵素により結合させることで調製することができる。 Next, a nucleic acid called "molecular barcode" is added to each fragmented clonal nucleic acid (S2). The template nucleic acid thus obtained is copied later, but the molecular barcode can be added to identify which clonal nucleic acid the copy is derived from. In general, when a copy of each nucleic acid molecule constituting a clonal nucleic acid is produced and a lead sequence of the copy is obtained, it is impossible to specify the origin of the original nucleic acid molecule unit only from the base information of the lead sequence. To address this, the molecular barcode is used. In the present disclosure, the molecular barcode given to the clonal nucleic acid may be one or more unique. Also, the sequence length and sequence pattern are not particularly limited, but at least 5 bases long (4 5 = 1024) are required to distinguish 1000 types of clonal nucleic acids, and the sequence is usually 6 to 10 bases long. used. The molecular barcode has various names, and may be called "unique molecular identifier (UMI)", "unique molecular tag (UMT)", or simply "molecular tag". There is no particular limitation on the method of adding the molecular barcode to the clonal nucleic acid, but it can be prepared by binding with an enzyme such as ligase.
以下、分子バーコードを付加したクローナル核酸を、テンプレート核酸または解析対象核酸と称することがある。 Hereinafter, a clonal nucleic acid to which a molecular barcode is added may be referred to as a template nucleic acid or a nucleic acid to be analyzed.
テンプレート核酸に分子バーコードの他、PCR(polymerase chain reaction)増幅用プライマーがアリーニング可能な配列を加えておくと後述の核酸増幅操作が行いやすくなる。このようなテンプレート核酸の調製には、何ら限定するものではないが、例えば、市販製品であるThruPLEX(登録商標)Tag-Seq Kit(タカラバイオ社製)を利用することができる。また解析したい核酸領域があらかじめ決まっている場合、当該領域を増幅可能なPCRプライマーと分子バーコードを連結させた核酸を用いることにより、クローナル核酸への分子バーコードの付加と増幅を同時に行うことができる。あるいは、いわゆるプローブキャプチャー法などにより、目的領域を包含するテンプレート核酸を濃縮しても良い。該濃縮操作はコピーを得た後に行っても良い。 It is easy to perform the below-mentioned nucleic acid amplification procedure by adding a sequence that can be aligned with a primer for PCR (polymerase chain reaction) amplification in addition to the molecular barcode to the template nucleic acid. For preparation of such a template nucleic acid, for example, a commercially available product, ThruPLEX (registered trademark) Tag-Seq Kit (manufactured by Takara Bio Inc.) can be used without limitation. In addition, when the nucleic acid region to be analyzed is determined in advance, addition of the molecular barcode to the clonal nucleic acid and amplification can be performed simultaneously by using a nucleic acid in which the PCR primer capable of amplifying the region and the molecular barcode are linked. it can. Alternatively, template nucleic acid including the target region may be concentrated by the so-called probe capture method or the like. The concentration operation may be performed after obtaining a copy.
該分子バーコートはさらに、「ステム(stem)」などと呼ばれる配列を含んでいてもよい。当該ステム配列は、分子バーコードの開始点の目印となる配列である。その配列長や配列パターンは特に限定されないが、例えば、8~10塩基長の配列で使用される。当該ステム配列は、クローナル核酸と分子バーコードに挟まれる位置に配置される。 The molecular bar coat may further contain a sequence called "stem" or the like. The stem sequence is a sequence that marks the start of the molecular barcode. Although the sequence length and sequence pattern are not particularly limited, for example, a sequence of 8 to 10 bases in length is used. The stem sequence is placed at a position between the clonal nucleic acid and the molecular barcode.
クローナル核酸への分子バーコードの付加後、得られたテンプレート核酸を増幅してライブラリーを作製する(S3)。本実施の形態では、シーケンサー50が次世代シーケンサーの場合、配列解析を行うためにはある程度の核酸量を必要とするため、低頻度な変異を含む核酸分子も核酸量を増やす、すなわちコピーを得る必要がある。
After addition of the molecular barcode to the clonal nucleic acid, the resulting template nucleic acid is amplified to create a library (S3). In the present embodiment, when the
テンプレート核酸のコピーの製造法は特に限定しないが、核酸増幅法が挙げられ、操作の簡便さなどからPCRに基づく方法によって行うことが好ましい。このようにして得られたシーケンサーによる解読対象となるサンプルはライブラリーと呼ばれる。 Although the method for producing the copy of the template nucleic acid is not particularly limited, a nucleic acid amplification method may be mentioned, and it is preferable to carry out by a method based on PCR from the viewpoint of ease of operation and the like. The sample to be decoded by the sequencer obtained in this manner is called a library.
このライブラリーのシーケンシングにより得られた塩基配列データを用いてDNA配列の変異が解析される。本実施の形態では、テンプレート核酸の増幅は、PCRに基づく方法によって行うが、これに限定されない。ライブラリーの作製方法は特に限定されない。ライブラリー作製過程において、PCR法やプローブキャプチャー法により解析対象とする領域を濃縮する場合もあるが、実施において濃縮手法に関する制限はない。 Mutations in the DNA sequence are analyzed using the nucleotide sequence data obtained by sequencing this library. In the present embodiment, amplification of the template nucleic acid is performed by a PCR-based method, but is not limited thereto. The method for producing the library is not particularly limited. In the library preparation process, the region to be analyzed may be concentrated by PCR method or probe capture method, but there is no limitation on the concentration method in practice.
ライブラリーをシーケンサー50へ入力し、塩基配列を解析する(S4)。すなわち、リード配列は、公知のシーケンシング法による塩基配列解読法により取得することができる。シーケンサーのプラットホームは特に限定されないが、解析対象がゲノムDNAのように情報量が膨大になる場合には、次世代シーケンサーが望ましい。次世代シーケンサーとして、例えば、HiSeq(登録商標)及びMiSeq(登録商標)(illumina社製)、Ion ProtonTMおよびIon PGMTM(Thermo Fisher Scientific社製)、PacBio(登録商標) RS IIおよびPacBio(登録商標) SequelTM(Pacific Biosciences社製)、MinION、GridION X5、PromethION、およびSmidgION(Oxford Nanopore Technologies社製)などが挙げられ、数千万から数億のDNA断片を同時並行的に処理して塩基配列を決定する装置が好ましい。
The library is input to the
シーケンサー50での解析が終了すると、シーケンサー50からの解析結果を用いてDNA配列における変異が検出される(S5)。この工程(S5)は変異検出装置10により実施される。
When the analysis by the
シーケンサー50から得られる塩基配列データ(リード配列)のファイル形式はFASTQ形式であるが、他の形式でもよい。FASTQ形式は、当該技術において公知の形式であり、図3(A)に示すような形式となる。データの各列の意味は図3(B)に示すとおりである。FASTQ形式のシーケンスデータでは、各塩基のベースコール(塩基の指定)に対する正確さをあらわす指標として、Phredクオリティスコア(Phred quality score)が出力される。本実施の形態では、一塩基あたりのシーケンスによるエラー率をエラー頻度(Perror)で示す。Phredクオリティスコア(a)とエラー頻度(Perror)の関係は次式で表される。
Perror = 10-a/10 (/塩基)
a = ASCII-33
ASCIIコードの値が、FASTQ形式のデータにおいて出力される。
The file format of the base sequence data (read sequence) obtained from the
P error = 10 -a / 10 (/ base)
a = ASCII-33
ASCII code values are output in FASTQ format data.
Phredクオリティスコアは、次世代シーケンサーにより自動的に算出される。エラー頻度への変換式はそれぞれのプラットホームによって異なるが、本実施の形態においては限定される要因はない。 Phred quality score is calculated automatically by the next generation sequencer. Although the conversion formula to error frequency differs depending on each platform, there is no limited factor in this embodiment.
各核酸分子由来のリード配列を取り込む工程において、低品質なリード配列、すなわち、ベースコールの精度(Phredクオリティスコア)が低い塩基を含む配列データ自体を取り除く、いわゆるリードフィルタリング(リードクリーニングともいう)処理を行っても良い。この処理には公知の技術を利用することができ、例えば公知のソフトウエアであるCutadapt(ドルトムント工科大学)やTrimmomatic(アーヘン工科大学)などが挙げられる。 In the step of incorporating a lead sequence derived from each nucleic acid molecule, so-called lead filtering (also referred to as lead cleaning) processing is performed to remove low quality lead sequences, that is, sequence data itself containing bases with low basecall accuracy (Phred quality score). You may A known technique can be used for this process, and examples thereof include known software such as Cutadapt (Dortmund University of Technology) and Trimmomatic (Aachen University of Technology).
また、リード配列の一部分の塩基が低品質の場合、該当する一部分の塩基配列のみを取り除く、いわゆるトリミング(trimming)処理を行ってもよい。トリミング処理は、マスキング(masking)処理と呼ばれることもある。該処理には公知の技術を利用することができ、例えば公知のソフトウエアであるTrimmomaticやsickle(https://github.com/najoshi/sickle)などが挙げられる。 In addition, when the base of a part of the lead sequence is of low quality, so-called trimming processing may be performed to remove only the base sequence of the corresponding part. The trimming process is sometimes called a masking process. A known technique can be used for the treatment, and examples thereof include known software such as Trimmomatic and sickle (https://github.com/najoshi/sickle).
図4(a)(b)は、図2の変異解析処理(S5)のより具体的な手順を示した図である。後で詳しく説明するように、これらのうち図4(a)は、PCRエラーやシーケンスのエラーを考慮したデータ解析による手順を示したものであり、図4(b)は、リードのアライメントを考慮したデータ解析による手順を示したものである。両方の図から明らかなように、図4(b)は、「ローカルアセンブリ」(S17)の工程を含むことのみ、図4(a)と異なる。以下、図4(a)と図4(b)の両方を参照しつつ、変異解析処理(S5)の各工程について具体的に説明する。 FIGS. 4 (a) and 4 (b) are diagrams showing a more specific procedure of the mutation analysis process (S5) of FIG. Among these, FIG. 4 (a) shows the procedure by data analysis in consideration of PCR errors and sequence errors as described in detail later, and FIG. 4 (b) considers lead alignment. It shows the procedure by the data analysis. As is clear from both figures, FIG. 4 (b) differs from FIG. 4 (a) only in that it includes the steps of "local assembly" (S17). Hereinafter, each step of the mutation analysis process (S5) will be specifically described with reference to both FIG. 4 (a) and FIG. 4 (b).
1)リード配列の取得(S11)
まず、シーケンサー50から、リード配列データを取得する(S11)。
1) Acquisition of read sequence (S11)
First, read sequence data is acquired from the sequencer 50 (S11).
2)参照配列へのマッピング(S12)
リード配列はクローナル核酸を構成する各核酸分子由来の様々な断片由来のコピー分子の配列情報の集合体であるため、先ず、参照配列との比較によりリード配列を分類するマッピング処理を行う。ここで、参照配列とは、基準とする配列であり、任意の配列を用いることができる。本発明を何ら限定するものではないが、例えば、NCBI GenBank、DDBJ、UCSC Genome Browserなどに登録されている配列を用いることができる。例えば、解析対象がヒトゲノムDNAの場合、Genome Reference Consortium human build 38(GRCh38)やUCSC human genome 19(hg19)を参照配列とすることができる(これらのバージョンは随時更新されており、必要に応じて適切なバージョンを選択すればよい)。また、参照配列とする領域は、目的に応じて適宜選択することができる。レコードに収録されている配列全長でもよいし、目的とする任意の領域のみであってもよい。該参照配列と後述のコンセンサスリードを比較し、参照配列とは異なる塩基がコンセンサスリードにあった場合、当該塩基は変異の候補となる。
2) Mapping to reference sequence (S12)
Since the lead sequence is a collection of sequence information of copy molecules derived from various fragments derived from each nucleic acid molecule constituting the clonal nucleic acid, first, mapping processing is performed to classify the lead sequence by comparison with a reference sequence. Here, the reference sequence is a reference sequence, and any sequence can be used. Although the present invention is not limited in any way, for example, sequences registered in NCBI GenBank, DDBJ, UCSC Genome Browser, etc. can be used. For example, when the analysis target is human genomic DNA, Genome Reference Consortium human build 38 (GRCh38) or UCSC human genome 19 (hg19) can be used as reference sequences (these versions are updated as needed, and as necessary) Choose the appropriate version). Further, the region to be the reference sequence can be appropriately selected according to the purpose. It may be the entire length of the sequence included in the record, or only the desired region. When the reference sequence is compared with a consensus read described later, and a base different from the reference sequence is in the consensus read, the base is a candidate for mutation.
マッピング技術において、ゲノム配列がすでに解読されているヒトを含む生物種については、そのゲノム配列を参照配列として一般にFASTA形式にて取得可能である。本開示では、テンプレート核酸から解析したリード配列を参照配列にマッピングした後に、比較することで、変異の有無を判定することが可能であり、次世代シーケンスによる解析では、多量のリードシーケンスから変異の有無を検出できる。 In the mapping technology, for biological species including human whose genome sequence has already been decoded, the genome sequence can be generally obtained in FASTA format as a reference sequence. In the present disclosure, it is possible to determine the presence or absence of a mutation by mapping a lead sequence analyzed from a template nucleic acid to a reference sequence, and then comparing it with a next generation sequence. The presence or absence can be detected.
本実施の形態において、上記で取得した参照配列に対してリード配列をマッピングすることになる。特に限定するものではないが、たとえば、BWA(Wellcome Trust Sanger Institute)、Bowtie(ジョンズ・ホプキンズ大学)、DNASTAR(登録商標) SeqMan NGen(DNASTAR社製)、Hisat2(ジョンズ・ホプキンズ大学)、Novoalign(NOVACRAFT社製)等のソフトがマッピング作業に使用できる。 In this embodiment, the read sequence is mapped to the reference sequence acquired above. For example, but not limited to, BWA (Wellcome Trust Sanger Institute), Bowtie (Johns Hopkins University), DNASTAR (registered trademark) SeqMan NGen (made by DNASTAR), Hisat 2 (Johns Hopkins University), Novoalign (NOVACRAFT) Company software etc. can be used for mapping work.
本実施の形態において、マッピングの結果は、図5に示すようなSAM/BAM形式ファイルと呼ばれる公知のファイル形式で出力される。BAM形式ファイルはSAM形式ファイルのバイナリーファイルとなる。マッピングの結果を出力するファイル形式はBAM形式に限定されるものではない。図5(A)は、参照配列へのマッピング(S12)の結果が出力されたSAM/BAM形式ファイルであり、図5(B)は、分子バーコードのクラスタリング(S13)の結果が追加されたSAM/BAM形式ファイルである。SAM/BAM形式データの各列の意味は図5(C)に示すとおりである。 In the present embodiment, the mapping result is output in a known file format called a SAM / BAM format file as shown in FIG. BAM format files are binary files of SAM format files. The file format for outputting the mapping result is not limited to the BAM format. FIG. 5A is a SAM / BAM format file to which the result of mapping to the reference sequence (S12) is output, and FIG. 5B is added with the result of molecular bar code clustering (S13) It is a SAM / BAM format file. The meaning of each column of the SAM / BAM format data is as shown in FIG.
このようにマッピングされたリード配列に対し、次に個々のテンプレート核酸に由来するリード配列を分類するクラスタリング処理を行う。テンプレート核酸に前述の分子バーコードを付加した場合は、リード配列中に存在する分子バーコード領域の配列情報を基に分類することが可能である。なお、以下では、分子バーコードを”UMI”と称することがある。 Next, a clustering process is performed on the thus mapped lead sequences to classify the lead sequences derived from the individual template nucleic acids. When the above-mentioned molecular barcode is added to the template nucleic acid, it is possible to classify based on the sequence information of the molecular barcode region present in the lead sequence. Hereinafter, the molecular barcode may be referred to as "UMI".
ベイズ推定は解析に使用するリード配列の数が多いほど解析精度が高くなるため、得られたリード配列をいかに解析に使用するか、は重要である。
既存の解析手法において、Phredクオリティスコア(Phred quality score)が低いリード配列は解析不良の一因となるため、以降の解析に使用されていない。そのため、大量のシーケンスデータを得ながら、解析に使用できるリード配列が少なくなってしまうという問題があった。
It is important how to use the obtained lead sequence for analysis, as the Bayesian estimation becomes higher in analysis accuracy as the number of lead sequences used for analysis increases.
In the existing analysis methods, a read sequence having a low Phred quality score is not used for the subsequent analysis because it contributes to poor analysis. Therefore, there has been a problem that while obtaining a large amount of sequence data, the number of read sequences available for analysis decreases.
一方、本開示の方法では、Phredクオリティスコア(Phred quality score)が低い箇所の塩基を「N」(塩基種の判別不可な塩基)とし、参照配列と比較する際には「完全一致した塩基」とみなすことによって以降の解析に使用可能とした。 On the other hand, in the method of the present disclosure, “N” (a base where the base type can not be determined) is regarded as “N” (a base having a low Phred quality score) and “completely matched base” when compared with the reference sequence. It could be used for the subsequent analysis by assuming it as
マッピングによって1つのグループとなったとしても、異なる分子バーコードのリード配列が混在していることもある。その場合、分子バーコード配列に基づいたグループ分けを行うことにより、正しいグループへ再編することができる。 Even if the mapping results in one group, different molecular barcode read sequences may be mixed. In that case, by grouping based on molecular barcode sequences, it is possible to reorganize into correct groups.
既存の解析手法において、マッピングされなかったリード配列は解析不良の一因となるため、以降の解析に使用されていない。そのため、大量のシーケンスデータを得ながら、解析に使用できるリード配列が少なくなってしまうという問題があった。 In the existing analysis method, the unaligned read sequence contributes to analysis failure and is not used for the subsequent analysis. Therefore, there has been a problem that while obtaining a large amount of sequence data, the number of read sequences available for analysis decreases.
しかしながら、本発明において、上記マッピングから外れた(マッピングされなかった、ummapped)リード配列は判断保留とし、分子バーコード配列に基づいたグループ分け工程において、作成されたファミリーと配列比較して解析に使用できるか判断する。このようにして、解析に使用可能となるリード配列を増やしている(解析不可とするリード配列を減らしている)。 However, in the present invention, the unmapped (unmapped) lead sequences are left for judgment and used for analysis in comparison with the generated family in the grouping step based on molecular barcode sequences. Determine if you can. In this way, the number of lead sequences that can be used for analysis is increased (the number of non-analyzable lead sequences is reduced).
3)分子バーコード(UMI)のクラスタリング(S13)
本実施の形態では、SAM/BAM形式ファイルを介して、リード配列のマッピング情報(e.g.リード配列、UMI配列、参照配列に対するマッピング位置)を含む一連の情報を取得した後に、各リード配列に付与されたUMI配列を決定して、同じクローナル核酸に由来するリード配列群を一つに取りまとめる、UMIの分類工程(分子バーコードのクラスタリング)を行う。図6(A)は、リード配列を、参照配列上でのマッピング開始点と終了点によりグループ分けした後の状態の一例を示している。図6(A)は、参照配列上でのマッピング位置に基づき分類された3つのグループ(Position family)を例示している。図6(B)は、その後さらに、UMI(分子バーコード)の配列によりグループ分けした状態の一例を示している。図6(B)は、UMI(分子バーコード)の配列による分類された2つのグループ(UMI family)を例示している。
3) Molecular barcode (UMI) clustering (S13)
In the present embodiment, after obtaining a series of information including mapping information (eg, lead sequence, UMI sequence, mapping position relative to reference sequence) of lead sequences via a SAM / BAM format file, each lead sequence is obtained. UMI sequencing step (clustering of molecular barcodes) is performed, in which the UMI sequences assigned to are determined and the lead sequence groups derived from the same clonal nucleic acid are put together. FIG. 6A shows an example of a state after the lead arrangement is grouped by the mapping start point and end point on the reference arrangement. FIG. 6A exemplifies three groups (Position family) classified based on the mapping position on the reference sequence. FIG. 6 (B) shows an example of the state where it is further divided by the sequence of UMI (molecular barcode). FIG. 6 (B) illustrates two groups (UMI family) classified according to the sequence of UMI (molecular barcode).
UMIの分類の工程(分子バーコードの配列によりグループ分け)では、以下の工程a~dを行う。 In the step of classification of UMI (grouping by arrangement of molecular barcodes), the following steps ad are performed.
工程a:参照配列に対するマッピング位置からなる各リード配列の位置に基づき、UMI配列に非依存的にリード配列群を分類する。分類されたグループはPosition familyとして管理する。すなわち、SAM/BAM形式ファイルに記載されている参照配列上におけるマッピング開始点と終了点が同一のリード配列群をPosition familyとして管理する。
工程b:各Position family内のリード配列数を集計する。集計値が閾値より少ないPosition familyにはフラグを立てる。
工程c:各Position family内でUMI配列の類似度に基づきリード配列を分類する。UMI配列の類似度に基づき分類したグループをUMI familyと称する。また、二種類以上のUMIが同一ライブラリー上に存在する場合には、全てのUMI配列を結合させる。具体的には、図6の例では、L-UMIとR-UMIを結合させる。UMI配列の類似度に基づきPosition family内のリード配列群をさらに分類する。類似度の閾値を設定して、ある程度のミスマッチを許容することで、閾値内のミスマッチのUMI同士は同じ配列とする。Position family内のUMI配列の出現数によって、全リード配列数に対して出現数の少ないUMI配列を持つリード配列にはSAM/BAMファイル上でフラグを立てる。
工程d:Position family内のリード配列数の多寡に基づき、近傍に位置するPosition familyに限定して、最終UMI familyの再分類を実施する。すなわち、フラグを立てた、リード数が少ないPosition familyについて、近傍のPosition familyに統合できるかをUMI配列の類似度に基づき再判定する。統合できると判定した場合、フラグを立てた、リード数が少ないPosition familyを近傍のPosition familyに統合する。
Step a: Classify the lead sequence group independently of the UMI sequence based on the position of each lead sequence consisting of the mapping position to the reference sequence. The classified groups are managed as Position family. That is, a read sequence group having the same mapping start point and end point on the reference sequence described in the SAM / BAM format file is managed as a Position family.
Step b: Count the number of lead sequences in each Position family. A flag is set for Position family whose aggregate value is less than the threshold.
Step c: Classify the lead sequences based on the similarity of UMI sequences within each Position family. A group classified based on the similarity of UMI sequences is called UMI family. In addition, when two or more types of UMI exist on the same library, all UMI sequences are combined. Specifically, in the example of FIG. 6, L-UMI and R-UMI are combined. The read sequences in the Position family are further classified based on the similarity of UMI sequences. By setting a similarity threshold and allowing a certain degree of mismatch, mismatched UMIs within the threshold have the same sequence. Depending on the number of occurrences of UMI sequences in the Position family, a read sequence having a UMI sequence having a smaller number of occurrences than the total number of read sequences is flagged on the SAM / BAM file.
Step d: Based on the number of lead sequences in the Position family, reclassification of the final UMI family is performed with limitation to the Position family located in the vicinity. That is, it is re-judged based on the degree of similarity of the UMI arrangement whether the Position family which has set the flag and which has a small number of reads can be integrated into the nearby Position family. If it is determined that integration is possible, the flagged position family with a small number of leads is integrated into the neighboring position family.
本実施形態において、UMI配列が同一かどうかの配列類似度の判定はハミング距離などの編集距離を使用する。しかし、配列類似度の判定はこれに限定するものではなく、例えば、モチーフ出現頻度に基づいた(類似度)ネットワーク解析を想定したZscoreベースのベクトルの類似度や分子系統解析で用いる系統・進化距離といった指標も選択可能である。さらに類似度の判定基準は、UMI配列の長さやテンプレート核酸のコピー数によって変更してもよい。 In this embodiment, determination of sequence similarity as to whether UMI sequences are identical uses editing distance such as Hamming distance. However, determination of sequence similarity is not limited to this, and, for example, similarity of Zscore-based vectors assuming (similarity) network analysis based on motif appearance frequency, phylogenetic distance and evolution distance used in molecular phylogenetic analysis An indicator such as is also selectable. Furthermore, the criteria for determining the degree of similarity may be changed depending on the length of the UMI sequence and the copy number of the template nucleic acid.
次世代シーケンスの作業では、UMI配列自体にもPCR増幅やシーケンスによるエラーが生じる可能性がある。そのため本実施形態では、このエラー発生の可能性に対して、下記の工程が実施可能である。
工程a:シーケンスデータのPhredクオリティスコアが低い塩基については、N(任意の塩基を示す記号)でマスクして、配列類似度の比較対象から除外する。
工程b:UMI配列間のミスマッチを許容するように編集距離の閾値を調整する。すなわち、UMI配列の類似度(ハミング距離等)に基づき、Position family内のリード配列群をさらに分類する。類似度の閾値を設定して、ある程度のミスマッチを許容することで、閾値の条件を満たすミスマッチのUMI同士は同じ配列とする。
In the work of next-generation sequencing, UMI sequences themselves may have errors due to PCR amplification and sequencing. Therefore, in the present embodiment, the following steps can be performed with respect to the possibility of the occurrence of the error.
Step a: Bases with low Phred quality scores in the sequence data are masked with N (a symbol indicating an arbitrary base) and excluded from comparison targets of sequence similarity.
Step b: Adjust the edit distance threshold to allow for a mismatch between UMI sequences. That is, the read sequence group in the Position family is further classified based on the degree of similarity (such as Hamming distance) of the UMI sequence. By setting a similarity threshold and allowing a certain degree of mismatch, mismatched UMIs satisfying the threshold condition have the same sequence.
さらに国際公開第2016/176091号では、PCR増幅及びシーケンス過程において、最初に付与されたUMI配列とは異なる配列に置き換わるUMI置換現象が示唆されている(e.g. UMI Jumping、PCR Jumping [UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy, Tom Smith et al. http://dx.doi.org/ 10.1101/gr.209601.116.]、index swapping [Characterization and remediation of sample index swaps by non-redundant dual indexing on massively parallel sequencing platforms, Maura Costello et al.(http://dx.doi.org/10.1101/200790)]、Characterization and remediation of sample index swaps by non-redundant dual indexing on massively parallel sequencing platforms, Maura Costello et al.(http://dx.doi.org/10.1101/200790)にて報告/示唆済み)。この可能性に対処するため、UMI配列が複数あるようなライブラリー構成の場合には、Position family内の範囲において、少なくとも一つのUMIが完全一致すれば同一のUMIを持つリード配列として分類する工程を含めている。すなわち、複数のUMIが存在するライブラリー構成の場合には、Position family内において、少なくとも一つのUMIが完全に一致すれば同一のUMIを持つリード配列としてまとめる。 Furthermore, WO 2016/176091 suggests a UMI substitution phenomenon that replaces a sequence different from the UMI sequence initially given in the PCR amplification and sequencing process (eg UMI Jumping, PCR Jumping [UMI -Tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy, Tom Smith et al. Http://dx.doi.org/ 10.1101 / gr.209601.116.], Index swapping [Characterization and remediation of ample index swaps by non-redundant dual indexing on massively parallel sequencing platforms, Maura Costello et al. (http://dx.doi.org/10.1101/200790)], Characterization and remediation of sample index swaps by non-redundant dual indexing on massively parallel sequencing platforms, Maura Costello et al. (http://dx.doi.org/10.1101/2007 Report / suggestion already at 0)). In order to cope with this possibility, in the case of a library configuration in which there are a plurality of UMI sequences, if at least one UMI is completely matched within the range within the Position family, the step is classified as a lead sequence having the same UMI. Is included. That is, in the case of a library configuration in which a plurality of UMIs exist, if at least one UMI completely matches in the Position family, they are put together as a read sequence having the same UMI.
本開示を何ら限定するものではないが、例えば、分子バーコードが6塩基からなる場合、一方または両方の分子バーコードにおいて2塩基以下のミスマッチ、または両末端いずれかの分子バーコード領域のリード配列6塩基が一致すればクラスタリングの対象となるリード配列とみなす。これにより、UMI配列自体に存在するエラーの可能性を考慮したクラスタリングが可能となる。 Although the present disclosure is not limited in any way, for example, when the molecular barcode consists of 6 bases, a mismatch of 2 bases or less in one or both molecular barcodes, or a lead sequence of the molecular barcode region at either end If the six bases match, it is regarded as a lead sequence to be subjected to clustering. This enables clustering that takes into account the possibility of errors present in the UMI sequences themselves.
ここで、該UMI配列自体に存在するエラーとは、ミスマッチ塩基、ベースコール不良塩基、マッピングエラーなどを指す。既存の解析手法ではミスマッチ塩基のみ考慮されていた。一方、本発明ではベースコール不良塩基やマッピングエラーなども考慮することにより、解析精度を向上することができた。 Here, the error present in the UMI sequence itself refers to a mismatched base, a base call bad base, a mapping error and the like. In the existing analysis methods, only mismatched bases were considered. On the other hand, in the present invention, the analysis accuracy can be improved by considering the base call defect base, the mapping error and the like.
本発明を何ら限定するものではないが、分子バーコード配列自体に存在するエラーの存在率は、分子バーコード塩基長の50%以下、好ましくは40%以下、さらに好ましくは35%以下、特に好ましくは30%以下であれば、該エラーの可能性を考慮したクラスタリングが可能である。 Although the present invention is not limited at all, the presence rate of errors present in the molecular barcode sequence itself is 50% or less, preferably 40% or less, more preferably 35% or less, particularly preferably the molecular barcode base length. Is less than or equal to 30%, clustering is possible in consideration of the possibility of the error.
分子バーコードがクローナル核酸の両末端に付加される場合、該UMI配列自体に存在するエラーが一方の分子バーコードのみに存在しても、両方の分子バーコードに存在しても、本発明の方法では、当該エラーの可能性を考慮したクラスタリングが可能である。 When molecular barcodes are added to both ends of a clonal nucleic acid, the error present in the UMI sequence itself may be present in only one molecular barcode or in both molecular barcodes according to the invention. In the method, clustering is possible in consideration of the possibility of the error.
本実施の形態において、UMIの分類結果は前述のようにSAM/BAM形式ファイルのデータに追記する。すなわち、図5(B)のSAM/BAM形式(出力形式)に示すように、一列目のリード配列の名称に[染色体番号]-[マッピング開始点]-[マッピング終了点]-[アライメントパターン]-[分類後のUMI配列]-[分類前のUMI配列]という形式としている。さらにBC:Z:sequence:sequence,XD:i:count,XP:i:statusに必要な情報を追加したものをSAM/BAM形式ファイルのオプション情報として出力する(以下の表1参照)。追記の形式はこの形式に限定されるものではない。
本実施の形態において、UMI分類結果を追加したSAM/BAM形式ファイルに基づき変異検出を実施する。 In this embodiment, mutation detection is performed based on a SAM / BAM format file to which UMI classification results are added.
4)ローカルアセンブリ(S17)
このローカルアセンブリの工程は、図4(b)に示す変異解析処理に含まれる工程である。ローカルアセンブリの工程では、参照配列にはない検体特有のリード配列長以上の配列領域を決定する処理が行われる。この配列と参照配列を比較することで、リード配列長以上の配列サイズを有する塩基置換や挿入や欠損が確認できる。なお本実施の形態のローカルアセンブリは、Rimmer et al. (Nature Genetics, 2014) に掲載されたPlatypusのアルゴリズムを参考にコーディングされており、参照配列とリード配列よりtwo-colored de Bruijn Graphを構築してアセンブル配列を構築する。なおローカルアセンブリにて使用可能なオプションおよびその例を、以下の表2に示す。
4) Local assembly (S17)
The step of this local assembly is a step included in the mutation analysis process shown in FIG. 4 (b). In the step of local assembly, processing is performed to determine a sequence region not smaller than the reference sequence but longer than the sample-specific read sequence length. By comparing this sequence with the reference sequence, it is possible to confirm base substitution, insertion or deletion having a sequence size equal to or greater than the length of the lead sequence. The local assembly of this embodiment is described in Rimmer et al. It is coded with reference to Platypus's algorithm published in (Nature Genetics, 2014), and a two-colored de Bruijn Graph is constructed from a reference sequence and a lead sequence to construct an assembly sequence. The options available for local assembly and their examples are shown in Table 2 below.
de Bruijn graphでは、まず塩基配列を長さkの塩基配列(k-mer)に切断して、Node(頂点)とする。さらに各Node(頂点)をEdge(辺)で結ぶ。複数のリード配列もしくは参照配列にてk-merのNode-Edgeを発生させて、同じk-merを有するNodeや同じNodeを合体させたグラフがde Bruijn graphとなる(図7参照)。本実施の形態では、de Bruijn graphのNodeとEdgeに対して、塩基配列の由来(参照配列もしくはリード配列から由来、分子バーコードの有無およびその配列情報)や信頼性スコア(シーケンスのPhredクオリティスコア)から算出される重みを付けることができる。 In de Bruijn graph, the base sequence is first cut into base sequences of length k (k-mer) to make it a Node (apex). Furthermore, each Node (vertex) is connected by Edge (edge). A k-mer Node-Edge is generated with a plurality of read sequences or reference sequences, and a graph in which nodes having the same k-mer and the same Node are combined becomes a de Bruijn graph (see FIG. 7). In this embodiment, the origin of the base sequence (derived from the reference sequence or the lead sequence, presence or absence of the molecular barcode and its sequence information) and the reliability score (the Phred quality score of the sequence) for Node and Edge of de Bruijn graph Can be given a weight calculated from
(4-1)de Bruijn graph(デブルーイングラフ)の作成
本実施の形態において、de Bruijn graphは解析遺伝子の一部領域を対象に、配列長(表2のwindowサイズオプションにより変更可能)を指定した対象領域に相当する参照配列と同領域にマッピングされたリード配列を使用して作成する。さらに使用するリード配列には、同領域にマッピングされたリード配列自体とペアリード(つまり、参照配列にマッピングされなかったunmappedリードまたは対象領域外にマップされたリード)も含まれる。また、本実施の形態では、グラフのEdge(辺)のcolorを、参照配列側を0、リード配列側を1として、両方が存在する部分は0.5と設定してEdge(辺)の由来を区別することが可能である。
(4-1) Creation of de Bruijn graph In this embodiment, de Bruijn graph targets the partial length of the analyzed gene and can change the sequence length (which can be changed by the window size option in Table 2). Create using the read array mapped to the same region as the reference sequence corresponding to the specified target region. Further, the read sequences to be used include the read sequences themselves mapped to the same region and paired reads (that is, unmapped reads not mapped to the reference sequence or reads mapped to the outside of the target region). Also, in this embodiment, the color of the Edge of the graph is set to 0 for the reference array side, 1 for the lead array side, and 0.5 for the portion where both exist is derived from the Edge (edge). It is possible to distinguish.
(4-2)グラフの再構築
本実施の形態において、(例えば、図7の場合、)構築したグラフの5’末端から探索して、環状化していた場合には、k-merの長さを伸長して、再度グラフを作成することができる。
(4-2) Restructuring of Graph In this embodiment (in the case of FIG. 7, for example), the length of the k-mer is searched in the 5 'end of the constructed graph and circularized. Can be stretched to create a graph again.
(4-3)最短経路の検出
本実施の形態において、グラフ上の経路探索に、辺の重みが非負数の場合の単一始点最短経路問題を解くための最良優先探索によるアルゴリズムであるダイクストラ法(優先度付きキュー)を採用する (https://ja.wikipedia.org/wiki/ダイクストラ法 または https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm)。つまり、個々の探索経路がアセンブル配列となり、さらに経路上のEdge(辺)のcolorの違いにより、参照配列特有もしくはリード配列特有のEdgeが判別可能であるため、変異候補を同定することができる。しかしながら、経路探索の方法やアルゴリズムは、上記に限定されるものではない。
(4-3) Detection of Shortest Path In the present embodiment, Dijkstra's algorithm, which is an algorithm based on the best priority search for solving the single-source shortest path problem when the weight of an edge is a non-negative number, in path search on a graph. Adopt (priority queue) (https://en.wikipedia.org/wiki/Dijkstra method or https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm). That is, individual search paths become assemble sequences, and further, Edges specific to the reference sequence or edge specific to the read sequence can be discriminated by the difference in color of Edges on the paths, so that mutation candidates can be identified. However, the route search method and algorithm are not limited to the above.
5)コンセンサスリード(S14)
上述の、特に「3)分子バーコード(UMI)のクラスタリング(S13)」までで述べたように、UMI配列に基づき、PCRエラーやシーケンスのエラーを考慮して、同じUMI配列(同じUMI family)をもつリード配列から共通のリード配列すなわちコンセンサスリードを出力する。当該コンセンサスリードは、単にコンセンサス、コンセンサス配列、またはコンセンサスリード配列と呼ぶこともある。図8(A)は、コンセンサスリードを求めるコンセンサスリード処理の処理工程を説明した図である。(以下、「(a)PCRエラーやシーケンスエラーを考慮したデータ解析」参照)。
一方、ローカルアセンブリを用いた解析では、シーケンスのエラーとリードのマッピング距離を考慮して、上述の、特に「4)ローカルアセンブリ(S17)」にて述べたように、同じUMI配列(同じUMI family)をもつリード配列からアセンブル配列を構築し、上述の「(4-3)最短経路の検出」に基づいて、その中から信頼性が高いアセンブル配列をコンセンサスリードとする。図9(A) は、コンセンサスリードを求める処理工程を説明した図である。(後の「(b)リードのアライメントを考慮したデータ解析(Large Indelの検出)」参照)。
更に、「(a)PCRエラーやシーケンスエラーを考慮したデータ解析」には、別手法もある(後の「(a’)PCRエラーやシーケンスエラーを考慮したデータ解析の別手法」参照)。
5) Consensus lead (S14)
As described above, especially in "3) Molecular Barcode (UMI) clustering (S13)", based on UMI sequences, taking into account PCR errors and sequence errors, the same UMI sequences (same UMI family) A common read sequence, ie, a consensus read, is output from the read sequence having. The consensus lead may be simply referred to as a consensus, a consensus sequence, or a consensus lead sequence. FIG. 8A is a view for explaining the processing steps of the consensus read process for obtaining a consensus lead. (See “(a) Data Analysis Taking Account of PCR Error and Sequence Error” below).
On the other hand, in the analysis using the local assembly, the same UMI sequence (the same UMI family as described in the above-mentioned "4) local assembly (S17)" is taken into consideration, taking into account the sequence error and the mapping distance of the lead. An assemble sequence is constructed from the read sequence having the above-mentioned), and based on the above-mentioned "(4-3) detection of the shortest path", among these, the highly reliable assemble sequence is regarded as a consensus read. FIG. 9A is a view for explaining the processing steps for obtaining a consensus lead. (See “(b) Data analysis considering lead alignment (Large Indel detection)” below).
Furthermore, there is another method in “(a) Data analysis in consideration of PCR error and sequence error” (see “(a ′) Another method of data analysis in consideration of PCR error and sequence error”).
(a)PCRエラーやシーケンスエラーを考慮したデータ解析
本実施の形態では、まず、UMI配列に基づき、PCRエラーやシーケンスエラーを考慮して、同じUMI配列を持つ(すなわち、同じUMI family内の)リード配列から共通のリード配列(コンセンサスリード)を求める。
(A) Data Analysis Considering PCR Error and Sequence Error In the present embodiment, first, based on the UMI sequence, having the same UMI sequence in consideration of PCR error and sequence error (that is, within the same UMI family) The common lead sequence (consensus lead) is determined from the lead sequence.
なお、特定の塩基種が配列解析時のエラーによるものかクローナル核酸中に存在する塩基に由来するものか判断するために、コンセンサスリード及び当該配列中の各塩基に対する信頼性スコア(UQ)を使用する。当該信頼性スコア(UQ)は各UMI family内で各塩基に対して算出され、信頼性スコア(UQ)は次式で表される。
信頼性スコア(UQ)を算出する際、単純な多数決ではなく、統計処理を行う。コンセンサスリードは、市販製品、例えばConnor(https://github.com/umich-brcf-bioinf/Connor)を利用して得ることができる。 When calculating the reliability score (UQ), statistical processing is performed instead of simple majority voting. Consensus leads can be obtained using commercial products such as Connor (https://github.com/umich-brcf-bioinf/Connor).
コンセンサスリード処理の開始前に、以下の工程a、bを行う。
工程a)各position(No = 1,2,3…k)にマッピングされたUMI family(No = 1,2,3…j)を抽出、
工程b)各UMI family内の全リード配列に対して、番号(No = 1,2,3…i)を割り振り、該当position上の塩基種類、各塩基のリード配列数、及びシーケンスのエラー頻度(Perror)のデータを取得する。
Before starting the consensus lead process, the following steps a and b are performed.
Step a) Extract UMI family (No = 1, 2, 3 ... j) mapped to each position (No = 1, 2, 3 ... k),
Step b) Assign numbers (No = 1, 2, 3 ... i) to all the read sequences in each UMI family, select the type of base on the corresponding position, the number of read sequences of each base, and the error frequency of the sequence ( Get the data of P error ).
これらの処理は、UMI分類後のSAM/BAMファイルのデータに対して実施する。図8(B)に示す例の場合、第3染色体の38,930番目の塩基対(bp)にマッピングされた特定の、すなわちj番目のUMI family (No=j)では、リード配列が合計7本(リード配列数No=1…7)存在する。第3染色体の38,930番目の位置の塩基がAとなるリード配列は4本であり、Cとなるリード配列は3本である。それぞれのエラー頻度(Perror)は、SAM/BAM形式ファイルの11列目のASCII情報から入手できる(図5(B),(C)参照)。
These processes are performed on data of SAM / BAM file after UMI classification. In the case of the example shown in FIG. 8 (B), in the specific, ie, the j-th UMI family (No = j) mapped to the 38,930th base pair (bp) of
各UMI family内で各塩基に対する信頼性スコア(UQ)は、特に限定されるものではないが、例えば、ベイズアプローチでは事後分布として推定することができる。具体的には各UMI familyにおいて、該当する場所での対象塩基(each base)毎の確度を推定する。対象塩基は全体集合θの部分集合であり、全4種類の塩基対(A,T,G,C)と、同参照配列上にマッピングされる全UMI familyで検出されるすべての挿入または欠損塩基(other obs. alleles)とから構成される。図8(B)の例では、UMI family(No=j)の該当箇所(chr3:38,930bp)において、A, Cの二種類の塩基が検出されているが、集合θの構成はA,T,G,Cの4種類となる(挿入と欠損塩基は検出されないため含まれない)。 Although the reliability score (UQ) for each base in each UMI family is not particularly limited, for example, in the Bayesian approach, it can be estimated as a posterior distribution. Specifically, in each UMI family, the likelihood for each target base (each base) at the corresponding place is estimated. The target bases are a subset of the whole set θ, and all insertions or deletions detected in all four types of base pairs (A, T, G, C) and all UMI families mapped onto the same reference sequence It is composed of (other obs. Alleles). In the example of FIG. 8B, two kinds of bases A and C are detected at the corresponding place (chr3: 38, 930 bp) of UMI family (No = j), but the configuration of the set θ is A, There are 4 types of T, G, C (insertion and defective bases are not included because they are not detected).
公知のソフトウエアSmCounterの統計モデルは、挿入または欠損塩基に対して検出感度が低い仕様となっている。一方、本発明の統計モデルでは、θにおいて挿入または欠損塩基も考慮することにより、塩基置換に加え、塩基挿入および塩基欠損についても検出感度が高くなった。 The statistical model of the well-known software SmCounter is a specification with low detection sensitivity to insertion or deletion bases. On the other hand, in the statistical model of the present invention, by taking into consideration the insertion or deletion base at θ, the detection sensitivity also increases for base insertion and base deletion in addition to base substitution.
信頼性スコア(UQ)の式において、P(UMIj|each base)は尤度関数であり、P(each base)は対象塩基の出現率を示す事前分布である。事前分布は1/θのような一様分布を採用することができる。しかし、事前データ等からθもしくはP(each base)に関する何らかの情報があれば変更することが可能である。 In the reliability score (UQ) formula, P (UMI j | each base) is a likelihood function, and P (each base) is a prior distribution showing the appearance rate of the target base. The prior distribution can adopt a uniform distribution such as 1 / θ. However, if there is any information about θ or P (each base) from prior data etc., it is possible to change.
尤度関数P(UMIj|each base)は次式で表される。
尤度関数P(UMIj|each base)は、シーケンスエラーとPCR増幅エラーを考慮した上で、対象塩基の事前分布から各UMI familyの構成するリード配列データが生じる確率を表現する。例えば、図8(B)に示す例で検討すると、対象塩基がAの場合(each base = A)の尤度関数は、UMI familyでは本来Aが正しいとした場合に、対象塩基がAとなるリード配列が4本、Cとなるリード配列が3本となる7本のリード配列が生じる尤度P(UMIj|A)を計算する。 The likelihood function P (UMI j | each base) expresses the probability of the generation of the read sequence data of each UMI family from the prior distribution of the target bases, in consideration of the sequence error and the PCR amplification error. For example, considering the example shown in FIG. 8B, when the target base is A (each base = A), the target base is A when the original A is correct in the UMI family. A likelihood P (UMI j | A) is calculated, in which seven lead sequences having four lead sequences and three C lead sequences are generated.
尤度関数では、シーケンスエラー(Serror)とPCR増幅エラーを表現している。尤度関数の式において、第一項がシーケンスエラー(Serror)の頻度を示し、第二項がPCR増幅エラーの発生頻度を示す。図8(B)の例において、対象塩基がAの場合(each base = A)、対象塩基がCとなる3本のリード配列では、シーケンスエラー(A→Cへのエラー)がSerrorの頻度で生じる。一方、対象塩基がAとなる4本のリード配列では(1-Serror)の頻度で正しくシーケンスされたことになり、この条件での尤度を求めることになる。 The likelihood function expresses a sequence error (S error ) and a PCR amplification error. In the likelihood function equation, the first term indicates the frequency of sequence errors (S error ) and the second term indicates the frequency of occurrence of PCR amplification errors. In the example of FIG. 8B, when the target base is A (each base = A), the sequence error (error from A to C) is S error frequency in the case of three read sequences where the target base is C. It occurs in On the other hand, in the case of four read sequences in which the target base is A, the sequence is correctly performed with the frequency of (1-S error ), and the likelihood under this condition is determined.
なお、第一項では、PCR増幅エラーが生じない場合を想定しているため、corCp(PCR増幅エラーが無しとなる確率)を乗算している。さらに、nyは集合θの各構成塩基(Y)のリード配列の数(冗長度)を示している。図8(B)の例では、Y=Aの場合にはnyは4本となる。一方、集合θの内で実際に検出できなかったT及びGについては、nyは0本となり、この第一項の計算は実施されないこととなる。すなわち、PCR増幅エラーによる尤度(第二項)のみが計算される。 In the first paragraph, it is assumed that no PCR amplification error occurs, so corC p (probability of no PCR amplification error) is multiplied. Furthermore, n y indicates the number (redundancy) of the read sequences of each constituent base (Y) of the set θ. In the example of FIG. 8B, in the case of Y = A, n y is four. On the other hand, for T and G which could not actually be detected in the set θ, n y will be zero, and the calculation of the first term will not be performed. That is, only the likelihood (second term) due to PCR amplification error is calculated.
本開示の方法において、シーケンスエラー(Serror)は上述のPhredクオリティスコアから算出されるPerrorに基づき算出している。しかし、挿入・欠損塩基がシーケンスエラーとして生じる可能性は一塩基置換(single nucleotide variant、SNV)より低くなるという報告もある(Marinier E等、BMC Bioinformatics. 2015 Jan 16;16:10. doi: 10.1186/s12859-014-0435-6; Schirmer M等、Nucleic Acids Res. 2015 Mar 31;43(6):e37. doi: 10.1093/nar/gku1341参照)。そのため本実施の形態では、対象塩基(each base)が挿入・欠損塩基の場合には、Perrorに補正値m(1/100)を乗算してエラー率を低く抑えるよう処理を加えて、検出性能を向上させている。
In the method of the present disclosure, the sequence error (S error ) is calculated based on P error calculated from the above-mentioned Phred quality score. However, there is also a report that the possibility that insertion and deletion bases will occur as a sequencing error is lower than single nucleotide substitution (SNV) (Marinier E et al., BMC Bioinformatics. 2015
尤度関数の第二項は、PCR増幅エラーの頻度(misCp)に基づいたコードとなる。misCpは、PCRerror(x)から算出される。
PCRerror(x)は、xサイクルのPCRサイクルで一つの変異が生じる確率と、xサイクルのPCR産物が最終サイクル(n)後のPCR産物に含まれる確率との積により求められる。
PCRerror(x)の算出において、変異が出現するPCRサイクル数xを推定することが重要であるが、本実施の形態では、この推定をUMI family内の各塩基のリード配列数に基づき実施する。図8(B)の例では、Aが対象塩基の場合ではCがPCR増幅エラーによって出現したケースに該当することから、その割合は3/7(43.8%)となる。このリード配列数の割合は、全PCR産物中に含まれるPCR増幅エラーに由来するPCR産物の割合の下記数式で近似的に示される。
Sm(1+λm)n-x/So(1+λw)n
ここで、Smは、PCR増幅エラーの初発核酸コピー数であり、1である。Soは、最初の初発核酸コピー数であり、1である。λmはPCRエラーの核酸産物に対するPCR増幅効率である。λwはPCRエラー無しの核酸産物に対するPCR増幅効率である。
In calculating PCR error (x), it is important to estimate the number of PCR cycles in which a mutation appears, but in this embodiment, this estimation is performed based on the number of lead sequences of each base in the UMI family. . In the example of FIG. 8 (B), since the case where A is a target base corresponds to the case where C appears due to a PCR amplification error, the ratio is 3/7 (43.8%). The proportion of the number of lead sequences is approximately shown by the following formula of the proportion of PCR products derived from PCR amplification errors contained in all PCR products.
Sm (1 + λ m ) n x / So (1 + λ w ) n
Here, Sm is an initial nucleic acid copy number of PCR amplification error and is 1. So is the first initial nucleic acid copy number, which is 1. λ m is the PCR amplification efficiency for the nucleic acid product of PCR error. λ w is the PCR amplification efficiency for nucleic acid products without PCR errors.
PCRエラー産物と正常産物のPCR増幅効率が同じ(λm=λw)であると仮定した場合、PCRサイクル数nは既知であることから、エラーが発生したPCRサイクル数xを推定可能である。
尤度関数P(UMIj|each base)の第二項では、PCR酵素のエラー出現率(μ)とPCR酵素の増幅効率(λ)が重要な設定パラメーターとなる。 In the second term of the likelihood function P (UMI j | each base), the error appearance rate (μ) of the PCR enzyme and the amplification efficiency (λ) of the PCR enzyme become important setting parameters.
本実施の形態では、尤度関数では、PCR増幅エラーとシーケンスエラーという二種類のみを考慮し、互いに排反なイベントとして扱っているが、相互作用の効果(PCR増幅エラーとシーケンスエラーの両方が生じるケースが該当)やその他の要因も式に追加することができる。その他の要因には、マッピングエラー(異なる箇所と違う場所にリード配列がマッピングされるケース)や酸化反応による塩基置換などが含まれる。 In this embodiment, in the likelihood function, only two types of PCR amplification error and sequence error are considered and treated as mutually exclusive events, but the effect of interaction (both PCR amplification error and sequence error Cases that arise may apply) and other factors may also be added to the equation. Other factors include mapping errors (in the case where the read sequence is mapped to different places and different places) and base substitution by oxidation reaction.
上記の工程を経て、各UMI family内で、各位置の塩基に対して信頼性スコア(UQ)が算出される。図8(B)のNo.jのUMI familyを例にすると、38,930番目の塩基に対して、A,T,G,Cのそれぞれに対する信頼性スコアが算出されることとなり、その中で最も信頼性スコアが高い塩基をコンセンサスリードの配列とする。図8(B)の例では、Aに対する信頼性スコアが最も高くなるため、Aをコンセンサスリードの配列とする。本手法はコンセンサスリードに対して信頼性スコアを算出する点が従来法と異なる。 Through the above steps, the reliability score (UQ) is calculated for each base in each UMI family. The numbers in FIG. Taking the UMI family of j as an example, the reliability score for each of A, T, G and C is calculated for the 38,930th base, and the base with the highest reliability score among them is calculated. Sequence of consensus reads. In the example of FIG. 8 (B), since the reliability score for A is the highest, A is taken as the consensus lead sequence. This method is different from the conventional method in that the reliability score is calculated for the consensus lead.
(b)リードのアライメントを考慮したデータ解析(Large Indelの検出)
本実施の形態では、ローカルアセンブリを用いる解析においては、シーケンスのエラーとリードのマッピング距離を考慮して、同じUMI配列(同じUMI family)を持つリード配列から共通のリード配列、即ちコンセンサスリードを求める。図9(A)は、コンセンサスリードを求める処理工程を示した図である。
(B) Data analysis considering lead alignment (Large Indel detection)
In this embodiment, in analysis using a local assembly, a common read sequence, that is, a consensus read is determined from a read sequence having the same UMI sequence (the same UMI family) in consideration of the mapping errors of the sequence and the reads. . FIG. 9A is a diagram showing processing steps for obtaining a consensus lead.
つまり、ローカルアセンブリの最短経路(アセンブル配列、ハプロタイプ候補, H)とそれぞれに由来する変異リストから、10base以上のInsertおよび/またはDeletionを持つ経路を対象に下記の手順に準じて、最大事後確率P{each haplotype|UMIj}を決定し、(後で説明する)変異検出に繋げる。本実施の形態における最大事後確率Pは、前述(a)の信頼性スコア(UQ)に相当する。
上述のそれぞれの変数の説明を、下記に示す。図9(B)を参照されたい。
[数7]の(2)において、前半部分はリードのマッピング(アライメント)距離の考慮部分を、後半部分はシーケンスのエラーの考慮部分を示す。ここで、前半部分の「P{distk|μ,σ}:Insertサイズ(正規分布μ,σ)に対するリード間の距離の確率」は、同じUMI配列(同じUMI Family、UMIj)を有する全リード配列(k∈ReadPair)をハプロタイプの配列に再アラインメントして、リード間の距離(distk)に対する確率を表すものである。なお、本実施の形態における母集団は、全リードの参照配列上のリード間の距離の平均値(μ)と分散(σ2)によって定義される正規分布を、Insertサイズの分布(正規分布)とする。
A description of each of the above mentioned variables is given below. Please refer to FIG. 9 (B).
In (2) of [Equation 7], the first half shows the consideration portion of the mapping (alignment) distance of the lead, and the second half shows the consideration portion of the error of the sequence. Here, “P {dist k | μ, σ}: Probability of the distance between the leads to the insert size (normal distribution μ, σ)” in the first half is all with the same UMI sequence (same UMI Family, UMI j ) The read sequence (k∈ReadPair) is realigned to the sequence of the haplotype to represent the probability for the distance between the reads (dist k ). The population in this embodiment is a normal distribution defined by the mean value (μ) and the variance (σ 2 ) of the distances between the leads on the reference sequence of all the leads, and the distribution of Insert size (normal distribution) I assume.
(b-1)リード配列の抽出
本実施の形態に使用されるリード配列は、参照配列に対するマッピング情報によって選抜することが可能である。例えば、本実施の形態では、SAM/BAM形式のマッピング結果から取得できるリード配列のCigar(図5(C)参照)にてI/D/S/Hを持つリード(参照配列にマッピングされないunmappedリードも含む)を抽出する。さらに同じUMIを持つリードも抽出する(上記条件に合致しない場合も抽出対象となる)。
(B-1) Extraction of Lead Sequence The lead sequence used in the present embodiment can be selected by the mapping information to the reference sequence. For example, in this embodiment, a read having I / D / S / H (unmapped read not mapped to the reference sequence) in Cigar (see FIG. 5C) of the read sequence which can be obtained from the mapping result of the SAM / BAM format Extract). Furthermore, the leads having the same UMI are also extracted (even when the above conditions are not met).
(b-2)再アラインメント
特定UMI(UMIj)の各リードをハプロタイプ(hi)に再アラインメントして、リード間の距離とマッチ/ミスマッチ塩基の情報を入手する。本実施の形態の再アラインメントでは、Smith-Waterman法を採用している。なお、再アラインメントのアルゴリズムは、Smith-Waterman法とは異なるものを採用してもよい。
(B-2) and re-align the leads of the realignment particular UMI (UMI j) haplotype (h i), to obtain the information of the distance matched / mismatched bases between the leads. In the realignment in this embodiment, the Smith-Waterman method is employed. The realignment algorithm may be different from the Smith-Waterman method.
(b-3)最大事後確率P{each haplotype|UMIj}の決定 (MAP推定によるハプロタイプ推定)
特定UMI(UMIj)に対して各ハプロタイプの事後確率を求めて、(全ハプロタイプの中で)最大事後確率を示すハプロタイプ(hi)から特定UMI(UMIj)が生み出されたと仮定する。つまり、P{each haplotype|UMIj}は特定UMI(UMIj)が特定ハプロタイプ(hi)に由来する確率を示す。後続の処理では、最大事後確率に基づき、ポアソン分布によって変異検出を実施する。
(B-3) Determination of maximum a posteriori probability P {each haplotype | UMI j } (haplotype estimation by MAP estimation)
Seeking posterior probability of each haplotype relative to the particular UMI (UMI j), it is assumed that the specific UMI (UMI j) is generated from (among all haplotypes) haplotypes showing a maximum a posteriori (h i). That is, P {each haplotype | UMI j } indicates the probability that a particular UMI (UMI j ) is derived from a particular haplotype (hi). In the subsequent processing, mutation detection is performed by Poisson distribution based on the maximum posterior probability.
(a’)PCRエラーやシーケンスエラーを考慮したデータ解析の別手法
更に、上述の「(a)PCRエラーやシーケンスエラーを考慮したデータ解析」の別手法について説明する。
(A ′) Another Method of Data Analysis Taking Account of PCR Error and Sequence Error Further, another method of the above-mentioned “(a) Data analysis taking into account PCR error and sequence error” will be described.
(a’-1)SNV,MNP,Small Indelsの検出
上述のLarge Indel(上記「(b)リードのアライメントを考慮したデータ解析」参照)と同じく、ベイズ推定によって実施する。P{Ai|UMIk}は、特定UMIがAiというアリルに由来する事後確率を求めることになる(数9)。その際の尤度は、1)PCRエラーが無くベースコールエラーのみ生じる場合、2)PCRエラーが生じてベールコールエラーが無い場合、の二種類の条件から構成されている(数10)。
上述のそれぞれの変数の説明は、以下のようになる。
(a’-2)リードの抽出
該当Positionにマッピングされるリード配列において、該当箇所に変異をもつリードをCigarとNMタグに基づいて選抜する。更に同じUMIのリードも変異の有無にかかわらず全て抽出する。該当箇所の全変異は集合θで表され、A、T、G、Cと該当箇所の全変異(obs Allele)から構成される要素Aiが属する。
(A'-2) Extraction of Leads In the lead sequences mapped to the corresponding Positions, leads having mutations at the corresponding positions are selected based on Cigar and NM tags. Furthermore, the same UMI reads are also extracted regardless of mutations. All mutations in the corresponding part are represented by a set θ, and an element Ai composed of A, T, G, C and all mutations (obs Allele) in the corresponding part belongs.
(a’-3)PCRエラー(Ce)の推定
ある特定のUMIをもつ全10リードについて、7本が“A”で3本が“T”となり、“T”がPCRエラーで生じたと仮定した場合、「Ce」は下記のように算出する。
また、それぞれの値の定義と算出方法は以下のものとした。
PCRエラーを持たないPCR産物の割合(noFq):
n:PCRサイクル数
λ:PCR増幅効率∈[0,1]
μ:PCRエラー率(PCRの1サイクルで1塩基に対して生じるエラー頻度)
G: テンプレートの長さ(本実施の形態では1bpである)
Moreover, the definition and calculation method of each value were as follows.
Percentage of PCR products without PCR errors (noFq):
n: PCR cycle number λ: PCR amplification efficiency ∈ [0, 1]
μ: PCR error rate (error frequency occurring for one base in one cycle of PCR)
G: Template length (1 bp in this embodiment)
PCRエラーを持たないPCR産物数(noPCR):
PCRエラーを持つPCR産物数(ePCR):
PCRエラーを持つPCR産物の割合(eFq):
(a’-4)最大事後確率P{each base|UMIj}の決定
特定UMI(UMIj)に対して各アリルの事後確率を求めて、(全アリルの中で)最大事後確率を示すアリル(each base)から特定UMI(UMIj)が生み出されたと仮定する。つまり、P{each base|UMIj}は特定UMI(UMIj)が特定アリル(each base)に由来する確率を示す。後続の処理では、最大事後確率に基づき、ポアソン分布によって変異検出を実施する。
(A'-4) Determination of maximum a posteriori probability P {each base | UMI j } An allyl showing the largest posterior probability (among all alleles) by calculating the posterior probability of each allyl for a specified UMI (UMI j ) Suppose that a specific UMI (UMI j ) is generated from (each base). That is, P {each base | UMI j } indicates the probability that a particular UMI (UMI j ) is derived from a particular allyl (each base). In the subsequent processing, mutation detection is performed by Poisson distribution based on the maximum posterior probability.
6)エラーモデル構築(S15)
続いて本実施の形態では、全UMI familyから得られたコンセンサスリード及び各塩基に対する信頼性スコア(UQ)を使用して特定位置の塩基種を推定する(図8(A)、図9(A)参照)。
6) Error model construction (S15)
Subsequently, in the present embodiment, a base type at a specific position is estimated using consensus reads obtained from all UMI families and a reliability score (UQ) for each base (FIG. 8 (A), FIG. 9 (A )reference).
該解析を行うために、まずコンセンサスリードの集合より、特定の位置に存在する特定の塩基種を任意に選抜する。ここで「任意」とは「あらゆる場合」、「すべての場合」という意味である。「特定の位置」とは、参照配列とコンセンサスリードの塩基種が異なる位置という意味である。「特定の塩基種」とは、参照配列の塩基と異なるコンセンサスリードの塩基という意味である。 In order to conduct the analysis, first, a specific base type present at a specific position is arbitrarily selected from a set of consensus leads. Here, "arbitrary" means "all cases" and "all cases". The "specific position" means that the base sequence of the reference sequence and the consensus lead is different. The “specific base type” means a base of a consensus lead different from the base of the reference sequence.
なお、クローナル核酸において確認したい位置と塩基種については、任意に設定(選択)すればよい。公知の変異塩基の存在を確認したい場合は、その変異に関する部位および塩基種をあらかじめ選択すればよい。また、未知の変異部位を確認したい場合は、参照配列とコンセンサスリードとの比較で参照配列とは異なる塩基種が存在する位置を抽出しても良い。また、塩基種も塩基置換タイプ(変異としてはSNP)の他、挿入部位に相当する塩基種や欠失相当の塩基種(この場合は該等塩基種無との処理)であっても良い。また、参照配列と異なるコンセンサスリードが存在しない位置については、以降の推定工程を行わなくともよい。 The position and base type to be confirmed in the clonal nucleic acid may be set (selected) arbitrarily. When it is desired to confirm the presence of a known mutant base, the site and base type for the mutation may be selected in advance. In addition, when it is desired to confirm an unknown mutation site, a position where a base type different from the reference sequence is present may be extracted by comparing the reference sequence with the consensus lead. In addition to the base substitution type (SNP as mutation), the base type may also be a base type corresponding to the insertion site or a base type equivalent to deletion (in this case, treatment with no such equivalent base type). In addition, for the position where there is no consensus lead different from the reference sequence, the subsequent estimation step may not be performed.
次いで、塩基種の選抜に用いたコンセンサスリードの総数に対する、特定の位置に特定の各塩基種を含有するコンセンサスリード数の比率を取得する。 Next, the ratio of the number of consensus reads containing each specific base species at a specific position to the total number of consensus reads used for selection of base types is obtained.
一方、信頼性スコアを用いたパラメトリックなエラーモデルにより、設定された閾値(または有意水準)で特定位置において解析エラーによって特定の塩基が出現する出現率を推定する。 On the other hand, a parametric error model using a reliability score estimates the occurrence rate of appearance of a specific base by analysis error at a specific position at a set threshold (or significance level).
次いで、上記のコンセンサスリード数を基にした比率が解析エラーによる出現率より有意に高い場合に、上記閾値(または有意水準)で特定塩基位置における特定塩基種がクローナル核酸混合物中に存在すると推定する。 Then, when the ratio based on the above consensus read number is significantly higher than the appearance rate due to analysis error, it is estimated that the specific base type at the specific base position is present in the clonal nucleic acid mixture at the above threshold (or significant level). .
塩基種推定の基本モデルはポアソン分布によるパラメトリックなエラーモデルを構築して、任意の箇所で塩基種推定を実施する。実施形態によっては、エラーモデルを変更することが可能である。ポアソン分布の他、例えば、多値変数(事前分布はディリクレ分布)や二値変数でのベータ二項分布での検定、教師付きの学習モデルなども、本発明のパラメトリックなエラーモデルとして使用できる。 The basic model of base type estimation builds a parametric error model by Poisson distribution, and performs base type estimation at arbitrary places. Depending on the embodiment, it is possible to change the error model. Besides Poisson distribution, for example, tests with multi-valued variables (prior distribution is Dirichlet distribution), tests with beta binomial distribution with binary variables, supervised learning models, etc. can be used as parametric error models of the present invention.
塩基種推定は、以下の工程a~dを含む。
工程a)対象となる箇所を選択する(図8(B)ではchr3:38,930)。
工程b)該当する箇所にマッピングされた全UMI familyを抽出する(図8(B)では、j個のUMI familyを抽出)。
工程c)各UMI familyの該当箇所におけるコンセンサスとUQのスコアを取得する。
工程d)確認したい塩基種を有する参照配列(A)とは異なるコンセンサスリードとUQのスコアのみを選択する。
工程e)工程a~dを経て得られた情報(選択した塩基種を有するコンセンサスリードのパターンとUQスコア)からエラーモデル(ポアソン分布)を構築する。
Base type estimation includes the following steps a to d.
Step a) Select a target location (chr3: 38, 930 in FIG. 8 (B)).
Step b) Extract all UMI families mapped to the corresponding part (in FIG. 8 (B), extract j UMI families).
Process c) Acquire the score of the consensus and UQ in the corresponding part of each UMI family.
Step d) Select only the consensus lead and UQ score different from the reference sequence (A) having the base type to be confirmed.
Step e) An error model (Poisson distribution) is constructed from the information obtained through steps a to d (pattern of consensus lead having selected base species and UQ score).
ポアソン分布のパラメーターλについては、好ましくは以下のように表現できる。 The parameter λ of the Poisson distribution can preferably be expressed as follows.
ある位置において、m個のUMI familyが得られ、そのうちn個のUMI familyで塩基X(=A,T,G or C)のコンセンサスリードが得られた場合、下記式によりポアソン分布のパラメーターλの期待値E[λ]が算出できる。
例えば、chr3:38,930の箇所で1000個のUMI familyが得られ、そのうち5個のUMI familyでCのコンセンサスリードが得られた場合、下記式によりポアソン分布のパラメーターλが算出できる。
λ=1000*(1-argminP{C|UMI})
ここで、argminP{C|UMI}は、塩基がCとなった5個のコンセンサスリードに対する5個のUQスコアUQ(c)(P{C|UMI})の中の最小値を示す。
For example, when 1000 UMI families are obtained at chr3: 38, 930, and a consensus read of C is obtained for 5 UMI families among them, the parameter λ of Poisson distribution can be calculated according to the following equation.
λ = 1000 * (1−argminP {C | UMI})
Here, argminP {C | UMI} indicates the minimum value among five UQ scores UQ (c) (P {C | UMI}) for five consensus reads where the base is C.
上記のようにしてλを求めることで、ポアソン分布を用いたエラーモデルを構築できる。 By obtaining λ as described above, an error model using a Poisson distribution can be constructed.
7)変異検出(S16)
上記のようにして構築したエラーモデル(ポアソン分布)を利用して変異を検出する。すなわち、エラーモデルから、m個のUMIを取得した場合に、エラーによってn個のコンセンサスリードが得られる確率が推定できる。
7) Mutation detection (S16)
Mutations are detected using the error model (Poisson distribution) constructed as described above. That is, when m UMIs are obtained from the error model, it is possible to estimate the probability that n consensus leads can be obtained by the errors.
例えば、1,000個のUMIを取得した場合に、エラーによって5個のコンセンサスリードが得られる確率が推定できる。ここで、前述の閾値は有意水準とも呼ばれ、一般的には0.05(5%)に設定されるが、必要に応じて0.01(1%)や0.001(0.1%)に設定される。そして、エラーモデルから推定した確率を所定の閾値(1%水準又は5%水準等)と比較する。推定した確率が所定の閾値より低い場合には、エラーによる仮説を棄却することになるため、選択した塩基種であると決定する。これにより、1000分子の中の1分子のみ存在する塩基種の存在の推定が可能となる。したがって、本開示は低頻度な変異の検出に好適に使用できる。 For example, when 1,000 UMIs are obtained, it is possible to estimate the probability that an error will yield 5 consensus leads. Here, the above-mentioned threshold is also called a significance level, and is generally set to 0.05 (5%), but if necessary, 0.01 (1%) or 0.001 (0.1%) Set to). Then, the probability estimated from the error model is compared with a predetermined threshold (such as 1% level or 5% level). If the estimated probability is lower than a predetermined threshold value, the hypothesis due to an error will be rejected, so it is determined to be the selected base type. This makes it possible to estimate the presence of a base type in which only one of 1000 molecules is present. Thus, the present disclosure can be suitably used for detection of low frequency mutations.
実施形態によっては、エラーモデルは適宜変更することが可能である。 Depending on the embodiment, the error model can be changed as appropriate.
前述のとおり推定された特定塩基位置における特定塩基種がクローナル核酸混合物中に存在する比率は、下記計算式によって算出することができる。また、該比率は、変異頻度とも呼ばれる。
比率(変異頻度)=(変異箇所にマッピングされた変異を有するコンセンサスリードのUMI数)/(変異箇所にマッピングされたコンセンサスリードの全UMI数)
The ratio that the specific base species at the specific base position estimated as described above is present in the clonal nucleic acid mixture can be calculated by the following formula. The ratio is also called mutation frequency.
Ratio (mutation frequency) = (number of UMIs of consensus reads with mutations mapped to mutation sites) / (total UMI number of consensus reads mapped to mutations sites)
算出された該比率は、クローナル核酸混合物中に存在する低頻度な変異の割合(存在率)と見なすことができる。 The calculated ratio can be regarded as the percentage (low abundance) of low frequency mutations present in the clonal nucleic acid mixture.
各UMI family内において、解析対象とする特定領域の全塩基に対して、特定塩基位置における特定塩基種がクローナル核酸混合物中に存在すると推定できるか、あるいは特定塩基位置における塩基種が不明であると推定できるか、検証し、当該特定塩基種が存在する情報を集約することにより、当該集約された配列情報はクローナル核酸混合物中に存在する低頻度な変異体の配列とすることができる。即ち、特定塩基位置における特定塩基種がクローナル核酸混合物中に存在すると推定できるか、あるいは特定塩基位置における塩基種が不明であると推定できるか、検証することにより得られる結果を、当該塩基のコンセンサスリードに適用して二次配列情報を作成し、更に塩基位置を変更しつつ上述の処理を繰り返すことにより得られる特定領域の全二次配列情報を統合して、クローナル核酸混合物を構成する各拡散分子の特定領域の塩基配列として採用することができる。なお、参照配列と異なるコンセンサスリードが存在しない(参照配列とすべてのコンセンサスリードが同じ)位置については、特定塩基種は存在しないと判断し、クローナル核酸の当該位置の塩基は参照配列と同一であるとする。 Within each UMI family, it can be presumed that the specific base type at the specific base position can be present in the clonal nucleic acid mixture for all bases in the specific region to be analyzed, or the base type at the specific base position is unknown By verifying whether or not it can be estimated and aggregating information in which the specific base type is present, the aggregated sequence information can be the sequence of low frequency variants present in the clonal nucleic acid mixture. That is, it can be estimated that the specific base type at the specific base position can be estimated to be present in the clonal nucleic acid mixture, or the base type at the specific base position can be estimated unknown, the result obtained by verifying It applies to the read to create secondary sequence information, and further integrates all secondary sequence information of a specific region obtained by repeating the above-mentioned process while changing the base position, and each diffusion constituting the clonal nucleic acid mixture It can be employed as the base sequence of a specific region of the molecule. In the position where there is no consensus read different from the reference sequence (the reference sequence and all the consensus reads are the same), it is judged that the specific base type does not exist, and the base at that position of the clonal nucleic acid is identical to the reference sequence. I assume.
必要に応じて、さらに変異フィルタリングを行うことができる。例えば、P. Flaherty等、Nucleic Acids Res.、第40巻、第1号、e2ページ、2012年(DOI: 10.1093/nar/gkr861)に記載の方法を利用し、正常検体で検出された変異を腫瘍検体で検出された変異から除外してもよい。これにより、腫瘍検体に特異的な変異を抽出(フィルタリング)することができる。 If necessary, mutation filtering can be further performed. For example, P.I. Flaherty et al., Nucleic Acids Res. Vol. 40, No. 1, page e2, 2012 (DOI: 10.1093 / nar / gkr861), using the method described above, the mutations detected in normal samples are excluded from the mutations detected in tumor samples You may Thereby, mutations specific to the tumor sample can be extracted (filtered).
3.実施例
以下、本実施の形態の方法を適用した実施例を説明する。本開示の適用は以下の実施例に限定されるものではない。
3. EXAMPLE Hereinafter, an example to which the method of the present embodiment is applied will be described. The application of the present disclosure is not limited to the following examples.
(1)実施例1:断片化された核酸混合物中の低頻度な変異の検出
核酸混合物中に存在する稀な変異の検出感度を調査するために、EGFR、KRAS、NRAS、PIK3CAの4種のヒト遺伝子群において、野生型遺伝子のコピー、さらに同じ遺伝子に稀な変異が生じている8種類の変異型遺伝子のコピーを異なる混合比で有するヒト由来のクローナル核酸として、MultiplexI cfDNA Reference Standard Set(horizon社製)を用いた。本実施例では混合比として、変異型遺伝子のコピー数が1%(Part No.: HD778)、0.1%(Part No.: HD779)の二種類のクローナル核酸を使用した。
(1) Example 1: Detection of Infrequent Mutations in Fragmented Nucleic Acid Mixture In order to investigate the detection sensitivity of rare mutations present in the nucleic acid mixture, 4 types of EGFR, KRAS, NRAS and PIK3CA are used. Multiplex I cfDNA Reference Standard Set (horizon) as a human-derived clonal nucleic acid having, at different mixing ratios, a copy of a wild-type gene and a copy of eight mutant genes in which rare mutations occur in the same gene in a human gene group. Company company) was used. In this example, two types of clonal nucleic acids, in which the copy number of the mutant gene was 1% (Part No .: HD778) and 0.1% (Part No .: HD779), were used as the mixing ratio.
MultiplexI cfDNA Reference Standard Setは、血漿中の無細胞核酸(cell-free DNA)を模して、人工的に160bp程度に断片化したクローナル核酸の標準品である。 The Multiplex I cfDNA Reference Standard Set is a standard product of clonal nucleic acid artificially fragmented into about 160 bp by resembling cell-free nucleic acid (cell-free DNA) in plasma.
それぞれのクローナル核酸から50ngを用いて、後述の操作を実施した。なお、ヒトゲノムDNAは、約30億塩基対から構成され、1塩基の平均分子量は330となるため、10ngのクローナル核酸には約3000コピーのゲノムDNAが存在する。変異型遺伝子の混合比が1%の場合には、30コピーの変異型遺伝子が存在し、混合比が0.1%の場合には3コピーの変異遺伝子が存在することになる。 The following procedure was performed using 50 ng of each clonal nucleic acid. Since human genomic DNA is composed of about 3 billion base pairs and the average molecular weight of one base is 330, about 3000 copies of genomic DNA are present in 10 ng of clonal nucleic acid. When the mixing ratio of mutant genes is 1%, 30 copies of mutant genes are present, and when the mixing ratio is 0.1%, 3 copies of mutant genes are present.
ThruPLEX Tag-seq 6S (12) Kit(タカラバイオ社製)のプロトコールを使用して、50ngのクローナル核酸の両末端にStem-Loopアダプターをライゲーション法により付加し、テンプレート核酸とした。次いで、得られたテンプレート核酸を鋳型とし、アダプター配列を有するユニバーサルプライマーによりPCR増幅を実施してイルミナ社のシーケンサー用の増幅核酸物(アダプター+テンプレート核酸)を調製した。ThruPLEX技術では、6塩基から成るランダムな配列のUMIがStem-Loopアダプター中に含まれているため、増幅前に個々のテンプレート核酸にUMIを付加することになる(合計2個のUMIが含まれる)。 Using a protocol of ThruPLEX Tag-seq 6S (12) Kit (manufactured by Takara Bio Inc.), a Stem-Loop adapter was added to both ends of 50 ng of clonal nucleic acid by a ligation method to obtain a template nucleic acid. Then, using the obtained template nucleic acid as a template, PCR amplification was performed with a universal primer having an adapter sequence to prepare an amplified nucleic acid product (adapter + template nucleic acid) for Illumina's sequencer. In the ThruPLEX technology, UMI is added to each template nucleic acid before amplification because UMI of random sequence consisting of 6 bases is contained in the Stem-Loop adapter (total 2 UMI is included) ).
対象領域となる8個の遺伝子領域に由来するテンプレート核酸のみをシーケンスするために、カスタムのSureSelectのキャプチャプローブ(アジレント社製)を用いて、増幅核酸物から8個の遺伝子領域を含む特定ターゲット領域を濃縮した。具体的な操作は、ThruPLEX Tag-seq Kitにて推奨されているマニュアルに従って行った。 In order to sequence only the template nucleic acid derived from the eight gene regions that become the target region, using a custom SureSelect capture probe (manufactured by Agilent), a specific target region including eight gene regions from the amplified nucleic acid Was concentrated. The specific operation was performed according to the manual recommended by ThruPLEX Tag-seq Kit.
イルミナ社の次世代シーケンサーHiSeqを使用する超並列シーケンシングアプローチを用いて、各検体に対する両末端100bpのシーケンスを実施して塩基配列データ(リード配列データ)を取得した。 Using a massively parallel sequencing approach using Illumina's next-generation sequencer HiSeq, sequences at both ends 100 bp for each sample were performed to obtain base sequence data (read sequence data).
取得したリード配列データに対して、Trimmomatic(version:0.32)を使用して低品質なリード配列データを除去した。 Low quality read sequence data was removed from the obtained read sequence data using a Trimmomatic (version: 0.32).
リード配列データを、BWA-MEM(version:0.7.15)を用いてヒトの参照配列(hg19)にマッピングを実施した。得られたSAM/BAM形式ファイルより、各リード配列について参照配列に対するマッピング情報とUMIの配列情報を入手した。 The read sequence data was mapped to the human reference sequence (hg19) using BWA-MEM (version: 0.7.15). From the resulting SAM / BAM format file, mapping information for the reference sequence and UMI sequence information were obtained for each read sequence.
上記の手法に従って、マッピングの位置情報とUMIの配列情報に基づいて、特定ターゲット領域に対して分子バーコードのクラスタリングを実施した。分類の条件は下記のとおりである。
条件1: ミスマッチ2塩基以内のものは同一UMIとする。
条件2: 条件1を満たさない場合でも、少なくとも一つのUMI配列が同一であれば、同じUMIとする。
Molecular barcode clustering was performed on a specific target area based on the mapping position information and UMI sequence information according to the method described above. The conditions for classification are as follows.
Condition 1: Two or less mismatched bases are identical UMI.
Condition 2: Even if the
上記の手法に従って、コンセンサスリードの作成を実施した。その後、第3染色体に座乗するPIK3CA遺伝子のエキソン10に該当する178,935,997-178,936,121bpの各塩基について、1)イルミナ社シーケンサーのリード配列に付与されるPhredクオリティスコアから算出されるエラー率Perrorと、2)UQから算出されたエラー率1-UQ(base)を算出して、二種類のエラー率を比較した。その結果を図10に示す。図10では、分子バーコードのクラスタリング後のSAM/BAMファイルを用いて、各箇所でUMI数が最も多い塩基を対象にして上記二種類のエラー率を算出し、Phredスケール(-10log10(エラー率))でプロットした。
Consensus lead generation was performed according to the method described above. Thereafter, the bases of 178, 935, 997-178, 936, 121 bp corresponding to the
図10に示すように、Phredクオリティスコアから算出されたエラー率(図10中の×)は、Phredスケールで30~40の間となっているが、UQに基づいたエラー率(図10中の◆)は50を超える値を示しており、全領域でエラー率が約1/10倍に抑えられた。 As shown in FIG. 10, the error rate (× in FIG. 10) calculated from the Phred quality score is between 30 and 40 on the Phred scale, but the error rate based on UQ (FIG. 10) ◆) indicates a value over 50, and the error rate was reduced to about 1/10 times over the entire area.
本開示の方法にて、8種類の変異型遺伝子が検出できるかどうかを検証した。さらに従来法との比較として、1) LoFreq(version: 2.1.2 http://csb5.github.io/lofreq/)と、2) SmCounter (version: https://github.com/xuchang116/smCounter 170823取得)とを用いて変異検出を実施した。その結果を図11に示す。図11において、グレーのハッチング部の変異は、検出された変異を示し、REFは参照配列のリード配列数を示し、ALTは変異配列を有するリード配列数を示し、AFはリード配列比から算出されるアリル頻度を示す。 It was verified by the method of the present disclosure whether eight mutant genes could be detected. Furthermore, as comparison with the conventional method, 1) LoFreq (version: 2.1.2 http://csb5.github.io/lofreq/) and 2) SmCounter (version: https://github.com/xuchang116/ Mutation detection was performed using smCounter 170823 acquisition). The results are shown in FIG. In FIG. 11, the gray hatched mutations indicate the detected mutations, REF indicates the number of lead sequences of the reference sequence, ALT indicates the number of read sequences having mutant sequences, and AF is calculated from the ratio of read sequences. Allele frequency is shown.
ここで、LoFreqは、論文(Wilm A等、Nucleic Acids Res. 2012 40(22): 11189-11201)で報告された変異検出ソフトであり、Phred quality scoreから算出されたシーケンスエラー率よりベルヌーイ試行を実施して変異を検出するソフトである。図11では、BWAでマッピングしたSAM/BAMファイルに対して、Connor(version: 0.5.1 https://github.com/umich-brcf-bioinf/Connor)を用いてコンセンサスリードを取得した後に変異検出を実施した。 Here, LoFreq is a mutation detection software reported in a paper (Wilm A et al., Nucleic Acids Res. 2012 40 (22): 11189-11201), and Bernoulli trial is performed based on the sequence error rate calculated from Phred quality score. It is software that is implemented to detect mutations. In FIG. 11, after acquiring consensus leads using Connor (version: 0.5.1 https://github.com/umich-brcf-bioinf/Connor) for SAM / BAM files mapped by BWA Mutation detection was performed.
ここで、SmCounterは、ベイズアプローチにより分子バーコードに基づいてエラー補正を実施して変異を検出するソフトである(非特許文献3)。図11では、分子バーコードのクラスタリング(本開示手法)を実施したSAM/BAM形式ファイルを入力ファイルとして、SmCounterを用いて変異検出を行った結果を示した。 Here, SmCounter is software that performs error correction based on molecular barcodes by the Bayesian approach to detect mutations (Non-Patent Document 3). FIG. 11 shows the result of mutation detection using the SmCounter, with the SAM / BAM format file subjected to molecular barcode clustering (the disclosed method) as an input file.
図11に示すとおり、LoFreqでは混合比1.0%の8種の変異中、ハッチングで表示した、EGFR遺伝子のΔ746-750の欠損変異(Deletion)、T790Mの一塩基置換(SNV)、NRAS遺伝子のQ61KおよびA59TのSNVの4種類が検出され、検出感度は50%(4種類/8種類)であった。一方、SmCounterでは混合比1.0%の8種の変異中、ハッチングで表示した、EGFR遺伝子のΔ746-750の欠損変異とV769-D770の挿入変異以外の6種類の変異が検出され、検出感度は75%(6種類/8種類)であった。しかしながら、混合比0.1%の変異にて調査した際には、両ソフトとも8種類のいずれの変異遺伝子も検出できなかった。 As shown in FIG. 11, among the 8 mutations with a 1.0% mixing ratio in LoFreq, the deletion mutation (Deletion) of Δ746-750 of the EGFR gene, the single nucleotide substitution (SNV) of T790M, and the NRAS gene are indicated by hatching. 4 types of SNVs of Q61K and A59T were detected, and the detection sensitivity was 50% (4 types / 8 types). On the other hand, in SmCounter, among 8 types of mutations at a mixing ratio of 1.0%, 6 types of mutations other than insertion mutations of Δ746-750 and V769-D770, which are indicated by hatching, are detected by hatching, detection sensitivity Was 75% (6 types / 8 types). However, when examined with a 0.1% mixing ratio mutation, neither of the eight types of mutant genes could be detected in either software.
図11に記載の通り、本発明の解析コード(図中:「独自パイプライン」)は、ハッチングで表示した、混合比1.0%の8種類の変異型遺伝子を全て検出し、検出感度は100%(8種類/8種類)であった。さらに混合比0.1%の変異中、ハッチングで表示した、EGFR遺伝子のL858RのSNVおよびΔ746-750のDeletion、NRAS遺伝子のG12DおよびA59TのSNV、ならびにPIK3CAのE545KのSNVの5種類の変異を検出し、検出感度は62.5%(5種類/8種類)であった。検出できた該変異のp値はいずれもE-06 (10-6)オーダーであり、閾値(1%水準)より低かった。 As described in FIG. 11, the analysis code of the present invention (in the figure: “unique pipeline”) detects all eight mutant genes with a mixing ratio of 1.0% indicated by hatching, and the detection sensitivity is It was 100% (8 types / 8 types). Furthermore, while the mixing ratio is 0.1%, five kinds of mutations are shown: hatched SNV of EGFR gene L858R and Deletion of Δ746-750, SN12 of NRAS gene G12D and A59T, and SNK of PIK3CA E545K. The detection sensitivity was 62.5% (5 types / 8 types). The p-values of the detected mutations were all on the order of E-06 (10 -6 ), which were lower than the threshold (1% level).
以上、実施例1の結果より、本実施形態の検出手法は従来法より優れた検出感度を示し、さらには従来法では検出できなかった極めて低い変異においても、変異を検出することができた。 As described above, according to the results of Example 1, the detection method of the present embodiment showed a detection sensitivity superior to that of the conventional method, and furthermore, the mutation could be detected even in a very low mutation which could not be detected by the conventional method.
(2)実施例2:核酸混合物中における変異頻度の定量
核酸混合物中に存在する稀な変異の混合比率(変異頻度)を調査するため、19種の遺伝子群において、野生型遺伝子のコピー、さらに同じ遺伝子に稀な変異が生じている35種類の変異型遺伝子のコピーを1.0%-30%の様々な混合比で有するTru-Q7(1.3% Tier)Reference Standard(Part No.: HD734)(horizon社製)をテンプレート核酸に用いた。
(2) Example 2: Quantification of mutation frequency in a nucleic acid mixture In order to investigate the mixing ratio (mutation frequency) of rare mutations present in a nucleic acid mixture, in 19 gene groups, a copy of a wild-type gene, further Tru-Q7 (1.3% Tier) Reference Standard (Part No. :) with copies of 35 mutant genes in which rare mutations occur in the same gene at various mixing ratios of 1.0% -30%. HD734) (manufactured by Horizon) was used as a template nucleic acid.
実施例1と同じく、ThruPLEX Tag-seq 6S (12) Kitを用いてイルミナ社シーケンサー用のライブラリーを作製して、HiSeqによりシーケンスリードを取得した。 As in Example 1, a library for Illumina sequencer was prepared using ThruPLEX Tag-seq 6S (12) Kit, and sequence reads were obtained by HiSeq.
実施例1と同様に変異検出を実施した結果、35種類の変異中34種類の変異を検出できた(検出感度 97.1%)。結果を図12に示した。これらの検出された変異の頻度を分子バーコードのクラスタリング後のSAM/BAMファイルより算出した。算出方法は、下記のとおりである。 As a result of performing mutation detection in the same manner as in Example 1, 34 types of mutations could be detected out of 35 types of mutations (detection sensitivity 97.1%). The results are shown in FIG. The frequency of these detected mutations was calculated from SAM / BAM file after clustering of molecular barcodes. The calculation method is as follows.
変異頻度は以下の式によって算出される。
変異頻度=(変異箇所にマッピングされた変異を有するコンセンサスリードのUMI数)/(変異箇所にマッピングされたコンセンサスリードの全UMI数)
The mutation frequency is calculated by the following equation.
Mutation frequency = (number of UMIs of consensus reads with mutations mapped to mutation sites) / (total UMI number of consensus reads mapped to mutations sites)
上記計算式により算出された変異頻度(Observed Allele Frequency)とDroplet Digital PCRTM(Bio-Rad社製)によって求められた変異頻度(Expected Allele Frequency、製品情報に明記された値)を比較して算出した。結果を図12に示した。その結果、本開示の方法により推定された頻度は、Digital PCRで求められた値と極めて近く相関係数R2は0.9945であった。 Calculated by comparing the mutation frequency calculated by the above equation (the Observed Allele Frequency) and Droplet Digital PCR TM mutation frequency obtained by (manufactured by Bio-Rad) (Expected Allele Frequency, the stated value in the product information) did. The results are shown in FIG. As a result, the frequency estimated by the method of the present disclosure was very close to the value determined by Digital PCR, and the correlation coefficient R2 was 0.9945.
以上、実施例2の結果より、本開示の方法での分子バーコードのクラスタリングにより精度よくUMI分類が実現できた。また、分解が進んだ(断片化された)低品質のゲノムDNAであっても低頻度な変異を高精度に検出することができた。 As described above, according to the result of Example 2, UMI classification can be realized with high accuracy by clustering of molecular barcodes according to the method of the present disclosure. In addition, low-frequency mutations could be detected with high accuracy even with degraded (fragmented) low-quality genomic DNA.
4.変異検出装置
図13は、上述した変異検出方法を実施する変異検出装置10の具体的な内部構成を示したブロック図である。変異検出装置10はパーソナルコンピュータのような情報処理装置で構成される。変異検出装置10は、その全体動作を制御する制御部11と、画面表示を行う表示部13と、ユーザが操作を行う操作部15と、データやプログラムを記憶するデータ記憶部17と、外部機器やネットワークと接続し、通信を行なうインタフェース部18と、制御部11が処理を行うためのメインメモリであるRAM(Random Access Memory)16とを備える。
4. Mutation Detection Device FIG. 13 is a block diagram showing a specific internal configuration of the
表示部13は例えば、液晶ディスプレイや有機ELディスプレイで構成される。操作部15はキーボード、マウス、タッチパネル等を含む。
The
インタフェース部18は、USB、HDMI(登録商標)、SCSI等のインタフェースに準拠した種々の機器(プリンタ、通信装置、入力装置等)を接続可能であり、接続した機器と変異検出装置10間のデータや制御コマンドの通信を可能とする。特に、変異検出装置10はインタフェース部18を介して次世代シーケンサー50からリード配列データを取得する。
The
制御部11は変異検出装置10全体の動作を制御するものであり、プログラムを実行することで所定の機能を実現するCPUやMPUで構成される。制御部11で実行されるプログラムは通信回線や、CD、DVD、メモリカード等の記録媒体を介して提供されることができる。または、制御部11は、所定の機能を実現するように設計された専用のハードウェア回路(FPGA,ASIC等)で構成されてもよい。
The
該プログラムは、例えば、JAVA(登録商標)、C、C++、C#、OBJECTIVE-C、SWIFTなどのコンピュータ言語、またはPERL、PYTHON、BIO-RUBYなどのスクリプト言語など任意の適切な言語を用いて、例えば、従来型又はオブジェクト指向技術を用いて、ソフトウェアコードとして実行することができる。 The program uses, for example, a computer language such as JAVA (registered trademark), C, C ++, C #, OBJECTIVE-C, SWIFT, or a script language such as PERL, PYTHON, BIO-RUBY, or any suitable language. For example, it may be implemented as software code using conventional or object oriented techniques.
該ソフトウェアコードは、RAM16など記憶及び/又は送信のためのコンピュータ読み取り可能な媒体において、一連の命令として記憶することができる。あるいは、コード化され、インターネットを含む種々のプロトコールに適合する有線、光、及び/又は無線通信回線を介して送信するために適合されたキャリア信号を用いて送信することができる。
The software code may be stored as a sequence of instructions on a computer readable medium for storage and / or transmission, such as
データ記憶部17はデータやプログラムを記憶する装置であり、例えばハードディスク(HDD)、SSD(SOLID STATE DRIVE)、半導体メモリ素子、光ディスクで構成することができる。データ記憶部21は、制御部11で実行される制御プログラムや、リード配列データ等を格納する。
The
図14(a)(b)は、図13の変異検出装置10の制御部11により実行される処理を示すフローチャートである。このうち図14(a)のフローチャートは、PCRエラーやシーケンスのエラーを考慮したデータ解析に係る処理を示している。図14(b)のフローチャートは、リードのアライメントを考慮したデータ解析に係る処理を示している。
FIGS. 14A and 14B are flowcharts showing processing executed by the
まず、図14(a)に示すPCRエラーやシーケンスエラーを考慮したデータ解析に係る処理を説明する。 First, processing relating to data analysis in consideration of PCR errors and sequence errors shown in FIG. 14A will be described.
変異検出装置10はインタフェース部18を介してシーケンサー50からリード配列データを取得する(S51)。変異検出装置10の制御部11は、各リード配列データについて参照配列へのマッピング処理を行う(S52)。制御部11は、上述の方法にしたがい参照配列上の位置及びUMIに基づきリード配列データの分類を行う(S53)。その後、制御部11は、コンセンサスリード処理を行い、各UMI familyについてコンセンサスリードを作製する(S54)。
The
制御部11はエラーモデルを構築する(S55)。具体的には、制御部11は、信頼性スコアUQを計算し、コンセンサスリードのパターンと信頼性スコアUQからエラーモデル(ポアソン分布)のパラメーターλを求めて、エラーモデルを構築する。
The
そして、制御部11は、エラーモデルを参照して変異を検出する(S56)。具体的には、制御部11は、変異を確認したい位置と塩基種を選択し、エラーモデルを参照してその変異の発生確率を推定する。そして、制御部11は、推定値が閾値より小さいときは、変異が発生していないと判断し、選択した塩基種が正しいと判断する。図14(a)に示すフローチャートでは、小さいサイズの変異が検出され得る。
Then, the
任意の工程として、必要であれば、制御部11は変異フィルタリングを行い(S57)、所定の条件を満たす変異を変異検出結果から除外する。
As an optional step, if necessary, the
次に、図14(b)に示す、リードのアライメントを考慮したデータ解析に係る処理を説明する。変異検出装置10はインタフェース部18を介してシーケンサー50からリード配列データを取得する(S51)。変異検出装置10の制御部11は、各リード配列データについて参照配列へのマッピング処理を行う(S52)。制御部11は、上述の方法にしたがい参照配列上の位置及びUMIに基づきリード配列データの分類を行う(S53)。
Next, processing related to data analysis in consideration of lead alignment shown in FIG. 14B will be described. The
図14(b)に示すフローチャート(処理)では、「分子バーコードのクラスタリング(S53)」と(後続の)「コンセンサスリード処理(S54)」との間に、制御部11は、参照配列にはない検体特有のリード配列長以上の配列領域を決定するローカルアセンブリ処理(S58)を行う。
In the flowchart (process) shown in FIG. 14B, the
その後、制御部11は、コンセンサスリード処理を行い、各UMI familyについてコンセンサスリードを作製する(S54)。次に、制御部11はエラーモデルを構築する(S55)。具体的には、制御部11は、信頼性スコアUQを計算し、コンセンサスリードのパターンと信頼性スコアUQからエラーモデル(ポアソン分布)のパラメーターλを求めて、エラーモデルを構築する。更に、制御部11は、エラーモデルを参照して変異を検出する(S56)。具体的には、制御部11は、変異を確認したい位置と塩基種を選択し、エラーモデルを参照してその変異の発生確率を推定する。そして、制御部11は、推定値が閾値より小さいときは、変異が発生していないと判断し、選択した塩基種が正しいと判断する。更に必要であれば、制御部11は変異フィルタリングを行い(S57)、所定の条件を満たす変異を変異検出結果から除外する。
Thereafter, the
図14(b)に示すフローチャート(処理)では、「分子バーコードのクラスタリング(S53)」と(後続の)「コンセンサスリード処理(S54)」との間に、「ローカルアセンブリ処理(S58)」が存在すること以外は、図14(a)の処理と同様である。但し、図14(b)に示すフローチャートでは、大きいサイズの変異が検出され得る。 In the flowchart (process) shown in FIG. 14 (b), the “local assembly process (S58)” is between the “molecular bar code clustering (S53)” and the (following) “consensus read process (S54)”. The processing is the same as the processing of FIG. However, in the flow chart shown in FIG. 14 (b), mutations of large size can be detected.
なお、図15のフローチャートに示すように、
・リード配列処理(S51)、
・参照配列のマッピング(S52)、及び、
・分子バーコードのクラスタリング(UMIの分類)(S53)
の処理以降に、
・コンセンサスリード処理(S54a)、
・エラーモデル構築(S55a)、及び、
・変異検出(小さい変異)(S56a)
の、小さい変異の検出のための処理フローと、
・ローカルアセンブリ処理(S58)
・コンセンサスリード処理(S54b)、
・エラーモデル構築(S55b)、及び、
・変異検出(大きい変異)(S56b)
の、大きい変異の検出のための処理フローとを並行して行い、その後、「小さい変異」の検出と「大きい変異」の検出との和集合の出力データを作成する(「和集合処理(S56c)」、という処理の流れであってもよい。
As shown in the flow chart of FIG.
Read arrangement processing (S51),
Mapping of reference sequence (S52), and
· Molecular bar code clustering (classification of UMI) (S53)
After the processing of
Consensus lead processing (S54a),
Error model construction (S55a), and
· Mutation detection (small mutation) (S56a)
Process flow for the detection of small mutations,
・ Local assembly processing (S58)
Consensus lead processing (S 54 b),
Error model construction (S55 b), and
-Mutation detection (large mutation) (S56b)
And the processing flow for the detection of large mutations in parallel, and then output data of the union of the detection of “small mutation” and the detection of “large mutation” is created (“combination processing (S 56 c "," May be a process flow.
以上のようにして変異検出装置10は、シーケンサー50からのリード配列のデータに基づきDNA配列の変異を検出することができる。
As described above, the
(本開示)
上記の実施の形態は以下の技術思想を開示している。
(This disclosure)
The above embodiments disclose the following technical ideas.
(1)コンピュータを用いて、クローナル核酸混合物を構成する核酸分子の塩基配列における特定位置に存在する特定塩基種の存在を推定する方法であって、
a)各核酸分子に由来するリード配列を取得する工程、
b)工程aで取得したリード配列を参照配列上にマッピングしマッピング結果を取得する工程、
c)工程bで得られたマッピング結果を基にリード配列のクラスタリング結果を取得する工程、
d)クラスタリング結果を基にコンセンサスリードとコンセンサスリード中の各塩基に対応する信頼性スコアからなるコンセンサスリード情報を取得する工程、
e)クローナル核酸混合物を構成する各核酸分子から工程a~dを経て得られるコンセンサスリードの集合に含まれる各コンセンサスリードの配列情報より、参照配列上の特定位置に対応する位置に存在する塩基種とその信頼性スコアを選抜する工程、
f)工程eで使用した特定位置を含むコンセンサスリードの総数に対する、工程eで選抜した特定位置に特定の各塩基種を含有するコンセンサスリード数の比率を取得する工程、
g)工程dで取得した信頼性スコアを用いたパラメトリックなエラーモデルにより、設定された閾値または有意水準での工程eで選抜した特定位置において解析エラーによって特定の塩基が出現する出現率を推定する工程、
h)工程fで得られた比率が工程gにより得られた出現率より有意に高い場合に、工程eで選抜した特定位置に特定の塩基種を有する核酸分子が、工程gで設定された閾値または有意水準でクローナル核酸混合物中に存在すると推定する工程、
を含む方法。
(1) A method of using a computer to estimate the presence of a specific base type present at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture,
a) obtaining a lead sequence derived from each nucleic acid molecule,
b) mapping the read sequence obtained in step a onto a reference sequence to obtain a mapping result;
c) obtaining a clustering result of the lead sequence based on the mapping result obtained in step b,
d) obtaining consensus lead information consisting of a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result,
e) From the sequence information of each consensus lead contained in the set of consensus reads obtained from each nucleic acid molecule constituting the clonal nucleic acid mixture through steps a to d, the base type present at the position corresponding to the specific position on the reference sequence And the process of selecting their reliability score,
f) obtaining a ratio of the number of consensus reads containing each specific base species at the particular position selected in step e to the total number of consensus leads including the particular position used in step e,
g) Estimate the occurrence rate of appearance of a specific base due to an analysis error at a specified position selected in step e at a set threshold or significance level by a parametric error model using the reliability score obtained in step d Process,
h) When the ratio obtained in step f is significantly higher than the appearance rate obtained in step g, the threshold value set in step g is a nucleic acid molecule having a specific base type at a specific position selected in step e. Or estimating at a significant level as present in the clonal nucleic acid mixture,
Method including.
(2)コンピュータを用いて、クローナル核酸混合物を構成する核酸分子の塩基配列における特定位置に存在する特定塩基種がクローナル核酸混合物中で存在する割合を推定する方法であって、
a)各核酸分子に由来するリード配列を取得する工程、
b)工程aで取得したリード配列を参照配列上にマッピングしマッピング結果を取得する工程、
c)工程bで得られたマッピング結果を基にリード配列のクラスタリング結果を取得する工程、
d)クラスタリング結果を基にコンセンサスリードとコンセンサスリード中の各塩基に対応する信頼性スコアからなるコンセンサスリード情報を取得する工程、
e)クローナル核酸混合物を構成する各核酸分子から工程a~dを経て得られるコンセンサスリードの集合に含まれる各コンセンサスリードの配列情報より、参照配列上の特定位置に対応する位置に存在する塩基種とその信頼性スコアを選抜する工程、
f)工程eで使用した特定位置を含むコンセンサスリードの総数に対する、工程eで選抜した特定位置に特有の各塩基種を含有するコンセンサスリード数の比率を取得する工程、
g)工程dで取得した信頼性スコアを用いたパラメトリックなエラーモデルにより、設定された閾値または有意水準での工程eで選抜した特定位置において解析エラーによって特有の塩基が出現する出現率を推定する工程、
h)工程fで得られた比率が工程gにより得られた出現率より有意に高い場合に、工程eで選抜した特定位置に特定の塩基種を有する核酸分子が、工程gで設定された閾値または有意水準でクローナル核酸混合物中に存在すると推定する工程、及び
i)工程hにおいて工程eで選抜した塩基種が存在すると判断された場合、工程fで得られた特定位置において塩基種を含有するコンセンサスリードの数から、工程gで設定された閾値または有意水準における工程eで選抜した特定位置における特定塩基種の比率を算出し、クローナル核酸混合物中で存在する割合として採用する工程、
を含む方法。
(2) A method of using a computer to estimate the proportion of a specific base species present at a specific position in a base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture in the clonal nucleic acid mixture,
a) obtaining a lead sequence derived from each nucleic acid molecule,
b) mapping the read sequence obtained in step a onto a reference sequence to obtain a mapping result;
c) obtaining a clustering result of the lead sequence based on the mapping result obtained in step b,
d) obtaining consensus lead information consisting of a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result,
e) From the sequence information of each consensus lead contained in the set of consensus reads obtained from each nucleic acid molecule constituting the clonal nucleic acid mixture through steps a to d, the base type present at the position corresponding to the specific position on the reference sequence And the process of selecting their reliability score,
f) obtaining a ratio of the number of consensus reads containing each base species specific to the particular position selected in step e to the total number of consensus leads including the particular position used in step e,
g) Estimate the occurrence rate of a distinctive base due to an analysis error at a specific position selected in step e at a set threshold or significance level, using a parametric error model using the reliability score obtained in step d Process,
h) When the ratio obtained in step f is significantly higher than the appearance rate obtained in step g, the threshold value set in step g is a nucleic acid molecule having a specific base type at a specific position selected in step e. Or a step of presuming presence in clonal nucleic acid mixture at a significant level, and i) containing the base species at the specific position obtained in step f, when it is judged that the base species selected in step e is present in step h Calculating the ratio of the specific base species at the specific position selected in step e at the threshold or significance level set in step g from the number of consensus leads, and adopting it as the ratio present in the clonal nucleic acid mixture,
Method including.
(3)コンピュータを用いて、クローナル核酸混合物を構成する各核酸分子の特定領域の塩基配列推定方法であって、
a)各核酸分子に由来するリード配列を取得する工程、
b)工程aで取得したリード配列を参照配列上にマッピングしマッピング結果を取得する工程、
c)工程bで得られたマッピング結果を基にリード配列のクラスタリング結果を取得する工程、
d)クラスタリング結果を基にコンセンサスリードとコンセンサスリード中の各塩基に対応する信頼性スコアからなるコンセンサスリード情報を取得する工程、
e)クローナル核酸混合物を構成する各核酸分子から工程a~dを経て得られるコンセンサスリードの集合に含まれる各コンセンサスリードの配列情報より、参照配列上の特定位置に対応する位置に存在する塩基種とその信頼性スコアを選抜する工程、
f)工程eで使用した特定位置を含むコンセンサスリードの総数に対する、工程eで選抜した特定位置に特定の各塩基種を含有するコンセンサスリード数の比率を取得する工程、
g)工程eで選抜した特定位置の塩基種に対応する工程dで取得した信頼性スコアを用いたパラメトリックなエラーモデルにより、設定された閾値または有意水準での工程eで確認した特定位置における塩基種の解析エラーによる出現率を推定する工程、
h)工程fで得られた比率が工程gにより得られた出現率より有意に高い場合に、工程eで選抜した特定位置に特定の塩基種を有する核酸分子が、工程gで設定された閾値または有意水準でクローナル核酸混合物中に存在する、あるいは該出現率より低い場合は特定位置の塩基種は不明、と判断する工程、
i)工程hで得られた結果を該塩基の各コンセンサスリードに適用し二次配列情報を作成する工程、及び
j)工程eにおける塩基位置を変更し、工程e~iを繰り返すことにより得られた特定領域の全二次配列情報を統合し、クローナル核酸混合物を構成する各核酸分子の特定領域の塩基配列として採用する工程、
を含む方法。
(3) A method of estimating the base sequence of a specific region of each nucleic acid molecule constituting a clonal nucleic acid mixture, using a computer,
a) obtaining a lead sequence derived from each nucleic acid molecule,
b) mapping the read sequence obtained in step a onto a reference sequence to obtain a mapping result;
c) obtaining a clustering result of the lead sequence based on the mapping result obtained in step b,
d) obtaining consensus lead information comprising a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result,
e) From the sequence information of each consensus lead contained in the set of consensus reads obtained from each nucleic acid molecule constituting the clonal nucleic acid mixture through steps a to d, the base type present at the position corresponding to the specific position on the reference sequence And the process of selecting their reliability score,
f) obtaining a ratio of the number of consensus reads containing each specific base species at the particular position selected in step e to the total number of consensus leads including the particular position used in step e,
g) The base at the specific position confirmed in step e at a set threshold or significance level by the parametric error model using the reliability score obtained in step d corresponding to the base type at the specific position selected in step e Estimating the rate of occurrence due to analysis errors of the species,
h) When the ratio obtained in step f is significantly higher than the appearance rate obtained in step g, the threshold value set in step g is a nucleic acid molecule having a specific base type at a specific position selected in step e. Or judging that the base species at the specific position is unknown if it is present in the clonal nucleic acid mixture at a significant level or lower than the occurrence rate,
obtained by applying i) the result obtained in step h to each consensus lead of the base to create secondary sequence information, and j) changing the base position in step e and repeating steps e to i. Integrating all secondary sequence information of the specific region and adopting it as the base sequence of the specific region of each nucleic acid molecule constituting the clonal nucleic acid mixture,
Method including.
(4)(3)の方法において、工程bで採用した参照配列とコンセンサスリード上の塩基配列が同じである場合は工程e~工程hを経ずにそのまま工程iで作成する二次情報の塩基種として採用してもよい。 (4) In the method of (3), when the reference sequence adopted in step b and the base sequence on the consensus lead are the same, the base of the secondary information prepared in step i as it is without passing through steps e to h You may adopt as a seed.
(5)(1)~(3)のいずれかの方法において、工程eで選抜する特定の塩基位置が、工程bで採用した参照配列とコンセンサスリード上の塩基種が異なる位置であってもよい。 (5) In the method according to any one of (1) to (3), the specific base position selected in step e may be a position different from the reference sequence adopted in step b and the base type on the consensus read .
(6)(1)~(3)のいずれかの方法において、核酸分子が少なくとも一方の末端に分子バーコードを有してもよい。 (6) In the method of any of (1) to (3), the nucleic acid molecule may have a molecular barcode at at least one end.
(7)(1)~(6)のいずれかの方法において、各核酸分子が異なる分子バーコードを有してもよい。 (7) In any of the methods (1) to (6), each nucleic acid molecule may have a different molecular barcode.
(8)(7)の方法において、各核酸分子が一つ以上の異なる分子バーコードを有してもよい。 In the method of (8) (7), each nucleic acid molecule may have one or more different molecular barcodes.
(9)(1)~(8)のいずれかの方法において、工程cにおけるクラスタリングをリード配列における分子バーコード領域の配列を指標に実施してもよい。 (9) In any of the methods (1) to (8), the clustering in step c may be performed using the arrangement of the molecular barcode region in the lead sequence as an index.
(10)(9)の方法において、分子バーコードがランダムな並びの塩基配列であってもよい。 (10) In the method of (9), the molecular barcode may be a random sequence of base sequences.
(11)(10)の方法において、分子バーコード領域のリード配列におけるミスマッチが、該分子バーコード領域のリード配列間の配列類似度が許容範囲内であってもよい。 (11) In the method of (10), the mismatch in the lead sequence of the molecular barcode region may be within the allowable range of the sequence similarity between the lead sequences of the molecular barcode region.
(12)(1)~(11)のいずれかの方法において、信頼性スコアが、クラスタリングされたリード配列の情報から、ベイズアプローチによって取得されてもよい。 (12) In any of the methods (1) to (11), the reliability score may be obtained by the Bayesian approach from the information on the clustered lead sequences.
(13)(12)の方法において、リード配列が核酸分子の増幅産物のリード配列であってもよい。 (13) In the method of (12), the lead sequence may be a lead sequence of the amplification product of the nucleic acid molecule.
(14)(13)の方法において、増幅物がポリメラーゼ連鎖反応により得られてもよい。 In the method of (14) (13), an amplification product may be obtained by polymerase chain reaction.
(15)(13)または(14)の方法において、ベイズアプローチの尤度関数はシーケンサーエラーと増幅エラーを考慮した関数であってもよい。 (15) In the method of (13) or (14), the likelihood function of the Bayesian approach may be a function taking into account sequencer errors and amplification errors.
(16)(15)の方法において、工程bにおけるマッピング工程において参照配列に対しリード配列で異なる塩基が挿入または欠失に相当すると判断された場合、シーケンサーエラーに関し更に補正を加えた関数を採用してもよい。 (16) In the method of (15), if it is determined that different bases in the lead sequence to the reference sequence correspond to insertions or deletions in the mapping step in step b, a function with further correction for sequencer error is adopted. May be
(17)(1)~(3)のいずれかの方法において、工程gのパラメトリックなエラーモデルがポアソン分布に基づいてもよい。 (17) In any of the methods (1) to (3), the parametric error model of step g may be based on a Poisson distribution.
(18)(1)または(2)の方法において、特定位置の特定塩基種が低頻度な変異の対象塩基であってもよい。 (18) In the method of (1) or (2), the specific base type at the specific position may be a target base of low frequency mutation.
(19)(1)~(3)のいずれかの方法において、工程cから工程dへ直接進んでもよい。 (19) In the method of any one of (1) to (3), step c may proceed directly to step d.
(20)(1)~(3)のいずれかの方法において、工程cの次に、
k)工程cで得られた各クラスタリングに含まれるリード配列を用いて、ローカルアセンブリを実行し、アセンブル配列のクラスタリング結果を取得する工程、
を行ってもよい。
(20) In the method of any of (1) to (3), after the step c,
k) performing local assembly using the read sequences included in each clustering obtained in step c, and obtaining clustering results of the assemble sequences;
You may
(21)(19)及び(20)の方法によって得られた結果を基に、クローナル核酸混合物を構成する拡散分子の塩基配列における特定位置に存在する特定塩基種を決定する工程を含む方法であってもよい。 (21) A method comprising the step of determining a specific base species present at a specific position in the base sequence of the diffusion molecule constituting the clonal nucleic acid mixture based on the results obtained by the methods of (19) and (20). May be
(22)(1)ないし(21)のいずれかの方法をコンピュータで実行させるためのコンピュータ読み取り可能なプログラム。 (22) A computer readable program for causing a computer to execute the method according to any one of (1) to (21).
(23)クローナル核酸混合物を構成する核酸分子の塩基配列における特定位置の特定塩基種の存在を推定するための装置であって、
各核酸分子由来のリード配列を取得するリード配列取得部と、
取得したリード配列をマッピング装置に供給し、前記マッピング装置による参照配列上へのマッピング結果を取得するマッピング情報取得部と、
マッピングしたリード配列のクラスタリング結果を得るクラスタリング結果取得部と、
クラスタリング結果を基にコンセンサスリードとコンセンサスリード中の各塩基に対応する信頼性スコアからなるコンセンサスリード情報取得部と、
を備えた、推定装置。
(23) An apparatus for estimating the presence of a specific base type at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture,
A lead sequence acquisition unit for acquiring a lead sequence derived from each nucleic acid molecule;
A mapping information acquisition unit that supplies the acquired lead array to a mapping device and acquires the mapping result on the reference array by the mapping device;
A clustering result acquisition unit for acquiring a clustering result of the mapped lead sequence;
A consensus lead information acquisition unit comprising a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result;
An estimation device equipped with
(24)クローナル核酸混合物を構成する核酸分子の塩基配列における特定位置の特定塩基種の存在を推定するための装置であって、
各核酸分子由来のリード配列を取得するリード配列取得部と、
取得したリード配列をマッピング装置に供給し、前記マッピング装置による参照配列上へのマッピング結果を取得するマッピング情報取得部と、
マッピングしたリード配列のクラスタリング結果を得る第1のクラスタリング結果取得部と、
各クラスタリングに含まれるリード配列を用いて、ローカルアセンブリを実行し、アセンブル配列のクラスタリング結果を得る第2のクラスタリング結果取得部と、
第2のクラスタリング結果取得部のクラスタリング結果を基にコンセンサスリードとコンセンサスリード中の各塩基に対応する信頼性スコアからなるコンセンサスリード情報取得部と、
を備えた、推定装置。
(24) An apparatus for estimating the presence of a specific base type at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture,
A lead sequence acquisition unit for acquiring a lead sequence derived from each nucleic acid molecule;
A mapping information acquisition unit that supplies the acquired lead array to a mapping device and acquires the mapping result on the reference array by the mapping device;
A first clustering result acquisition unit for acquiring a clustering result of the mapped lead sequence;
A second clustering result acquisition unit that executes a local assembly using a lead sequence included in each clustering and obtains a clustering result of the assembly sequence;
A consensus lead information acquisition unit comprising a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result of the second clustering result acquisition unit;
An estimation device equipped with
10 変異検出装置
11 制御部
13 表示部
15 操作部
16 RAM
17 データ記憶部
18 インタフェース部
50 シーケンサー
100 変異検出システム
DESCRIPTION OF
17
Claims (24)
a)各核酸分子に由来するリード配列を取得する工程、
b)工程aで取得したリード配列を参照配列上にマッピングしマッピング結果を取得する工程、
c)工程bで得られたマッピング結果を基にリード配列のクラスタリング結果を取得する工程、
d)クラスタリング結果を基にコンセンサスリードとコンセンサスリード中の各塩基に対応する信頼性スコアからなるコンセンサスリード情報を取得する工程、
e)クローナル核酸混合物を構成する各核酸分子から工程a~dを経て得られるコンセンサスリードの集合に含まれる各コンセンサスリードの配列情報より、参照配列上の特定位置に対応する位置に存在する塩基種とその信頼性スコアを選抜する工程、
f)工程eで使用した特定位置を含むコンセンサスリードの総数に対する、工程eで選抜した特定位置に特定の各塩基種を含有するコンセンサスリード数の比率を取得する工程、
g)工程dで取得した信頼性スコアを用いたパラメトリックなエラーモデルにより、設定された閾値または有意水準での工程eで選抜した特定位置において解析エラーによって特定の塩基が出現する出現率を推定する工程、
h)工程fで得られた比率が工程gにより得られた出現率より有意に高い場合に、工程eで選抜した特定位置に特定の塩基種を有する核酸分子が、工程gで設定された閾値または有意水準でクローナル核酸混合物中に存在すると推定する工程、
を含む方法。 A method of using a computer to estimate the presence of a specific base species present at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture,
a) obtaining a lead sequence derived from each nucleic acid molecule,
b) mapping the read sequence obtained in step a onto a reference sequence to obtain a mapping result;
c) obtaining a clustering result of the lead sequence based on the mapping result obtained in step b,
d) obtaining consensus lead information consisting of a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result,
e) From the sequence information of each consensus lead contained in the set of consensus reads obtained from each nucleic acid molecule constituting the clonal nucleic acid mixture through steps a to d, the base type present at the position corresponding to the specific position on the reference sequence And the process of selecting their reliability score,
f) obtaining a ratio of the number of consensus reads containing each specific base species at the particular position selected in step e to the total number of consensus leads including the particular position used in step e,
g) Estimate the occurrence rate of appearance of a specific base due to an analysis error at a specified position selected in step e at a set threshold or significance level by a parametric error model using the reliability score obtained in step d Process,
h) When the ratio obtained in step f is significantly higher than the appearance rate obtained in step g, the threshold value set in step g is a nucleic acid molecule having a specific base type at a specific position selected in step e. Or estimating at a significant level as present in the clonal nucleic acid mixture,
Method including.
a)各核酸分子に由来するリード配列を取得する工程、
b)工程aで取得したリード配列を参照配列上にマッピングしマッピング結果を取得する工程、
c)工程bで得られたマッピング結果を基にリード配列のクラスタリング結果を取得する工程、
d)クラスタリング結果を基にコンセンサスリードとコンセンサスリード中の各塩基に対応する信頼性スコアからなるコンセンサスリード情報を取得する工程、
e)クローナル核酸混合物を構成する各核酸分子から工程a~dを経て得られるコンセンサスリードの集合に含まれる各コンセンサスリードの配列情報より、参照配列上の特定位置に対応する位置に存在する塩基種とその信頼性スコアを選抜する工程、
f)工程eで使用した特定位置を含むコンセンサスリードの総数に対する、工程eで選抜した特定位置に特有の各塩基種を含有するコンセンサスリード数の比率を取得する工程、
g)工程dで取得した信頼性スコアを用いたパラメトリックなエラーモデルにより、設定された閾値または有意水準での工程eで選抜した特定位置において解析エラーによって特有の塩基が出現する出現率を推定する工程、
h)工程fで得られた比率が工程gにより得られた出現率より有意に高い場合に、工程eで選抜した特定位置に特定の塩基種を有する核酸分子が、工程gで設定された閾値または有意水準でクローナル核酸混合物中に存在すると推定する工程、及び
i)工程hにおいて工程eで選抜した塩基種が存在すると判断された場合、工程fで得られた特定位置において塩基種を含有するコンセンサスリードの数から、工程gで設定された閾値または有意水準における工程eで選抜した特定位置における特定塩基種の比率を算出し、クローナル核酸混合物中で存在する割合として採用する工程、
を含む方法。 A method of using a computer to estimate the proportion of a specific base species present at a specific position in a nucleotide sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture in the clonal nucleic acid mixture,
a) obtaining a lead sequence derived from each nucleic acid molecule,
b) mapping the read sequence obtained in step a onto a reference sequence to obtain a mapping result;
c) obtaining a clustering result of the lead sequence based on the mapping result obtained in step b,
d) obtaining consensus lead information consisting of a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result,
e) From the sequence information of each consensus lead contained in the set of consensus reads obtained from each nucleic acid molecule constituting the clonal nucleic acid mixture through steps a to d, the base type present at the position corresponding to the specific position on the reference sequence And the process of selecting their reliability score,
f) obtaining a ratio of the number of consensus reads containing each base species specific to the particular position selected in step e to the total number of consensus leads including the particular position used in step e,
g) Estimate the occurrence rate of a distinctive base due to an analysis error at a specific position selected in step e at a set threshold or significance level, using a parametric error model using the reliability score obtained in step d Process,
h) When the ratio obtained in step f is significantly higher than the appearance rate obtained in step g, the threshold value set in step g is a nucleic acid molecule having a specific base type at a specific position selected in step e. Or a step of presuming presence in clonal nucleic acid mixture at a significant level, and i) containing the base species at the specific position obtained in step f, when it is judged that the base species selected in step e is present in step h Calculating the ratio of the specific base species at the specific position selected in step e at the threshold or significance level set in step g from the number of consensus leads, and adopting it as the ratio present in the clonal nucleic acid mixture,
Method including.
a)各核酸分子に由来するリード配列を取得する工程、
b)工程aで取得したリード配列を参照配列上にマッピングしマッピング結果を取得する工程、
c)工程bで得られたマッピング結果を基にリード配列のクラスタリング結果を取得する工程、
d)クラスタリング結果を基にコンセンサスリードとコンセンサスリード中の各塩基に対応する信頼性スコアからなるコンセンサスリード情報を取得する工程、
e)クローナル核酸混合物を構成する各核酸分子から工程a~dを経て得られるコンセンサスリードの集合に含まれる各コンセンサスリードの配列情報より、参照配列上の特定位置に対応する位置に存在する塩基種とその信頼性スコアを選抜する工程、
f)工程eで使用した特定位置を含むコンセンサスリードの総数に対する、工程eで選抜した特定位置に特定の各塩基種を含有するコンセンサスリード数の比率を取得する工程、
g)工程eで選抜した特定位置の塩基種に対応する工程dで取得した信頼性スコアを用いたパラメトリックなエラーモデルにより、設定された閾値または有意水準での工程eで確認した特定位置における塩基種の解析エラーによる出現率を推定する工程、
h)工程fで得られた比率が工程gにより得られた出現率より有意に高い場合に、工程eで選抜した特定位置に特定の塩基種を有する核酸分子が、工程gで設定された閾値または有意水準でクローナル核酸混合物中に存在する、あるいは該出現率より低い場合は特定位置の塩基種は不明、と判断する工程、
i)工程hで得られた結果を該塩基の各コンセンサスリードに適用し二次配列情報を作成する工程、及び
j)工程eにおける塩基位置を変更し、工程e~iを繰り返すことにより得られた特定領域の全二次配列情報を統合し、クローナル核酸混合物を構成する各核酸分子の特定領域の塩基配列として採用する工程、
を含む方法。 A method of estimating the base sequence of a specific region of each nucleic acid molecule constituting a clonal nucleic acid mixture, using a computer, comprising:
a) obtaining a lead sequence derived from each nucleic acid molecule,
b) mapping the read sequence obtained in step a onto a reference sequence to obtain a mapping result;
c) obtaining a clustering result of the lead sequence based on the mapping result obtained in step b,
d) obtaining consensus lead information comprising a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result,
e) From the sequence information of each consensus lead contained in the set of consensus reads obtained from each nucleic acid molecule constituting the clonal nucleic acid mixture through steps a to d, the base type present at the position corresponding to the specific position on the reference sequence And the process of selecting their reliability score,
f) obtaining a ratio of the number of consensus reads containing each specific base species at the particular position selected in step e to the total number of consensus leads including the particular position used in step e,
g) The base at the specific position confirmed in step e at a set threshold or significance level by the parametric error model using the reliability score obtained in step d corresponding to the base type at the specific position selected in step e Estimating the rate of occurrence due to analysis errors of the species,
h) When the ratio obtained in step f is significantly higher than the appearance rate obtained in step g, the threshold value set in step g is a nucleic acid molecule having a specific base type at a specific position selected in step e. Or judging that the base species at the specific position is unknown if it is present in the clonal nucleic acid mixture at a significant level or lower than the occurrence rate,
obtained by applying i) the result obtained in step h to each consensus lead of the base to create secondary sequence information, and j) changing the base position in step e and repeating steps e to i. Integrating all secondary sequence information of the specific region and adopting it as the base sequence of the specific region of each nucleic acid molecule constituting the clonal nucleic acid mixture,
Method including.
k)工程cで得られた各クラスタリングに含まれるリード配列を用いて、ローカルアセンブリを実行し、アセンブル配列のクラスタリング結果を取得する工程、
を行うことを特徴とする請求項1~3のいずれかに記載の方法。 Next to step c
k) performing local assembly using the read sequences included in each clustering obtained in step c, and obtaining clustering results of the assemble sequences;
The method according to any one of claims 1 to 3, characterized in that
各核酸分子由来のリード配列を取得するリード配列取得部と、
取得したリード配列をマッピング装置に供給し、前記マッピング装置による参照配列上へのマッピング結果を取得するマッピング情報取得部と、
マッピングしたリード配列のクラスタリング結果を得るクラスタリング結果取得部と、
クラスタリング結果を基にコンセンサスリードとコンセンサスリード中の各塩基に対応する信頼性スコアからなるコンセンサスリード情報取得部と、
を備えた
推定装置。 An apparatus for estimating the presence of a specific base type at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture, comprising:
A lead sequence acquisition unit for acquiring a lead sequence derived from each nucleic acid molecule;
A mapping information acquisition unit that supplies the acquired lead array to a mapping device and acquires the mapping result on the reference array by the mapping device;
A clustering result acquisition unit for acquiring a clustering result of the mapped lead sequence;
A consensus lead information acquisition unit comprising a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result;
Estimation device equipped with
各核酸分子由来のリード配列を取得するリード配列取得部と、
取得したリード配列をマッピング装置に供給し、前記マッピング装置による参照配列上へのマッピング結果を取得するマッピング情報取得部と、
マッピングしたリード配列のクラスタリング結果を得る第1のクラスタリング結果取得部と、
各クラスタリングに含まれるリード配列を用いて、ローカルアセンブリを実行し、アセンブル配列のクラスタリング結果を得る第2のクラスタリング結果取得部と、
第2のクラスタリング結果取得部のクラスタリング結果を基にコンセンサスリードとコンセンサスリード中の各塩基に対応する信頼性スコアからなるコンセンサスリード情報取得部と、
を備えた
推定装置。 An apparatus for estimating the presence of a specific base type at a specific position in the base sequence of a nucleic acid molecule constituting a clonal nucleic acid mixture, comprising:
A lead sequence acquisition unit for acquiring a lead sequence derived from each nucleic acid molecule;
A mapping information acquisition unit that supplies the acquired lead array to a mapping device and acquires the mapping result on the reference array by the mapping device;
A first clustering result acquisition unit for acquiring a clustering result of the mapped lead sequence;
A second clustering result acquisition unit that executes a local assembly using a lead sequence included in each clustering and obtains a clustering result of the assembly sequence;
A consensus lead information acquisition unit comprising a consensus lead and a reliability score corresponding to each base in the consensus lead based on the clustering result of the second clustering result acquisition unit;
Estimation device equipped with
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2019562510A JPWO2019132010A1 (en) | 2017-12-28 | 2018-12-28 | Methods, devices and programs for estimating base species in a base sequence |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2017-253510 | 2017-12-28 | ||
| JP2017253510 | 2017-12-28 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019132010A1 true WO2019132010A1 (en) | 2019-07-04 |
Family
ID=67063856
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2018/048502 Ceased WO2019132010A1 (en) | 2017-12-28 | 2018-12-28 | Method, apparatus and program for estimating base type in base sequence |
Country Status (2)
| Country | Link |
|---|---|
| JP (1) | JPWO2019132010A1 (en) |
| WO (1) | WO2019132010A1 (en) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114150047A (en) * | 2020-12-29 | 2022-03-08 | 阅尔基因技术(苏州)有限公司 | Method for evaluating base damage, mismatching and variation in sample DNA by one-generation sequencing |
| JP2022551261A (en) * | 2019-10-01 | 2022-12-08 | コーニンクレッカ フィリップス エヌ ヴェ | Systems and methods for efficient identification and extraction of sequence pathways in genome graphs |
| CN115831233A (en) * | 2023-02-07 | 2023-03-21 | 杭州联川基因诊断技术有限公司 | A method, device and medium for mTag-based targeted sequencing data preprocessing |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20120245039A1 (en) * | 2009-09-22 | 2012-09-27 | President And Fellows Of Harvard College | Entangled Mate Sequencing |
| US20130217006A1 (en) * | 2008-11-20 | 2013-08-22 | Pacific Biosciences Of California, Inc. | Algorithms for sequence determination |
| US20160273049A1 (en) * | 2015-03-16 | 2016-09-22 | Personal Genome Diagnostics, Inc. | Systems and methods for analyzing nucleic acid |
| JP2017506875A (en) * | 2013-12-28 | 2017-03-16 | ガーダント ヘルス, インコーポレイテッド | Methods and systems for detecting genetic variants |
| JP2017099400A (en) * | 2014-07-02 | 2017-06-08 | 株式会社Dnaチップ研究所 | Nucleic acid molecule counting method |
| JP2017517282A (en) * | 2014-05-23 | 2017-06-29 | ユニバーシティ オブ テクノロジー,シドニー | Sequencing process |
| WO2017191073A1 (en) * | 2016-05-01 | 2017-11-09 | Genome Research Limited | Mutational signatures in cancer |
-
2018
- 2018-12-28 JP JP2019562510A patent/JPWO2019132010A1/en active Pending
- 2018-12-28 WO PCT/JP2018/048502 patent/WO2019132010A1/en not_active Ceased
Patent Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20130217006A1 (en) * | 2008-11-20 | 2013-08-22 | Pacific Biosciences Of California, Inc. | Algorithms for sequence determination |
| US20120245039A1 (en) * | 2009-09-22 | 2012-09-27 | President And Fellows Of Harvard College | Entangled Mate Sequencing |
| JP2017506875A (en) * | 2013-12-28 | 2017-03-16 | ガーダント ヘルス, インコーポレイテッド | Methods and systems for detecting genetic variants |
| JP2017517282A (en) * | 2014-05-23 | 2017-06-29 | ユニバーシティ オブ テクノロジー,シドニー | Sequencing process |
| JP2017099400A (en) * | 2014-07-02 | 2017-06-08 | 株式会社Dnaチップ研究所 | Nucleic acid molecule counting method |
| US20160273049A1 (en) * | 2015-03-16 | 2016-09-22 | Personal Genome Diagnostics, Inc. | Systems and methods for analyzing nucleic acid |
| WO2017191073A1 (en) * | 2016-05-01 | 2017-11-09 | Genome Research Limited | Mutational signatures in cancer |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2022551261A (en) * | 2019-10-01 | 2022-12-08 | コーニンクレッカ フィリップス エヌ ヴェ | Systems and methods for efficient identification and extraction of sequence pathways in genome graphs |
| CN114150047A (en) * | 2020-12-29 | 2022-03-08 | 阅尔基因技术(苏州)有限公司 | Method for evaluating base damage, mismatching and variation in sample DNA by one-generation sequencing |
| CN115831233A (en) * | 2023-02-07 | 2023-03-21 | 杭州联川基因诊断技术有限公司 | A method, device and medium for mTag-based targeted sequencing data preprocessing |
| CN117116348A (en) * | 2023-02-07 | 2023-11-24 | 杭州联川基因诊断技术有限公司 | Methods, equipment and media for correcting mTag sequences of targeted sequencing data |
Also Published As
| Publication number | Publication date |
|---|---|
| JPWO2019132010A1 (en) | 2021-01-21 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP7684708B2 (en) | Non-invasive prenatal molecular karyotyping of maternal plasma | |
| Smith et al. | UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy | |
| EP3143537B1 (en) | Rare variant calls in ultra-deep sequencing | |
| AU2022203114A1 (en) | Detecting mutations for cancer screening and fetal analysis | |
| KR102638152B1 (en) | Verification method and system for sequence variant calling | |
| CN104937598B (en) | The accurate and quick positioning of the sequencing reading value of targeting | |
| US20220130488A1 (en) | Methods for detecting copy-number variations in next-generation sequencing | |
| CN107229841B (en) | A kind of genetic mutation appraisal procedure and system | |
| US20190108311A1 (en) | Site-specific noise model for targeted sequencing | |
| WO2019132010A1 (en) | Method, apparatus and program for estimating base type in base sequence | |
| Prjibelski et al. | IsoQuant: a tool for accurate novel isoform discovery with long reads | |
| WO2019213810A1 (en) | Method, apparatus, and system for detecting chromosome aneuploidy | |
| Hu et al. | Processing UMI Datasets at High Accuracy and Efficiency with the Sentieon ctDNA Analysis Pipeline | |
| Smith et al. | Considerations of Depth, Coverage, and Other Read Quality Metrics | |
| US20240412808A1 (en) | Detection of cystic fibrosis transmembrane conductance regulator polytg/polyt variations by an ngs-based method | |
| US20220399079A1 (en) | Method and system for combined dna-rna sequencing analysis to enhance variant-calling performance and characterize variant expression status | |
| KR20250092241A (en) | Nucleic acid error suppression | |
| WO2025221998A1 (en) | Systems and methods for variant calling | |
| Prodanov | Read Mapping, Variant Calling, and Copy Number Variation Detection in Segmental Duplications | |
| Lim | Copy number estimation for high-throughput short read shotgun sequencing de novo whole-genome assembly contigs | |
| WO2025155949A1 (en) | Liquid biopsy assay for genomic profiling of circulating tumor dna | |
| WO2025237449A2 (en) | Method, device, and program product for predicting met14 skipping mutation | |
| CN114708905A (en) | Chromosome aneuploidy detection method, device, medium and equipment based on NGS | |
| HK40004815A (en) | Method and device for acquiring fetal free dna concentration | |
| HK40004815B (en) | Method and device for acquiring fetal free dna concentration |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18895469 Country of ref document: EP Kind code of ref document: A1 |
|
| ENP | Entry into the national phase |
Ref document number: 2019562510 Country of ref document: JP Kind code of ref document: A |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 18895469 Country of ref document: EP Kind code of ref document: A1 |