WO2004015609A2 - Method and system for finding mutations in DNA sequences and interpreting their consequences - Google Patents
- Publication number
- WO2004015609A2 (PCT/IB2003/003195)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- base
- bases
- sequence
- sequences
- list
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Definitions
- The present invention relates to computer-aided diagnosis of genetic diseases.
- It helps identify single-point mutations and frame shift mutations in specific genes or loci investigated in a living entity. It reports information stored in a database about the mutations found (such as their functional impact) and gives a quantitative measure of the quality of the analyzed sequence, helping the user decide whether part or all of the genomic sequence needs to be re-sequenced. If a new mutation is found, the user is informed and asked to provide further information to extend the knowledge held in the database. Further applications include, for example, identifying genetically modified organisms and determining the abnormal nature of diseased cells.
- This specification describes a method and a system for finding SNPs in DNA sequences.
- The method uses a neural network approach to identify locations with a significant difference between a reference signal and the signal being analyzed.
- The neural network has been trained on examples of mutations.
- The specification does not describe methods or systems for translating known sequences into amino acids, for evaluating the functional impact of the mutations found, or for searching a database.
- Genetic testing using high throughput sequencing of DNA with electrophoresis instruments is becoming a routine job in large public hospitals, as well as in commercial facilities.
- The objectives of such tests are many: diagnosing patients with genetic diseases in order to adapt treatments (e.g. familial hypercholesterolemia), prenatal tests of future parents and offspring for genetic diseases (e.g. CFTR), and defining the nature of cells found in tumors.
- The principles of the Sanger method are well known: fragments of DNA with lengths varying between a few dozen and a few hundred bases are synthesized by copying the patient's DNA. Each fragment is terminated by a fluorescent dye with a different color for each of the bases A, T, C and G.
- When the gene or loci have been sequenced on genomic DNA, the files contain a part called the promoter, parts called exons, which mostly encode the protein, parts called introns, which connect the exons to one another, and intergenic sequences.
- When the gene has been sequenced on cDNA, or for mitochondrial DNA (mtDNA), the files contain only the protein-coding part and overlap one another. For each patient or tumor, a set of files is generated covering the gene, or the important parts of the genes or loci, to be analyzed.
- The steps followed by an expert in molecular genetics for the diagnosis of a disease are shown in figure 4; the actual diagnostic work (done manually by the expert) runs from step 3 to step 9:
- Step 1 is taking a sample containing DNA, typically either from white blood cells for a patient or diseased cells with mutated DNA.
- Step 2 is extracting the DNA by standard methods, performing chemical reactions that consist in amplifying it with PCR and finally sequencing the part of the DNA containing the gene (or genes or loci) to be analyzed.
- Step 3 is performing base calling and printing out the set of files to be analyzed, with the called bases placed above the curves, checking each set for completeness and sorting the files set by set. This step and the following ones have to be repeated as many times as there are files in a set.
- Step 4 is visually checking the quality of the signal by looking at the graphs for relatively crude signs such as saturated peaks and sequences that are too short or aborted. It also involves comparing similar files from different patients in order to detect variations, preferably using alignment software. If a run shows signs of poor quality, it is re-sequenced (step 2).
- Step 5 is finding the beginning of the exon or exons in a file, i.e. aligning the test file with the reference file. This can be done either manually, by comparing the bases indicated above the graph of the run with a reference sequence, or with the help of alignment software such as SeqScape from Applied BioSystems, where a user can enter a reference sequence and have it compared with the file to be analyzed.
- Step 6 is comparing the run to be analyzed with the reference, finding all discrepancies and evaluating their potential for being real mutations.
- Step 7 is determining, for each potential SNP, which codon (set of three consecutive bases) contains the changed base and whether the change of base in this codon leads to a change in the amino acid. For frame shifts, the expert checks whether a base has been added or lost on at least one of the strands of DNA. He also checks whether the SNP affects the probability of efficient splicing.
- Step 8 is looking up in personal and worldwide databases whether the mutations found are known in diseased individuals and not described in healthy individuals. It also involves checking whether the SNP modifies an evolutionarily conserved residue (across animal species or across gene families), or to which group/manufacturer of GMO (genetically modified organism) the analyzed living organism belongs.
- If there are further files in the set, the expert has to start again with the next run at step 5.
- Step 9 is summarizing the findings in a report, once all the runs of a given set have been analyzed.
- As the reader can see, a large part of the expert's work today is done manually, leaving room for human error, misinterpretation, variation between analyses, etc.
- the objective of the invention is to offer a method and a system to assist the expert in his task of diagnosing a genetic disease.
- the invention integrates step 4 through step 9 and guides the expert through the entire set of files having to be analyzed for a given patient.
- the method and system are locus specific, the analysis is done on known genes, sequences or loci.
- the major novelty comes from integrating a knowledge base containing known mutations and their functional impact into a method for finding mutations, said knowledge base being supplied with new information every time a mutation is found by the user.
- the novelty also comes from treating a set of files related to one patient or organism, rather than a set of files from different patients containing the same part of a sequence.
- the method and system provide the expert with quantitative data on which he will establish his diagnosis:
- the invention relates thus to a method for computing in a computing system from:
- a base caller file Fj of a part of a DNA sequence of a patient said file giving for each base position of the base sequence of the patient a plurality of characteristics selected among the group consisting of at least the most intense signal (such as a peak), the intensity thereof, the position thereof, the second most intense peak or signal, the intensity thereof, the position thereof, parameters function of the intensity of said most intense peak, parameters function of the quality of said most intense signal, parameters function of the intensity of said second most intense peak, parameters function of the quality of said second most intense peak (signal) and parameters function of the intensity of said most intense peak and of said second most intense peak or signal, and
- the possible mutated bases are determined by comparing the base sequences of the patient list with the corresponding reference sequence, and by comparing the base sequences of the list of corresponding reference sequences with the corresponding base sequences of the patient.
- the method further comprises the computing step of comparing the determined mutation with a list of known mutations.
- the method computes in a computing system from:
- a first base caller file Fj of a DNA base sequence of a patient, said file giving, for each base position of the base sequence of the patient, at least the most intense peak and a parameter function of the intensity of said most intense peak;
- a second base caller file Fj- selected from the group consisting of the base caller file corresponding substantially to the reverse sequence of the first base caller file and the base caller file corresponding substantially to the reverse sequence of a base caller file different from the first base caller file; and
- a list of reference sequences selected from the group consisting of wild type sequences and non mutated reference sequences, with a list of bases and a parameter function of the intensity of the signal of the bases detected by a machine;
- whereby the method comprises the following computing steps:
- i. For the first and second base caller files, comparing a series of successive bases with the most intense peak or signal with the bases of the reference sequences so as to determine at least a portion of the base sequence of the patient substantially corresponding to a portion of at least one reference sequence.
- ii. Determining a patient list of bases of the first and second base caller files substantially corresponding to at least one reference sequence, and a list of corresponding reference sequences to which correspond bases of one of the first and second base caller files.
- iii. Searching mutated bases between the base sequences of the patient list and the corresponding reference sequences.
- the method computes in a computing system from:
- a first base caller file Fj of a DNA base sequence of a patient, said file giving, for each base position of the base sequence of the patient, at least the most intense peak or signal and a parameter function of the intensity of said most intense peak or signal;
- a second base caller file F selected from the group consisting of the base caller file corresponding substantially to the reverse sequence of the first base caller file and the base caller file corresponding substantially to the reverse sequence of a base caller file different from the first base caller file;
- the method comprises the following computing steps:
- i. for the first base caller file, comparing a series of successive bases with the most intense peak or signal with the bases of the reference sequences in the first and second lists so as to determine at least a portion of the base sequence of the patient substantially corresponding to a portion of a reference sequence in either list of reference sequences; for the second base caller file, comparing a series of successive bases with the most intense peak with the bases of the reference sequences in the first and second lists of reference sequences so as to determine at least a portion of the base sequence of the patient substantially corresponding to a portion of a reference sequence in either list of reference sequences;
- ii.
- said method further comprises the computing step of editing the base sequences with possible mutation in at least one predetermined orientation.
- the computing step of determining a patient list of base sequences of the base caller file substantially corresponding to at least one reference sequence comprises a step of withdrawing a determined base sequence corresponding to a reference sequence when said determined base sequence has been determined at least twice.
- a quality factor is preferably determined for the base sequences corresponding to the reference sequences, while the step of withdrawing a determined base sequence uses at least one selection criterion consisting in comparing the quality of the base sequence at least twice determined so as to determine and withdraw one base sequence of said twice determined base sequence which has the lowest quality factor.
- the method further comprises a step of recomposing a base sequence by using at least one determined element selected from the group consisting of single bases, base sequence portions, base sequences and combinations thereof, said step using at least one selection criterion consisting of comparing the quality of the corresponding elements in the base sequences determined at least twice, so as to determine and reformat an adapted base sequence including the elements having the best quality factor.
- the base sequence has an exon portion, while the determined mutated bases of at least the exon portion are each characterized by an element selected from the group consisting of amino acids and stop codons coded by the codon of mutated bases.
- Said method further comprises a computing step of comparing the element characterizing each mutated base with a corresponding element characterizing a reference base for mutations found at least in the exon portion.
- the base sequence has sequence motifs of functional importance, while the determined mutated bases are each characterized by a probability of efficient splicing, said method further comprising a computing step of comparing the probability of efficient splicing characterizing each mutated base with a probability of efficient splicing characterizing the corresponding reference base.
- the base sequence has sequence motifs of functional importance (in the promoter and in the intergenic region), while the determined mutated bases are each characterized by a transcription efficiency, said method further comprising a computing step of comparing the transcription efficiency characterizing each mutated base with a transcription efficiency characterizing the corresponding reference base.
- the method further comprises the computing step of comparing at least one element consisting of mutated bases, amino acids and combinations thereof with one element consisting of bases, amino acids and combinations thereof of a database containing a list of elements selected from the group consisting of bases, amino acids and combinations thereof with known mutations.
- the method further comprises the computing step of comparing at least one element consisting of mutated bases, amino acids and combinations thereof with one element consisting of bases, amino acids and combinations thereof of a database containing a list of elements selected from the group consisting of bases, amino acids and combinations thereof with known mutations, and the step of communicating the determined mutation to an external database for determining whether or not the mutation is known.
- the method further comprises the computing step of determining at least one mutation from the mutated base and comparing the at least one determined mutation with a list of known mutations and editing a table giving for each determined mutation at least three parameters selected from the group consisting of gene name, intergenic sequence, promoter, exon number, intron number, mutation type, base changed to, base position in the intergenic/promoter/exon/intron, amino acid, amino acid number, codon, codon number, graphs, sequencing technology used to detect mutation, source, name and address of the researcher who found this mutation, haplotype configuration, hyperlinks to other databases, information about conservation through different species and genes of nucleotide or amino acid, and general comments.
- the method of the invention can further comprise the step of automatically reversing at least one portion of a base caller file Fj.
- the determination of possible mutated bases comprises at least the following steps: determination of homozygous mutation for elements selected from the group consisting of bases, amino acids and combinations thereof at determined positions of the base sequences of the patient list, and determination of heterozygous mutation for elements selected from the group consisting of bases, amino acids and combinations thereof at determined positions of the base sequences of the patient list, and/or
- the determination of possible mutated bases comprises at least the step of determining mutation for bases at determined positions of the base sequence of the patient list, whereby said step comprises at least the following instructions for each of said determined positions:
- the determination of possible mutated bases comprises at least the step of determining homozygous mutation for bases at determined positions of the base sequence of the patient list, whereby said step comprises at least the following instructions for each of said determined positions:
- the determination of possible mutated bases comprises at least the step of determining heterozygous mutation for bases at determined positions of the base sequence of the patient list, whereby said step comprises at least the following instructions for each of said determined positions: determining at least three different criteria, and deciding, on the basis of a decision table, a mutation probability of the base at the position considered.
- Said at least three criteria are preferably :
- the factor is function of at least three ratios between the peak or signal intensity of the base considered and the peak or signal intensity of three neighboring bases, and function of at least three ratios between the peak or signal intensity of the reference base corresponding to the base considered and the peak or signal intensity of the three corresponding neighboring reference bases.
- At least three criteria are: the ratio (average quality of an element selected from the group consisting of the considered base and a first sequence portion including the considered base) / (average quality of a base environment which is larger than the element), said base environment being selected from the group consisting of an environment including the element, an environment not including the element and an environment partly including the element;
- the factor is function of at least three ratios between the peak intensity of the base considered and the peak intensity of three neighboring bases, and function of at least three ratios between the peak intensity of another patient's sequence base corresponding to the base considered and the peak intensity of three neighboring bases of said other patient's sequence, corresponding to said neighboring bases.
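The criteria described above can be illustrated with a minimal sketch. It assumes that each criterion reduces to a single ratio compared with a cut-off, and that a heterozygous candidate must satisfy all three; the text instead refers to a decision table that is not reproduced here. The class name, field names, default cut-off values and the `drop_cut_off` parameter are hypothetical; `ratio_cut_off_value`, `hetero_surface_cut_off` and `homo_surface_cut_off` only mirror the user parameters named in the description.

```python
from dataclasses import dataclass


@dataclass
class PositionSignal:
    """Signal summary for one base position (all field names are assumptions)."""
    called_base: str           # base of the most intense peak
    reference_base: str        # base expected from the reference sequence
    quality: float             # quality at the position considered
    environment_quality: float # average quality of the surrounding bases
    first_area: float          # area of the most intense signal
    second_area: float         # area of the second most intense signal
    neighbor_ratio: float      # wild type intensity / neighbor intensity (patient)
    ref_neighbor_ratio: float  # same ratio measured on the reference


def homozygous_candidate(p: PositionSignal, homo_surface_cut_off: float = 0.5) -> bool:
    """Called base differs from the reference and clearly dominates the second peak."""
    if p.first_area <= 0:
        return False
    dominant = p.second_area / p.first_area < homo_surface_cut_off
    return p.called_base != p.reference_base and dominant


def heterozygous_candidate(
    p: PositionSignal,
    ratio_cut_off_value: float = 0.5,
    hetero_surface_cut_off: float = 0.5,
    drop_cut_off: float = 0.8,
) -> bool:
    """Combine three criteria; requiring all of them is an assumption of this sketch."""
    if p.environment_quality <= 0 or p.first_area <= 0 or p.ref_neighbor_ratio <= 0:
        return False
    # 1. quality drops around the considered peak
    quality_drop = p.quality / p.environment_quality < ratio_cut_off_value
    # 2. the two most intense signals are comparable in area
    comparable = p.second_area / p.first_area > hetero_surface_cut_off
    # 3. the wild type peak is reduced relative to its neighbors, compared with the reference
    reduced = p.neighbor_ratio / p.ref_neighbor_ratio < drop_cut_off
    return quality_drop and comparable and reduced
```

A real implementation would presumably return a mutation probability from the decision table rather than a boolean.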
- When at least three substantially successive mutated bases are detected, the method further comprises a control step:
- Edit the patient file by inserting a number of consecutive blank bases before the first mutated base.
- Compare the edited sequence, from the insertion point onwards, with the reference file. If the number of bases (most intense or second most intense) equal to the corresponding reference bases has increased significantly, the method signals a frame shift corresponding to a deletion of said number of bases and indicates the nature of the deleted bases.
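The control step for deletions can be sketched as follows. This is a simplified illustration: it uses only the most intense base at each position (the text also allows the second most intense base), it assumes patient and reference sequences are already aligned at position 0, and the function names, `max_gap` and the `min_gain` threshold standing in for "increased significantly" are all hypothetical.

```python
from typing import Optional, Tuple


def count_matches(seq: str, ref: str, start: int) -> int:
    """Number of positions from `start` onwards where seq and ref agree."""
    return sum(1 for a, b in zip(seq[start:], ref[start:]) if a == b)


def detect_deletion(
    patient: str,
    reference: str,
    first_mutated: int,
    max_gap: int = 3,
    min_gain: int = 5,
) -> Optional[Tuple[int, str]]:
    """Insert 1..max_gap blank bases before the first mutated base and
    report the first gap size that raises the match count by at least
    min_gain, together with the reference bases that were deleted."""
    baseline = count_matches(patient, reference, first_mutated)
    for n in range(1, max_gap + 1):
        edited = patient[:first_mutated] + "-" * n + patient[first_mutated:]
        gain = count_matches(edited, reference, first_mutated) - baseline
        if gain >= min_gain:
            return n, reference[first_mutated:first_mutated + n]
    return None


# Example: the reference bases "TG" at positions 4-5 are missing in the patient.
print(detect_deletion("ACGTCATGCAAGCT", "ACGTTGCATGCAAGCT", 4))
```

An insertion in the patient sequence could be checked symmetrically by inserting blanks into the reference instead.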
- the method further comprises a step of finding a stop codon in the mutated allele.
- the method translates the codons in the mutated allele from the mutation point onwards into the group consisting of amino acids and stop codons, and reports the position of the first stop codon it detects.
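A minimal sketch of this translation step is given below. It assumes that the mutated allele is supplied as a coding sequence whose first base is the first base of codon 1, and it uses the standard genetic code; the function name and the 0-based `mutation_pos` argument are assumptions of the sketch.

```python
from itertools import product
from typing import Optional

# Standard genetic code, codons enumerated in T, C, A, G order; "*" marks a stop codon.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def first_stop_codon(coding_seq: str, mutation_pos: int) -> Optional[int]:
    """Translate from the codon containing mutation_pos (0-based) onwards
    and return the 1-based number of the first stop codon, or None."""
    start = mutation_pos - mutation_pos % 3
    for i in range(start, len(coding_seq) - 2, 3):
        # Unknown codons (e.g. containing N) are treated as non-stop.
        if CODON_TABLE.get(coding_seq[i:i + 3].upper()) == "*":
            return i // 3 + 1
    return None


print(first_stop_codon("ATGGCCTGAAAA", 6))
```

For a frame shift, the same function could be applied to the edited allele so that the new reading frame is followed from the mutation point.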
- the invention relates also to a machine programmed for executing computing steps of a method of the invention as disclosed here above.
- the invention further relates to a support readable by a computing system, said support being provided with readable instructions for computing in a computing machine, one or more steps of the method of the invention.
- the invention still relates to a computer program comprising instructions for carrying computing steps of the method of the invention.
- Fig. 1 shows schematically a gene that has been sequenced on genomic DNA from a eukaryote, wherein the three boxes symbolize the files, the gray parts the exons and the white parts the introns, a file being able to contain one or several exons.
- Fig. 2 shows schematically a gene that has been sequenced on mtDNA or cDNA, whereby the cross hatched part is the overlap where both files contain data about the same part of the gene.
- Fig. 3 is an example of a graph where four curves show the signal intensity as a function of migration distance and length of the DNA fragments.
- Fig. 4 shows schematically today's manual process for diagnosing a genetic disease.
- Fig. 5 is an example of a wild type sequence compared with a heterozygous mutation. Note the double peak, as well as the reduction in surface area of the wild type base G in Fig. 5 b relative to its neighbors.
- Fig. 6 is a list of parameters used by the system to search for mutations.
- Fig. 7 gives possible criteria for a homozygous mutation, based on selecting peaks whose base differs from the reference and whose intensity is sufficient compared with the second most intense peak.
- Fig. 8 gives possible criteria for a heterozygous mutation, based on three criteria: the first is a drop in quality around the considered peak, which arises because a heterozygous mutation resembles an ambiguous signal; the second compares the areas of the two most intense signals (if a heterozygous mutation is present, the signals are comparable in intensity); the third compares the reduction in intensity of the wild type base at the locus relative to its neighboring peaks with the corresponding intensity ratio in the reference sequence.
- Fig. 9 is an overview flow chart of the method showing the main steps in the analysis of a set of data.
- Fig. 10 is a detailed flow chart of the method showing step 2 of the flow chart in Fig. 9.
- Fig. 11 is a graph of the quality of a file, the black bar BB indicating the position of the exon within the file; three color levels are defined: red for low quality (B1), orange for medium quality (B2) and green for good quality (B3).
- Fig. 12 is a quality histogram of the file of figure 11, showing the numbers of bases above a certain quality within the exon as a function of quality (expressed in 'Phred' quality units, a commonly used logarithmic scale: a quality of 20 corresponds to 99% probability of a correct base call, 30 to 99.9% and 40 to 99.99%).
- Fig. 13 shows tables suitable for the representation of the results. Table 1 gives the structure of the database: it shows the part of the database that contains the mutation information and that will be updated as the method and system are used. Table 2 gives a list of amino acids and codons in various animal species having a similar gene, as well as in different similar genes in humans; this table is suitable for verifying whether a change in amino acid in the test set changes an important part of the gene, i.e. a part that has been conserved across several species. The first row for every gene contains the names of the different genes and species considered, as they might differ from gene to gene.
- Table 3 gives the position of different generally accepted synonymous SNPs constituting the haplotype of each gene, whereby said table can be extended to add new positions from time to time.
Description of a preferred embodiment
- the preferred embodiment consists of a computer program, which runs on a general purpose computer, such as a PC or a work station.
- the program can also be running partly on a PC for the data analysis and partly on a central server for the database handling.
- the software can be made available to the user through magnetic supports (e.g. diskettes), optical supports (e.g. CD-ROM, DVD), paper listing or remotely via an Internet download.
- This system, made up of software, a support and a machine, executes the method described below.
- Step 1 The system provides means for the user to input the data, either through a Graphical User Interface with routine checks of the formats of the data submitted (e.g. valid numbers and ranges for the numeric parameters, correct file formats for the test sets) or, if the user has a large number of test sets to analyze with the same set-up, through batch processes.
- the system provides a means of reading and analyzing a set of files that have been pre-processed by a generic base-caller (e.g. Phred); such base-callers are widely available and described in the literature.
- the system can also receive a set of files issued directly by high throughput sequencing equipment, such as the capillary electrophoresis machines from ABI, Amersham or other manufacturers, or by novel chip technology used for DNA sequencing; these files are then processed by an integrated base caller.
- the system receives files directly from high throughput sequencing equipment, sends them to an outside base caller and, finally, automatically retrieves the base caller files.
- the system furthermore provides a means to sort the submitted test set and associate each test file Fi with one or more of the reference files, as well as each reference file with one or more Fi files in case a part of the gene or loci has been sequenced several times.
- FIGURE 1 for example shows that for a gene that has been sequenced on genomic DNA from a eukaryote, the files contain one or more exons, separated by introns.
- FIGURE 2 shows a different possible configuration where the files submitted are overlapping.
- the files contain the nucleotide sequence of a part of a gene or loci, from a patient, from cancerous cells, from animals or plants.
- the nucleotide sequence is generally represented by an ordered list of bases or nucleotides (A, C, G, T for DNA).
- the files contain a representation of the signal corresponding to the four different bases A, C, G, T (see FIGURE 3). From these raw data files output by a sequencer, a quality signal can be deduced, as well as the sequence of bases.
- this location is a candidate for a possible mutation (see FIGURE 5: the curve C1 shows a possible mutation with respect to curve C0, said mutation being the possible presence of base A where only base G was present in curve C0).
- the user inputs all necessary parameters for the mutation search (see FIGURE 6 for a description of the variables):
- HomoSurfaceCutOff, with a range between 0.1 and 0.9; acceptable values are for example 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8; this parameter enables a differentiation between heterozygous and homozygous mutations
- RatioCutOffValue, with a range between 0.1 and 0.9; acceptable values are for example 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8; this parameter enables a differentiation between normal variations in quality and variations due to heterozygous mutations
- HeteroSurfaceCutOff, with a range between 0.1 and 0.9; acceptable values are for example 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8; this parameter enables a differentiation between heterozygous and homozygous base positions
- units are 'Phred' style quality, measuring the quality on a logarithmic scale (20 corresponding to a probability of 99% of having the correct base call, 30 to a probability of 99.9% and 40 to a probability of 99.99%)
- the selected minimum quality defines a minimum acceptable level to start or to initiate computing steps for analyzing mutations.
- PercentageAllowedUnderMinimumQuality with a range of 0 to 100, acceptable values are for example 1, 5, 10, 20, 30, 40, 50, 60, 70.
- This parameter allows a certain proportion of bases with a quality lower than the minimum required quality level to be kept, so that these low quality bases remain available for further analysis.
- One or more parameters or variables can be adapted or selected by the user. However, advantageously, the method of the invention is provided with minimal value for said parameters or variables.
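The quality scale and the quality gate described above can be sketched as follows. The conversion uses the usual Phred definition (quality = -10 log10 of the error probability), which reproduces the 99%, 99.9% and 99.99% figures quoted in the text. The function names and the `minimum_quality` argument name are assumptions; only `PercentageAllowedUnderMinimumQuality` is named in the description.

```python
from typing import Sequence


def phred_to_accuracy(quality: float) -> float:
    """Probability that a base call with this Phred quality is correct."""
    return 1.0 - 10.0 ** (-quality / 10.0)


def passes_quality_gate(
    qualities: Sequence[float],
    minimum_quality: float,
    percentage_allowed_under_minimum_quality: float,
) -> bool:
    """True if the share of bases below minimum_quality does not exceed
    the allowed percentage (0 to 100)."""
    if not qualities:
        return False
    below = sum(1 for q in qualities if q < minimum_quality)
    return 100.0 * below / len(qualities) <= percentage_allowed_under_minimum_quality


for q in (20, 30, 40):
    print(q, round(phred_to_accuracy(q), 6))
```

Whether the gate applies to a whole file or only to the exon portion is not fixed by this sketch; the caller passes whichever quality values are relevant.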
- Step 2 To search for mutations, i.e. differences between the test file Fi and the reference file, the system compares the bases found in the test file Fi with the corresponding bases in the reference files.
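As an illustration of this comparison, the sketch below first locates the test sequence on the reference using an exact seed match and then lists the positions where the called base differs. Seed-based placement is a stand-in chosen for brevity, not necessarily the alignment used by the system; it ignores insertions and deletions and takes the first seed hit, which can be wrong in repetitive regions. Function names and `seed_len` are hypothetical.

```python
from typing import List, Optional, Tuple


def locate(test: str, reference: str, seed_len: int = 8) -> Optional[int]:
    """Return the offset such that test[i] lines up with reference[i + offset],
    found from the first seed of the test that occurs exactly in the reference."""
    for start in range(0, len(test) - seed_len + 1):
        pos = reference.find(test[start:start + seed_len])
        if pos != -1:
            return pos - start
    return None


def find_mismatches(test: str, reference: str, offset: int) -> List[Tuple[int, str, str]]:
    """List (reference position, reference base, test base) for every difference."""
    differences = []
    for i, base in enumerate(test):
        j = i + offset
        if 0 <= j < len(reference) and base != reference[j]:
            differences.append((j, reference[j], base))
    return differences


reference = "GGGACGTACGTTAGCCATGA"
test = "TCGTACGTTAGC"  # reference[3:15] with A replaced by T at its first base
offset = locate(test, reference)
print(offset, find_mismatches(test, reference, offset))
```

Each reported difference is only a candidate; the signal-based criteria decide whether it is treated as a mutation.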
- Step 3 Once mutations have been found, the system can look up in a list or in a database whether the mutation is known.
- Steps 4-5-6 The system also provides means to recognize whether a file in the test set has been sequenced in forward or reverse orientation (the relation between forward and reverse sequence is a reverse order of the bases and a complement of the bases: A is complement to T and C to G).
- the preferred method for this is to store the reference in both forward and reverse sense, which makes it possible to store information on the signal, which can differ between forward and reverse.
- An alternative is to store the reference files in only one sense (either forward or reverse) and, when comparing test files to reference files, to take the reverse complement of the test file before doing the comparison.
- Yet another alternative is to store the reference files in only one sense (either forward or reverse) and, when comparing test files to reference files, to take the reverse complement of the reference file before doing the comparison.
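The relation between forward and reverse sequences, and a simple way of recognizing orientation against a reference stored in one sense only, can be sketched as below. The orientation test (exact match of a seed taken from the middle of the read) is an assumption chosen for brevity; the function names and `seed_len` are hypothetical.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each base (A<->T, C<->G) and reverse the order."""
    return seq.translate(_COMPLEMENT)[::-1]


def detect_orientation(test: str, forward_reference: str, seed_len: int = 8) -> Optional[str]:
    """Return "forward", "reverse" or None, using a seed from the middle of
    the read, where base calls are usually more reliable than at the ends."""
    mid = len(test) // 2
    seed = test[mid:mid + seed_len]
    if seed in forward_reference:
        return "forward"
    if reverse_complement(seed) in forward_reference:
        return "reverse"
    return None


reference = "GGGACGTACGTTAGCCATGA"
read = reverse_complement(reference[2:18])
print(read, detect_orientation(read, reference))
```

Storing the reference in both senses, as the preferred method does, avoids this conversion and keeps sense-specific signal information.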
- Step 7 If the user has submitted several test files corresponding to a given reference, the system can choose to analyze only one of them, for example the last one submitted, or several of them.
- Step 8 If the user has submitted several test files corresponding to a given reference, the system can also select the file with the best overall quality, i.e. confidence in the base calls.
- the overall quality can be calculated in many ways, for example as the arithmetic mean of the quality of all or a selected part of the bases in the test file submitted. Another example of an overall quality is the minimum quality found in all or a selected part of the bases in the test file submitted.
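The two example measures of overall quality, and the selection of the best file, can be sketched as follows. The representation of a test file as a list of per-base quality values keyed by file name, and the function names, are assumptions of the sketch.

```python
from statistics import mean
from typing import Dict, Sequence


def overall_quality(qualities: Sequence[float], method: str = "mean") -> float:
    """Overall quality as the arithmetic mean or as the minimum base quality."""
    if method == "mean":
        return mean(qualities)
    if method == "min":
        return min(qualities)
    raise ValueError(f"unknown method: {method}")


def select_best_file(files: Dict[str, Sequence[float]], method: str = "mean") -> str:
    """Return the name of the file with the highest overall quality."""
    return max(files, key=lambda name: overall_quality(files[name], method))


runs = {"run1": [40, 40, 40, 12], "run2": [30, 30, 30, 30]}
print(select_best_file(runs, "mean"), select_best_file(runs, "min"))
```

The example shows that the two measures can disagree: one poor base lowers the minimum far more than the mean.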
- Step 9 If the user has submitted several test files corresponding to a given reference, the system can also create a new test file from the submitted ones. This can be done by keeping only the parts of the test files with the better quality. For example if two test files have been submitted, the first one having a better quality than the second one in the first "x" (for example 40, or a number lesser than 40 or greater than 40, such as 20, 30, 50, 60, etc.) corresponding bases, but the second file has a better quality after base x (40 in this example), the system would create a new test file from the first x (40 in the present example) corresponding bases of the test file one and the bases after base x (40 in this example) from test file two.
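The construction of a new test file from two submitted ones can be sketched as below. The sketch assumes both files have already been placed on the same reference coordinates and have equal length, and it chooses the crossover point x that maximizes the summed quality of the result; the text only gives x by example, so this selection rule and the function names are assumptions.

```python
from typing import List, Sequence, Tuple


def best_crossover(q1: Sequence[float], q2: Sequence[float]) -> int:
    """Return x maximizing sum(q1[:x]) + sum(q2[x:]) for equal-length inputs."""
    if len(q1) != len(q2):
        raise ValueError("quality lists must have the same length")
    best_x, best_score = 0, sum(q2)
    score = best_score
    for x in range(1, len(q1) + 1):
        # Moving the crossover one base right swaps q2[x-1] for q1[x-1].
        score += q1[x - 1] - q2[x - 1]
        if score > best_score:
            best_x, best_score = x, score
    return best_x


def splice_files(
    bases1: str, q1: Sequence[float], bases2: str, q2: Sequence[float]
) -> Tuple[str, List[float], int]:
    """Build a new file from the first x bases of file one and the rest of file two."""
    x = best_crossover(q1, q2)
    return bases1[:x] + bases2[x:], list(q1[:x]) + list(q2[x:]), x


print(splice_files("ACGTN", [40, 40, 40, 10, 10], "NNGTA", [10, 10, 10, 40, 40]))
```

A finer-grained variant could pick the better base position by position, which corresponds to recomposing a sequence from single bases rather than from sequence portions.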
- Step 10 If the test file contains an exonic portion, the system will determine, for each mutation found, whether the codon containing the mutation codes for a different amino acid than the corresponding codon in the reference file. To do this, the system uses a lookup table giving the correspondence between codons on the one hand and amino acids and stop codons on the other.
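The codon lookup of Step 10 might look like the sketch below, which builds the standard genetic code as a dictionary and compares the reference codon with the mutated one. The function name and the returned labels are illustrative assumptions; the source only requires a table from codons to amino acids and stop codons.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order;
# "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_codon_change(ref_codon: str, test_codon: str) -> str:
    """Compare the amino acids coded by the reference and test codons."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    test_aa = CODON_TABLE[test_codon.upper()]
    if ref_aa == test_aa:
        return "synonymous"        # same amino acid, no change in the protein
    if test_aa == "*":
        return "stop gained"       # the mutation creates a stop codon
    return f"{ref_aa}->{test_aa}"  # amino acid substitution
```

For example, changing CGA to TGA replaces an arginine codon by a stop codon, while CGA to CGG leaves the amino acid unchanged.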
- Step 11 If the test file contains sequence motifs of functional importance (for example introns), the system will determine for each mutated base a probability of efficient splicing. This can be done, for example, through neural network programs accessible through the internet (e.g. www.fruitfly.org). The system will furthermore compare this probability of efficient splicing with the probability for the corresponding reference sequence motifs.
- Step 12 If the test file contains part or all of the promoter, or part or all of the intergenic region, the system will determine for each mutated base the impact on the transcription efficiency and compare it with the transcription efficiency of the reference base.
- Step 13 The system can then look up in a list or a database, whether the mutation of a base, or the change of an amino acid or the appearance of stop codons have been previously recorded or published.
- Step 14 The system can also look up external public or private databases accessible through the internet/intranet to check whether the mutations found (change in bases, in amino acids or the appearance of stop codons) have been previously recorded or published.
- Step 15 The system can look up a database containing for each record (see FIGURE 13 showing an advantageous record with mutation remarks):
- gene name: commonly accepted name of the gene, for example LDLR
- intergenic sequence: part of the genome between genes
- promoter: part of the genome which controls the transcription of the gene
- exon number: number of the exon if the mutation was found in an exon, for
- intron number: number of the intron if the mutation was found in an intron, for LDLR between 1 and 19
- mutation type: heterozygous, homozygous, frame shift
- base changed to: nature of the mutated base, for example A instead of C, or deletion of G
- base position in the intergenic/promoter/exon/intron: position of the mutated base in the intergenic region, the promoter, the exon or the intron, depending on where the mutation was localized
- amino acid for mutations in exons: name of the mutated amino acid, for example Cys
- amino acid number: position of the amino acid in the gene, for example 56
- codon: nature of the codon (three base or nucleotide letters), for example CGA
- codon number: position of the codon in the gene, for example 33
- graphs: graphical data showing the mutation found (for example FIGURE
- sequencing technology used to detect the mutation: for example capillary electrophoresis or DNA sequencing chip, equipment reference from manufacturer
- source: name of the database where the data record has been found, if public; name and address of the researcher who found this mutation
- haplotype: configuration of synonymous SNP
- hyperlinks with other databases: allows the user to go directly to the source database, if available
- information about conservation through different species and genes of nucleotide or amino acid: allows the user to evaluate the functional impact of the mutation
- general comments: free text area to include comments, such as other data about the patient, for example cholesterol levels, etc.
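The record layout of Step 15 could be modeled, for illustration only, as a Python dataclass. The field names, types and the `find_records` helper are assumptions; the source specifies the content of each field but not a storage format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MutationRecord:
    gene_name: str                        # e.g. "LDLR"
    region: str                           # "intergenic", "promoter", "exon" or "intron"
    region_number: Optional[int]          # exon or intron number, if applicable
    mutation_type: str                    # "heterozygous", "homozygous" or "frame shift"
    base_change: str                      # e.g. "A instead of C" or "deletion of G"
    base_position: int                    # position within the region
    amino_acid: Optional[str] = None      # e.g. "Cys" (exonic mutations only)
    amino_acid_number: Optional[int] = None
    codon: Optional[str] = None           # e.g. "CGA"
    codon_number: Optional[int] = None
    sequencing_technology: str = ""
    source: str = ""
    haplotype: str = ""
    hyperlinks: List[str] = field(default_factory=list)
    comments: str = ""


def find_records(db, gene_name, region, base_position):
    """Return the previously recorded mutations at a given location."""
    return [r for r in db
            if r.gene_name == gene_name
            and r.region == region
            and r.base_position == base_position]
```

An empty result from `find_records` would correspond to a mutation not yet recorded in the database.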
- Step 16 In case some of the test files submitted have been sequenced in reverse sense, the system can reverse the relevant part or the entire files, before analyzing them.
- Step 17 The system can determine for each position of the patient sequence whether a mutation is present and whether the mutation is a heterozygous or a homozygous one.
- Step 18 To find mutations at determined positions in the submitted test files, the sequences of the patient are compared to the corresponding reference sequences base by base: each base in the patient file is compared with the corresponding base in the reference file.
- Step 19 The system determines homozygous mutations by executing the following instructions. First, it compares the second most intense signal with the most intense one at a given position in the test file. The comparison can be done by comparing the intensity of the signal, the surface of the peaks associated with the signal, the height of the peaks associated with the signal, or any other parameter that is a function of the intensity of the signal. If the second most intense signal is not lower than a fraction (called HomoCutOff) of the most intense signal, then the system considers that there is no homozygous mutation at this position.
- HomoCutOff: a fraction of the most intense signal.
- Otherwise, the system checks whether the base corresponding to the most intense signal is different from the corresponding base of the reference sequence. If the bases are the same, the system considers that there is no homozygous mutation at this position. If the bases are different, then the system evaluates the local quality factor. If the local quality is higher than a defined minimum, the system outputs a signal corresponding to a homozygous mutation with a high likelihood; if the local quality is below the defined minimum, the system outputs a signal corresponding to a homozygous mutation with a medium likelihood and warns the user that the quality is below the defined minimum.
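The homozygous test of Step 19 can be sketched as a small decision function. The argument names and return labels are hypothetical, and treating a local quality exactly equal to the minimum as "high" is an assumption, since the text only covers the cases above and below the minimum.

```python
def call_homozygous(top_base, top_signal, second_signal,
                    ref_base, local_quality, homo_cutoff, min_quality):
    """Return "none", "medium" or "high" for one position.

    top_signal / second_signal may be intensities, peak surfaces or
    peak heights, as long as both use the same measure.
    """
    # A strong second signal rules out a homozygous mutation.
    if second_signal >= homo_cutoff * top_signal:
        return "none"
    # The dominant base agrees with the reference: no mutation.
    if top_base == ref_base:
        return "none"
    # Mutation found; its likelihood depends on the local quality.
    return "high" if local_quality >= min_quality else "medium"
```

A caller would warn the user about low quality whenever "medium" is returned.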
- Step 20 The system checks for heterozygous mutations by determining three criteria which will indicate the likelihood of having a heterozygous mutation.
- The probability (or likelihood) of having a heterozygous mutation at a given position is a function of the three criteria; this function can be summarized in a table.
- An alternative would be to define a mathematical function of three variables (the three criteria) which would give a likelihood of mutation as a function of the three criteria.
- Step 21 The three criteria to determine the likelihood of having a heterozygous mutation at a given position are the following. The first is a reduction in quality around the given position, relative to the neighboring positions.
- This can be evaluated, for example, by comparing with a user-specified number, RatioCutOffValue, the ratio of the average (arithmetic mean) of the quality of the given position and its two closest neighbors to the average of the quality of the second and third closest neighbors. Alternatively, a ratio can be computed from the quality of the given position divided by the average of the quality of the second nearest neighbors.
- The second is a second signal (peak) at substantially the same position as the most intense signal, with a significant intensity. This can be evaluated, for example, by comparing the ratio of the second most intense peak intensity to the most intense peak intensity at the given position with a user-specified number, HeteroCutOff. Alternatively, the intensity can be replaced by a function of the intensity, such as peak surface or peak height.
- Step 22 In particular, the third criterion, a change in peak intensity relative to neighboring peaks, can be evaluated by comparing the intensity at the given position with that of three different close neighbors, relative to the same comparison for the corresponding peak and its neighboring peaks in the reference.
- Step 23 An alternative to comparing the change in intensity relative to different neighbors with the intensity of a fixed reference is to allow the user to compare the change with a corresponding sequence of another patient which has been sequenced in the same batch, which makes it possible to check for artifacts due to the sequencing run on a given piece of equipment.
- Step 24 In particular, this third criterion can be done with three close neighbors.
- Step 25 In case several substantially consecutive mutations have been discovered, an additional step checks for frame shifts, i.e. the insertion or deletion of several bases in the test file. As the number of inserted or deleted bases is unknown a priori, the system will edit the test file, using a trial and error method, adding or subtracting a variable number X of bases at the position of the first mutation of the substantially consecutive mutations and comparing the edited test file with the corresponding reference file. In case an edited test file has apparently far fewer mutations than the original test file, the system outputs a signal to inform the user that the frame shift has a high likelihood of having X bases either inserted or deleted.
- The individual steps are as follows:
- The test file is edited by inserting X consecutive blank bases before the first mutated base of the substantially consecutive mutations.
- The edited test file is then compared with the corresponding reference file from the insertion point onwards, i.e. after the last inserted blank base.
- The comparison is done in the following manner: if either the most intense or the second most intense base of the edited test file is equal to the corresponding base in the reference file, the base is said to be not mutated. If the number of mutated bases is significantly reduced for a given X, the method gives a signal corresponding to a frame shift deletion of X bases at the given position. If, for a user-selected range of X (for example 1 to 20), no significant reduction in mutations is detected, the system will try the following step:
- The test file is edited by deleting X consecutive bases starting from the first mutation of the substantially consecutive mutations.
- The edited test file is then compared with the corresponding reference file from the deletion point onwards, i.e. after the last deleted base.
- The comparison is done in the following manner: if either the most intense or the second most intense base of the edited test file is equal to the corresponding base in the reference file, the base is said to be not mutated. If the number of mutated bases is significantly reduced for a given X, the method gives a signal corresponding to a frame shift insertion of X bases at the given position. If, for a user-selected range of X (for example 1 to 20), no significant reduction in mutations is detected, the system informs the user that the mutation configuration could not be analyzed.
- Step 26: Once a frame shift has been detected and analyzed, i.e.
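The trial and error search of Step 25 can be sketched as follows. Several choices here are assumptions rather than part of the source: the test file is a list of (most intense base, second most intense base) tuples, "significantly reduced" is read as the mismatch rate dropping to at most `reduction` times the original rate, and a rate is used instead of a raw count so that shorter comparison windows are not favored.

```python
BLANK = (None, None)  # placeholder for an inserted blank base


def mismatch_rate(test, reference, start):
    """Fraction of positions from `start` onwards where neither the most
    intense nor the second most intense base equals the reference base."""
    n = min(len(test), len(reference))
    if start >= n:
        return None
    mismatches = sum(1 for j in range(start, n) if reference[j] not in test[j])
    return mismatches / (n - start)


def detect_frame_shift(test, reference, first_mutation, max_x=20, reduction=0.5):
    """Return ("deletion", X), ("insertion", X) or None."""
    baseline = mismatch_rate(test, reference, first_mutation)
    if not baseline:
        return None  # nothing to explain
    # Insert X blank bases: a good fit means X bases are missing in the test.
    for x in range(1, max_x + 1):
        edited = test[:first_mutation] + [BLANK] * x + test[first_mutation:]
        rate = mismatch_rate(edited, reference, first_mutation + x)
        if rate is not None and rate <= reduction * baseline:
            return ("deletion", x)
    # Delete X bases: a good fit means X extra bases are present in the test.
    for x in range(1, max_x + 1):
        edited = test[:first_mutation] + test[first_mutation + x:]
        rate = mismatch_rate(edited, reference, first_mutation)
        if rate is not None and rate <= reduction * baseline:
            return ("insertion", x)
    return None
```

On short reads, large values of X leave only a few bases to compare and can produce spurious hits, so the range of X would normally be kept well below the read length.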
- Step 27 To detect the position of the new stop codon, the system translates the mutated allele from the mutation point onwards into amino acids or stop codons and issues a signal to the user when the first stop codon has been detected. When one test file has run through the system, the next one is loaded, until no further test files are available. When a complete set of files has been analyzed, the system sorts the mutations found in decreasing order of probability. The user can then decide which mutations and synonymous SNPs are real and which are artifacts, and inform the system of these decisions; the system then re-summarizes the findings in a final report.
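Locating the new stop codon of Step 27 could be sketched as a scan over in-frame codons. The function name is hypothetical, and the sketch assumes that the mutated allele is supplied as a coding sequence whose first base starts a codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def first_stop_codon(coding_seq: str, start_codon_index: int = 0):
    """Return the 0-based index of the first stop codon at or after
    `start_codon_index`, or None if the reading frame stays open."""
    seq = coding_seq.upper()
    for i in range(start_codon_index, len(seq) // 3):
        if seq[3 * i:3 * i + 3] in STOP_CODONS:
            return i
    return None
```

Passing the codon index of the mutation point as `start_codon_index` restricts the search to the part of the allele affected by the frame shift.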
- Figure 7 shows possible computing steps for determining a possible homozygous mutation at position i.
- The first test operated by the computer is to determine the ratio between the second most intense peak (signal) intensity at position i and the most intense peak (signal) intensity at position i. Consistently with Step 19, if said ratio is not lower than the HomoCutOff value, the computer considers that there is no homozygous mutation at position i and starts to check for a possible heterozygous mutation. If said ratio is lower than said HomoCutOff value, the search for a possible homozygous mutation at position i is continued.
- The computer then determines whether the base of the most intense peak at position i differs from the reference base. If not (NO difference), the computer emits a signal that there is no homozygous mutation at position i and goes on to search for a possible heterozygous mutation.
- If the bases differ, the local quality at position i is compared with the minimum quality required.
- When said local quality is higher than the required quality, the computer emits a signal corresponding to a high likelihood of homozygous mutation at position i, while when said local quality is lower than the required quality, a signal corresponding to a medium likelihood of mutation is emitted by the computer. The computer advantageously emits said signal together with a parameter corresponding to the local quality.
- Figure 8 shows computing steps for searching for a possible heterozygous mutation. Said determination is carried out by evaluating three criteria as "true" or "false" and using a decision table of the mutation probability.
- The criteria are:
- The first criterion for position i is the ratio between the average quality at positions i-2, i-3, i+2, i+3 and the average quality at positions i+1, i and i-1.
- The average quality at the different positions can be computed with or without weight factors (for example to give more weight to one position) and/or with the use of a minimum quality value for each position. If said ratio is greater than or equal to the RatioCutOffValue, the computer gives the "true" value to the first criterion; if not, the "false" value is given by the computer to said first criterion.
- The second criterion is the ratio between the second most intense peak intensity at position i and the most intense peak intensity at position i, compared with the HeteroCutOff. If said ratio is higher than or equal to said HeteroCutOff, a "true" value is given to the second criterion; otherwise a "false" value is given to the second criterion.
- For the third criterion, the reference ratios are calculated from the peak intensities of the reference sequence, while the test ratios are determined from the peak intensities of the sequence to be analyzed.
- The ratios are: peak intensity at position i / peak intensity at position i-2; peak intensity at position i / peak intensity at position i-1; peak intensity at position i / peak intensity at position i+2; peak intensity at position i / peak intensity at position i+1.
- The test ratio and the reference ratio are used to determine a ratio between the test ratio and the reference ratio, and said ratio is compared with a NeighborPeakRatioCutOff.
- the values of the first, second and third criteria are then compared with data of a table, so as to determine a probability of heterozygous mutation.
- The probability of mutation is high when the three criteria are true, and medium when at least two criteria are true or when the third criterion is true. In the other cases, the probability is low.
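The first two criteria and the decision table of Figure 8 can be sketched as below. The function names are hypothetical. The third criterion is passed in as a ready-made boolean, because the text does not state in which direction the ratio of test ratio to reference ratio is compared with the cut-off, and the sketch does not guess it.

```python
from statistics import mean


def quality_criterion(quality, i, ratio_cutoff):
    """First criterion: quality dip around position i (needs 3 <= i < len - 3)."""
    outer = mean([quality[j] for j in (i - 3, i - 2, i + 2, i + 3)])
    inner = mean([quality[j] for j in (i - 1, i, i + 1)])
    return outer / inner >= ratio_cutoff


def second_peak_criterion(top_intensity, second_intensity, hetero_cutoff):
    """Second criterion: a second peak of significant intensity."""
    return second_intensity / top_intensity >= hetero_cutoff


def heterozygous_likelihood(c1: bool, c2: bool, c3: bool) -> str:
    """Decision table combining the three criteria."""
    if c1 and c2 and c3:
        return "high"
    if c3 or (c1 + c2 + c3) >= 2:
        return "medium"
    return "low"
```

Because the third criterion alone already yields "medium", it carries more weight in this table than either of the first two.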
- Figure 10 is a flow chart of step 2 of the method of the invention.
- The method first searches for the first sub-chain and the second sub-chain of a reference file R(k) in the test file F(i), as well as in the reverse test file Fr(i).
- The first step is the search for a possible first sub-chain of R(k) in the test file F(i). If no first sub-chain is found, the second sub-chain of R(k) is searched for in file F(i). If no second sub-chain is found, the computer searches for the first sub-chain in the reverse file Fr(i) (said reverse file being for example determined by the computer or being determined by tests), and if no such first sub-chain is found, the computer searches for a possible second sub-chain in Fr(i).
- The reference file having a sub-chain corresponding to a sub-chain of the test file F(i) is kept in a list (memory), advantageously with mention of the sense (forward or reverse) and/or of the sub-chain found.
- The memory is advantageously associated with a means so that, in case several test files F(i) correspond to one reference file R(k), the test file F(i) with the best quality is kept.
- After checking one test file F(i) against all the reference files R(k), another test file is processed so as to search for possible sub-chains of the reference files.
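The search order of Figure 10 could be sketched as follows. The function name and return format are hypothetical, and the reverse file Fr(i) is modeled here as the plain reversed string, which is an assumption: the text does not say whether the reverse file is also complemented.

```python
def locate_reference(test_seq: str, first_sub: str, second_sub: str):
    """Search the two sub-chains of a reference file in a test file.

    Order: first then second sub-chain in F(i), then first and second
    sub-chain in Fr(i). Returns (sense, sub_chain, offset) for the
    first hit, or None if the reference does not match this test file.
    """
    reverse_seq = test_seq[::-1]  # assumed construction of Fr(i)
    for sense, seq in (("forward", test_seq), ("reverse", reverse_seq)):
        for label, sub in (("first", first_sub), ("second", second_sub)):
            offset = seq.find(sub)
            if offset != -1:
                return sense, label, offset
    return None
```

The returned sense and sub-chain label correspond to the information the text suggests keeping in memory alongside the matching reference file.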
- Figure 11 is a graph showing the quality of a file, in which the black bar BB indicates the position of the exon, while various colors B1, B2, B3 are used for defining the quality level (respectively low level, medium level and high level).
- the number of cumulated bases having a quality lower than a certain quality (expressed in 'Phred' quality unit).
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| AU2003282562A AU2003282562A1 (en) | 2002-08-02 | 2003-07-09 | Method and system for finding mutations in dna sequences and interpreting their consequences |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US40074902P | 2002-08-02 | 2002-08-02 | |
| US60/400,749 | 2002-08-02 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2004015609A2 true WO2004015609A2 (fr) | 2004-02-19 |
| WO2004015609A3 WO2004015609A3 (fr) | 2005-05-19 |
Family
ID=31715698
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/IB2003/003195 Ceased WO2004015609A2 (fr) | 2002-08-02 | 2003-07-09 | Methode et systeme de recherche de mutations dans des sequences d'adn et d'interpretation de leurs consequences |
Country Status (2)
| Country | Link |
|---|---|
| AU (1) | AU2003282562A1 (fr) |
| WO (1) | WO2004015609A2 (fr) |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| GB0008899D0 (en) * | 2000-04-11 | 2000-05-31 | Isis Innovation | DNA analysis |
2003
- 2003-07-09 AU AU2003282562A patent/AU2003282562A1/en not_active Abandoned
- 2003-07-09 WO PCT/IB2003/003195 patent/WO2004015609A2/fr not_active Ceased
Cited By (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2006007648A1 (fr) * | 2004-07-20 | 2006-01-26 | Conexio 4 Pty Ltd | Procede et appareil d'analyse de sequence d'acide nucleique |
| US7617054B2 (en) | 2004-07-20 | 2009-11-10 | Conexio 4 Pty Ltd | Method and apparatus for analysing nucleic acid sequence |
| JP2009527258A (ja) * | 2006-02-24 | 2009-07-30 | テルモ株式会社 | Pfo閉鎖デバイス |
| US10370710B2 (en) | 2011-10-17 | 2019-08-06 | Good Start Genetics, Inc. | Analysis methods |
| US10851414B2 (en) | 2013-10-18 | 2020-12-01 | Good Start Genetics, Inc. | Methods for determining carrier status |
| WO2016025818A1 (fr) * | 2014-08-15 | 2016-02-18 | Good Start Genetics, Inc. | Systèmes et procédés pour une analyse génétique |
| US12386895B2 (en) | 2014-08-15 | 2025-08-12 | Laboratory Corporation Of America Holdings | Systems and methods for genetic analysis |
| US10429399B2 (en) | 2014-09-24 | 2019-10-01 | Good Start Genetics, Inc. | Process control for increased robustness of genetic assays |
| CN113380325A (zh) * | 2021-05-26 | 2021-09-10 | 杭州电子科技大学 | 一种基于密码子突变位点检测氨基酸突变的方法 |
| CN115458052A (zh) * | 2022-08-16 | 2022-12-09 | 珠海横琴铂华医学检验有限公司 | 基于一代测序的基因突变分析方法、设备和存储介质 |
| CN116935959A (zh) * | 2023-04-25 | 2023-10-24 | 山东省农业科学院畜牧兽医研究所 | Sanger基因测序结果快速判读方法、系统及介质 |
| CN117116342A (zh) * | 2023-09-01 | 2023-11-24 | 北京优迅医疗器械有限公司 | 一种校正碱基变异检测结果的方法及其应用 |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2004015609A3 (fr) | 2005-05-19 |
| AU2003282562A1 (en) | 2004-02-25 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AK | Designated states | Kind code of ref document: A2. Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
| | AL | Designated countries for regional patents | Kind code of ref document: A2. Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
| | 122 | Ep: pct application non-entry in european phase | |
| | NENP | Non-entry into the national phase | Ref country code: JP |
| | WWW | Wipo information: withdrawn in national office | Country of ref document: JP |