WO2001053529A9 - RAPID DETERMINATION OF GENE STRUCTURE USING cDNA SEQUENCE - Google Patents

RAPID DETERMINATION OF GENE STRUCTURE USING cDNA SEQUENCE

Info

Publication number
WO2001053529A9
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
primers
cdna
gene
seq
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2001/001461
Other languages
French (fr)
Other versions
WO2001053529A3 (en)
WO2001053529A2 (en)
Inventor
Hans-Ulrich Thomann
Michael S Fitzgerald
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oscient Pharmaceuticals Corp
Original Assignee
Genome Therapeutics Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Genome Therapeutics Corp
Priority to AU29532/01A (AU2953201A)
Priority to EP01942674A (EP1294943A2)
Priority to CA002398683A (CA2398683A1)
Publication of WO2001053529A2
Publication of WO2001053529A9
Publication of WO2001053529A3
Legal status: Ceased

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing

Definitions

  • eukaryotic genes comprise sequences (exons) destined to be part of the mature RNA interrupted by sequences that are not destined to be part of the mature RNA. Such interrupting sequences are known as intervening sequences or introns.
  • the exons comprise coding sequence and 5' regulatory sequence.
  • the combination of coding sequence and introns is transcribed into a primary RNA transcript.
  • Genes also comprise non-coding sequence 5' of the transcribed region; such upstream regions are known as enhancers and promoters.
  • genomic sequence that is not present in the mature RNA product be it mRNA, rRNA or tRNA comprises enhancer and promoter sequences 5' of the translated region as well as introns interspersed within the translated region.
  • mRNA messenger RNA
  • Primary mRNA is processed into mature mRNA by 5' capping, removal of intervening sequences, and addition of a polyA tail on the 3' terminus of the mRNA.
  • the human genome as well as those of most other mammals is in the range of 3 × 10^9 base pairs.
  • the average size of a gene or primary transcript is 16.6 kilobase pairs, of which 2.2 kilobase pairs is the average size of the mature mRNA. Therefore, non-coding regions make up the vast majority of the size of genes (about 87%).
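The "about 87%" figure follows directly from the two averages quoted above. A minimal sketch of that arithmetic, in which the function and constant names are illustrative and not part of the patent:

```python
# Averages quoted in the text, in kilobase pairs.
GENE_KB = 16.6         # average gene / primary transcript
MATURE_MRNA_KB = 2.2   # average mature mRNA


def non_mrna_fraction(gene_kb: float, mrna_kb: float) -> float:
    """Fraction of the gene that does not appear in the mature mRNA."""
    return (gene_kb - mrna_kb) / gene_kb


fraction = non_mrna_fraction(GENE_KB, MATURE_MRNA_KB)
print(f"{fraction:.1%} of the average gene is absent from the mature mRNA")
```

Rounded to a whole percent, the result agrees with the figure given in the text.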
  • the study of allelic variations comprises more than the study of variations within the exons and requires the information present in the genomic version of the gene of interest, information that does not ultimately end up in the mRNA or final RNA product. Typically, this information is available only when the complete sequence of a chromosomal copy of a gene of interest is obtained. Therefore, not all sequence pertinent to gene structure and phenotypic variation is available in cDNA or EST sequence, because these sequences are derived from mature transcribed copies of the genes where introns have been removed.
  • a typical method for obtaining the desired genetic information comprises cloning and sequencing the entire chromosomal copy of a gene of interest. This method is very costly and time consuming and involves sequencing many thousand kilobases of DNA in order to obtain enough sequence coverage to assemble a given gene.
  • the present invention is drawn to a method of determining gene structure, including boundaries between exons and introns of a gene and between 5' or 3' termini of mature RNA transcripts and the adjacent genomic sequence, including intron termini, 5' and 3' untranslated regions (UTR), and promoter and enhancer sequence.
  • gene structure refers to the order of exons and introns in the chromosomal copy of a gene as well as about 50 to about 300 nucleotides of sequence 5' and 3' of each exon terminus.
  • non-exon regions refers to 5' untranscribed regions of the gene, 3' untranscribed regions of the gene and introns.
  • the present invention further provides genomic sequence 5' and 3' of the mature RNA termini, as well as sequence of 5' and 3' ends of introns. Furthermore, the sequence provided herein can be used to obtain additional sequence 5' and 3' of the mature RNA termini as well as additional intron sequence if desired, e.g. using primer walking with sequence obtained by the present method.
  • regions of a chromosomal copy of a gene or fragments thereof are sequenced using a set of primers.
  • the sequence of the mature transcript is known.
  • the primers cover both strands of the cDNA, at evenly spaced or similarly spaced intervals.
  • the present invention provides information necessary to determine gene structure and phenotypic expression without the need to sequence the entire chromosomal copy of the gene or fragment thereof. As a result of the method of the present invention, gene structure can be determined without the need to sequence the entire gene.
  • the present invention is useful, for example, in germ line sequence variation analysis.
  • the method of the present invention is drawn to determining gene structure, where at least some portion of the genomic sequence of the gene of interest is unknown.
  • the method involves sequencing the gene across exon-intron boundaries using evenly spaced primers, or tiled primers.
  • the tiled primers comprise nucleic acids that hybridize to the known cDNA sequence of the gene at about 100 to about 300 base intervals and the gene comprises the template.
  • the present invention is drawn to a method of determining boundaries between at least one exon and at least one non-exon of a gene.
  • the method comprises the steps of conducting one or more sequencing reactions, comprising a template and a primer or set of primers.
  • the template comprises a gene or fragment thereof and the primer or set thereof comprises at least one oligonucleotide, wherein said oligonucleotide hybridizes to the cDNA encoded by said gene or fragment thereof and wherein said cDNA has known sequence.
  • the set of primers of the present invention comprises oligonucleotides that hybridize to the coding and non-coding strand of said cDNA.
  • sequence obtained as described above is compared with the known sequence of said cDNA, thereby determining the boundaries between the sequence corresponding to exons (cDNA) and the sequence corresponding to non-exons, wherein sequence obtained as described above that is not within the sequence of the cDNA is non-exon sequence.
  • the present invention has several advantages.
  • the present invention does not require prior knowledge of "genomic sequence" including boundaries between exon and non-exon sequence, nor knowledge of any sequence within the non-exon regions.
  • the present invention requires much less work, and therefore saves time and money compared with traditional methods of determining gene structure, because the entire chromosomal copy of a gene need not be sequenced.
  • the cost would be at least 20 times more than the method of the present invention.
  • the 150 kb BAC clone contains coding sequence for a 2 kb cDNA
  • the method of the present invention could provide the gene structure from 37 sequencing reactions using 30 primers. This includes 20 primers designed for a first round of sequencing reactions where the primers hybridize at 200 base intervals on both strands of the cDNA.
  • This estimate also includes a 25% failure rate in first round sequencing reactions, such that 5 sequencing reactions must be repeated, as well as a 50% failure of primers, such that 10 new primers must be synthesized and used in sequencing reactions to fill in any gaps.
  • the gene structure can be determined using the method of the present invention in two rounds of sequencing with a total of 25 primers and 25 sequencing reactions.
  • One of ordinary skill in the art can readily determine if and when additional primers need to be designed for additional rounds of sequencing and how to design the additional primers.
  • to sequence the entire 150 kb BAC clone if each sequencing reaction yields 500 bases of sequence, a minimum of 300 sequencing reactions must be conducted with 300 primers. The time involved to sequence the entire BAC clone is also an important factor and is estimated at 2 months in contrast to the estimated two weeks required in the present invention.
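The counts in this comparison can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function and constant names are assumptions, and it reproduces just the figures the text states directly (20 first-round primers, 5 repeated reactions, 10 replacement primers, 30 primers in total, and a minimum of 300 reactions for the BAC). It does not attempt to derive the stated total of 37 sequencing reactions.

```python
import math

# Figures quoted in the text.
CDNA_BASES = 2_000            # 2 kb cDNA
PRIMER_INTERVAL = 200         # bases between primers on each strand
BAC_BASES = 150_000           # 150 kb BAC clone
READ_LENGTH = 500             # bases obtained per sequencing reaction
REACTION_FAILURE_RATE = 0.25  # first-round reactions that must be repeated
PRIMER_FAILURE_RATE = 0.50    # first-round primers that must be replaced


def first_round_primers(cdna_bases: int, interval: int, strands: int = 2) -> int:
    """Primers needed to tile both strands of the cDNA at a fixed interval."""
    return (cdna_bases // interval) * strands


def minimum_full_sequencing_reactions(template_bases: int, read_length: int) -> int:
    """Lower bound on reactions needed to read every base of a template once."""
    return math.ceil(template_bases / read_length)


round1 = first_round_primers(CDNA_BASES, PRIMER_INTERVAL)
repeated_reactions = round(round1 * REACTION_FAILURE_RATE)
replacement_primers = round(round1 * PRIMER_FAILURE_RATE)
total_primers = round1 + replacement_primers
bac_reactions = minimum_full_sequencing_reactions(BAC_BASES, READ_LENGTH)

print(round1, repeated_reactions, replacement_primers, total_primers, bac_reactions)
```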
  • the present invention is also drawn to human cytochrome P450 2C19 sequence. More particularly, the present invention is drawn to SEQ ID NOS: 59, 61, 63, 65, 67, 71, 73, 75, 77, 79, 81, 84, 86, 89 and 91.
  • Figure 1 is a schematic diagram of the present invention.
  • Figure 2 is a schematic diagram of the hybridization pattern of primers of Tables III and IV with the p53 cDNA, SEQ ID NO: 96.
  • Figure 3 is an alignment of primers on the P450 2C19 cDNA, SEQ ID NO: 58.
  • Figure 4 shows a sequence obtained using the P450 2C19 gene as template and the cDNA specific primers according to Example II.
  • Figure 5 is the gene structure of human P450 2C19 as determined by the present invention in the form of a composite sequence, SEQ ID NOS: 59 and 97, where the underlined sequence is novel sequence and the primer hybridization sites and starting ATG are boxed.
  • Figure 6 is a schematic diagram of the human P450 2C19 gene.
  • the term “gene” refers to a contiguous stretch of deoxynucleotides comprising the basic unit of heredity of an organism, encoding a given protein or RNA.
  • the terms “gene” and “genomic DNA” comprise one or more exon or part thereof, one or more intron or part thereof, all or a portion of the 5' untranslated region, and all or a portion of the 3' untranslated region.
  • the term “gene structure” includes the coding regions or exons together with the exon-intron boundaries with at least 50 nucleotides of sequence of all intron termini as well as 5' and 3' UTR.
  • the gene structure as determined by the present invention can also include promoter and enhancer sequences.
  • cDNA refers to complementary DNA of an mRNA molecule.
  • cDNA can represent the complete mRNA or a fragment thereof.
  • RNA product can be mRNA, tRNA, rRNA or other structural RNA.
  • polymorphism is an allelic variation in nucleic acid sequence between two or more samples.
  • polymorphisms can be, for example, restriction fragment length polymorphism (RFLP), a variation in DNA sequence that alters the length of a restriction fragment (Botstein et al., Am. J. Hum. Genet. 32, 314-331 (1980)).
  • RFLP restriction fragment length polymorphism
  • Other polymorphisms include short tandem repeats (STRs) that include tandem di-, tri- and tetra-nucleotide repeated motifs. These tandem repeats are also referred to as variable number tandem repeat (VNTR) polymorphisms.
  • VNTRs have been used in identity and paternity analysis (US 5,075,217; Armour et al, FEBS Lett. 307, 113-115 (1992); Horn et al, WO 91/14003; Jeffreys, EP 370,719), and in a large number of genetic mapping studies.
  • Other polymorphisms include single nucleotide variations between individuals of the same species. Such polymorphisms are far more frequent than RFLPs, STRs and VNTRs.
  • SNP single nucleotide polymorphisms
  • cSNP single nucleotide polymorphisms in protein-coding sequences
  • genes in which polymorphisms within coding sequences give rise to genetic disease include β-globin (sickle cell anemia), apoE4 (Alzheimer's Disease), Factor V Leiden (thrombosis), and CFTR (cystic fibrosis).
  • cSNPs can alter the codon sequence of the gene and therefore specify an alternative amino acid.
  • sequences provide different levels of information regarding the structure of the gene of interest and the variations of the gene sequence that affect phenotype in the organism from which the sequence is derived.
  • the term "variation" or "polymorphism" implies that more than one version of the gene has been sequenced for comparison.
  • Information on allelic variation is sometimes available for chromosomal copies of genes, if more than one example of a chromosomal copy has been sequenced, though often only one version is sequenced.
  • Information on allelic variation for cDNA (and thus only the coding portion of a gene) is also sometimes available if more than one version has been sequenced.
  • this information clearly does not include any information for example from regions upstream or downstream of the mature RNA nor the introns and therefore does not provide complete information of the gene structure nor the expression phenotype of a gene, where expression phenotype includes both expression level and structure of the gene product.
  • information on allelic variation is also likely to be available for EST sequences, as it is possible that more than one example of a given EST has been sequenced.
  • EST sequence does not provide information from, for example, regions upstream or downstream of the mature RNA, nor from the introns.
  • the currently available sequence information does not readily provide complete information on phenotypic allelic variation; where information on phenotypic variation is available, it is incomplete and lacks genetic structure information.
  • Typical methods for determining complete or additional gene structure include generating PCR products based in part on known gene structure (Shiinoki et al, Metabolism, 48:581-584 (1999)) or sequencing PCR products wherein one primer is derived from cDNA sequence and the other primer is derived from Alu repetitive element sequence (Monani and Burgess, Genome Res. 6:1200-1206 (1996)).
  • These methods have the disadvantage of requiring some prior knowledge of the gene structure and the additional step of PCR amplification of portions of the genomic sequence. Therefore, a rapid, cost effective method is needed to determine the useful sequence of a chromosomal copy of a gene of interest, for gene structure determination wherein prior knowledge of the gene structure and/or the sequencing of the entire nontranscribed portions of the gene are not required.
  • the method of the present invention provides gene structure, wherein gene structure includes the coding regions or exons together with the exon-intron boundaries (a point on a line that separates two regions) with at least about 50 nucleotides of sequence of all intron termini determined as well as boundaries between the mature transcript and 5' and 3' UTR and at least about 50 nucleotides of sequence of the 5' and 3' UTR adjacent to the mature transcript.
  • two regions separated by a boundary are adjacent or contiguous in the genomic copy of the gene of interest.
  • the present invention is drawn to a method of determining boundaries between at least one exon and at least one non-exon (where non-exon includes introns as well as sequence 5' and 3' of that which ultimately becomes the mature RNA sequence) region of a gene.
  • boundary therefore, refers to the junction between exon and non-exon sequence.
  • sequence refers to the arrangement of specific nucleotides within the specified polynucleic acid.
  • exon refers to a segment or region of nucleotides within a eukaryotic gene that is retained in the mature RNA transcript such as mRNA, tRNA and rRNA.
  • exons comprise coding sequences that encode part of the final gene product, and regulatory sequences, such as leader sequences.
  • leader sequence refers to nucleotide sequence at the 5' end of a gene of interest that is transcribed but is not part of the final gene product.
  • the method of the present invention comprises the steps of conducting one or more sequencing reactions, comprising a template and a primer or set of primers.
  • the template comprises a gene or fragment thereof (e.g.
  • the primer, or set thereof comprises one or more oligonucleotides, wherein said oligonucleotides hybridize to the cDNA or RNA product of interest, wherein said cDNA or RNA product has known sequence.
  • the primers of the present invention comprise one or a set of oligonucleotides that hybridize to the coding and non-coding strand of said cDNA or to said RNA product.
  • the primers are hybridized to the gene of interest or fragment thereof and used to prime template dependent nucleic acid polymerization. Sequence obtained is compared with the known sequence of said cDNA or RNA product. Sequence obtained that is not within the sequence of the cDNA or RNA product reveals the boundaries between said exons and said non-exon regions and reveals non-exon sequence.
  • One of ordinary skill in the art can readily assemble the sequence and boundary information thus obtained to generate the gene structure of said gene of interest.
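The comparison step described above can be sketched in a few lines. This is a simplified illustration and not the patent's procedure: it assumes an error-free read on the same strand as the cDNA, anchors the start of the read in the cDNA by exact match, and extends until the two sequences diverge. The sequences, the `seed` parameter and the function name are hypothetical. A real analysis would use an alignment tool that tolerates sequencing errors, and would allow for the boundary call shifting by a base or two when the first non-exon bases happen to match the next exon.

```python
def find_boundary(cdna: str, read: str, seed: int = 12):
    """Locate where a genomic read stops matching the known cDNA.

    Returns (cdna_position, exon_part, non_exon_part), where cdna_position is
    the 0-based cDNA index at which the read diverges, or None if the start
    of the read cannot be anchored in the cDNA.
    """
    start = cdna.find(read[:seed])
    if start < 0:
        return None
    matched = 0
    while (matched < len(read)
           and start + matched < len(cdna)
           and read[matched] == cdna[start + matched]):
        matched += 1
    return start + matched, read[:matched], read[matched:]


# Hypothetical toy gene: two exons separated by one intron.
exon1 = "ATGGCTTCCGAAGTCCTGAAGAG"
intron = "GTAAGTCTTTACTAACTTTTTTTTTTTTCCAG"
exon2 = "CTGACCATCGATGGA"
cdna = exon1 + exon2            # known sequence of the mature transcript
gene = exon1 + intron + exon2   # template whose structure is unknown

# Simulated read primed inside exon 1 and running into the intron.
read = gene[5:45]
position, exon_part, non_exon_part = find_boundary(cdna, read)
print(position, exon_part, non_exon_part)
```

In this toy case the read diverges from the cDNA exactly at the end of exon 1, and the remainder of the read is the 5' end of the intron.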
  • cDNA or RNA sequence of interest can be obtained from commercial or public databases, such as GenBank.
  • cDNA or RNA can be obtained by standard laboratory protocols, such as those described in Chapters 7 and 8 in Molecular Cloning, a Laboratory Manual (Sambrook et al, Cold Spring Harbor Laboratory Press, (1989)).
  • One of ordinary skill in the art would readily be able to either construct the necessary cDNA libraries and/or screen libraries for the desired cDNA or RNA using standard laboratory techniques.
  • libraries can be screened using antibodies specific for the encoded protein of interest or oligonucleotide probes that hybridize to the cDNA or RNA of interest as described in Chapter 12 of Sambrook et al.
  • one of ordinary skill in the art can readily obtain a sequence of the cDNA or RNA from commercial sequencing companies, using commercial sequencing apparatus, or by following standard laboratory techniques, such as those provided in Chapter 13 of Sambrook et al.
  • primers suitable for the methods described herein can be designed and produced using techniques well-known to those of skill in the art.
  • the term "primer" refers to an oligonucleotide suitable for the purpose of initiating template dependent nucleic acid synthesis.
  • Said primer can comprise, for example, deoxyribonucleotides.
  • the primers are about 5 to about 50 nucleotides in length.
  • the primer is about 20 nucleotides in length.
  • the primers have a Tm of about 42 to about 55°C. The primers do not have to be exactly complementary to the cDNA, as long as they specifically hybridize to one location of the template to be sequenced.
  • primer picking programs can be used, such as "Oligo 5.0" (MedProbe AS, Norway).
  • the term “set of primers” comprises one or more primers, such that the primers hybridize to a polynucleotide strand of interest.
  • the primers hybridize to the polynucleotide of interest at discrete intervals.
  • the primers hybridize to the polynucleotide of interest at intervals of about 50 to about 500 nucleotides.
  • the primers hybridize at intervals of about 100 to about 300 nucleotides. In still another embodiment, the primers hybridize at intervals of about 100 to about 200 nucleotides. In another embodiment of the present invention, the primers hybridize to said cDNA at evenly spaced intervals. In another embodiment of the present invention, the set of primers hybridize at similarly or evenly spaced intervals on both strands of a double-stranded polynucleotide of interest. Primers that hybridize with said cDNA or RNA product at similarly or evenly spaced intervals are referred to herein as "evenly spaced" or "tiled primers".
  • the primers are designed such that sequence information generated from one primer extends at least until the 5' terminus of the next downstream primer, if there is no intervening sequence. In this way, no intervening sequences are missed by this method.
  • primers could be designed such that they hybridize at about nucleotides 1-20, 120-140, 240-260, 360-380, 480-500, 600-620, 720-740, 840-860 and 960-980 of one strand and bases 1000-980, 880-860, 760-740, 640-620, 520-500, 400-380, 280-260, 160-140 and 40-20 of the opposite strand.
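A tiling pattern like the one just described can be generated programmatically. The sketch below is an illustration under stated assumptions: a 1000-base cDNA, 20-base primers and a 120-base step, using 1-based inclusive coordinates. The function name is hypothetical, and the coordinates it produces agree with the approximate ("about") positions in the text only to within a base or so.

```python
def tiled_primers(cdna_length: int, primer_length: int = 20, interval: int = 120):
    """Return 1-based inclusive primer coordinates for both strands.

    Forward primers are listed as (5' start, 3' end) walking up the cDNA.
    Reverse primers are listed as (5' start, 3' end) in cDNA coordinates,
    walking down from the 3' end, so the start is larger than the end.
    """
    forward = [(start, start + primer_length - 1)
               for start in range(1, cdna_length - primer_length + 2, interval)]
    reverse = [(start, start - primer_length + 1)
               for start in range(cdna_length, primer_length - 1, -interval)]
    return forward, reverse


forward, reverse = tiled_primers(1000)
for coords in forward:
    print("forward", coords)
for coords in reverse:
    print("reverse", coords)
```

With these settings each strand receives nine primers, as in the example above. Reducing the primer count for regions with known or predicted structure amounts to filtering these lists.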
  • One of ordinary skill in the art can optimize the number of primers necessary based on known information about the cDNA. To further reduce costs for example, the number of primers can be reduced. For example, domains of the cDNA that are thought to be typically encoded by one exon or a known number of exons in a given pattern may not require multiple internal primers. In another example, the gene structure of the cDNA may be known for another organism. Therefore, primers can be designed to hybridize near putative boundaries.
  • the primers are used to prime sequencing reactions of at least one template comprising all or a portion of the gene of interest.
  • the present invention can be used with any nucleic acid sequence from eukaryotic, archaebacterial or viral sources wherein regions of said sequence have been processed, e.g. joined together by excising intervening sequences present in the original parent molecule.
  • the primers are designed using the processed molecule and the template is the original parent molecule.
  • the eukaryotic source comprises fungal, plant, mammalian and non-mammalian sources.
  • the gene or fragment thereof to be used as template can be isolated from any tissue, fluid or extract from an organism comprising said polynucleic acid of interest.
  • said polynucleic acid of interest can be derived from libraries in the form of artificial chromosome libraries.
  • libraries contain chromosomal DNA in excess of 100 kilobases in length.
  • libraries can be yeast artificial chromosome (YAC), bacterial artificial chromosome (BAC) or P1 artificial chromosome (PAC) libraries.
  • BAC and PAC libraries are especially useful because these are bacterial plasmid-based vectors that can be easily isolated, manipulated and amplified.
  • Such libraries are well known in the art and commercially available.
  • one of ordinary skill in the art can isolate the template as described in Example 2. Templates of various lengths can be used. Uncloned genomic DNA can be used as template.
  • the template is about 10 to about 500 kilobases in length and about 250 nanograms to 2.5 micrograms is used.
  • the method of the present invention is useful to determine the boundaries between regions of nucleic acid that were separated by intervening sequence wherein said intervening sequence has been removed. For example, cDNA can be analyzed, wherein the boundaries between the exons comprising the cDNA and the introns present in the gene are determined. In addition, the method of the present invention is useful for the determination of boundaries present in genes containing group 1 type introns such as Tetrahymena rRNA, where self-splicing occurs in the presence of guanosine cofactor.
  • the method of the present invention provides sequence extending into the non-exon regions of the gene of interest. In one embodiment, the present invention provides sequence information of the promoter and enhancer upstream of the 5' UTR of the cDNA. In one embodiment, the present invention provides sequence upstream of the 5' most exon wherein the 5' most exon is up to about 500 base pairs before the transcription initiation site. In another embodiment of the present invention, sequence upstream of the transcription initiation site is provided, comprising the promoter of the gene of interest.
  • the sequencing reactions can be conducted simultaneously in a multiplex assay so long as the sequence information can be unambiguously assigned to a given primer.
  • the non-exon regions comprise sequence upstream and downstream of the mature RNA, as well as intron sequence. It is well known in the art that eukaryotic gene structure comprises promoter and enhancer sequences 5' to the coding sequence, followed by a terminator sequence on the 3' side of the coding sequence.
  • eukaryotic genes are transcribed into a primary RNA transcript which comprises a 5' untranslated region (5' UTR) with introns upstream of the start codon or ATG, followed by the coding sequence, interrupted by introns, followed by a stop codon such as TAA, followed by a 3' untranslated region (3' UTR) and ending in a polyA tail.
  • Said primary RNA transcripts are also referred to herein as "pre-mRNA" and as "heterogeneous nuclear RNA or hnRNA".
  • Introns if present, are removed from the 5' UTR and from the coding sequence to generate a mature transcript.
  • the intronic sequences of a gene generally do not contain sequence useful in the removal of introns except for near the 5' and 3' termini (e.g. within 50 bases of the boundary).
  • the 5' and 3' termini of the intron sequences contain the donor and the acceptor sites for splicing or removal of the introns. These sites are known to contain consensus sequences that are required by the splicing machinery of the cell to properly excise the intron sequences in order to generate mature RNA product such as mRNA. Mutations to such consensus sequences prevent the accurate removal of introns. Therefore, not only are the sequences of the exons important, but the sequences of the consensus sequences within the introns are also important.
  • Donor consensus sequence for example, comprises SEQ ID NO: 1, AGGTAAGT, wherein the first two nucleotides, AG, are present within the exon and the last six nucleotides, GTAAGT, are present within the intron.
  • GTAAGT last six nucleotides
  • the 3' terminus of an intron comprises the sequence 12Py NCAGN, wherein 12Py stands for 12 pyrimidine bases and N stands for A, G, C or T and wherein the last nucleotide is present in the exon and the remaining nucleotides are present at the 3' terminus of the intron.
  • the method of the present invention provides this sequence without the need to sequence the entire intron or the entire gene.
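The donor and acceptor consensus sequences described above translate naturally into pattern searches over a read that extends past an exon. The following is a rough sketch under stated assumptions: it matches the consensus exactly, whereas real splice sites often deviate from it, and the test sequence and function names are hypothetical. Positions are 0-based indices of the first base on the downstream side of the boundary.

```python
import re

# Donor: AG at the end of the exon, GTAAGT at the start of the intron.
DONOR = re.compile(r"AGGTAAGT")
# Acceptor: 12 pyrimidines, any base, CAG at the end of the intron,
# then the first base of the exon.
ACCEPTOR = re.compile(r"[CT]{12}[ACGT]CAG[ACGT]")


def donor_boundaries(sequence: str) -> list:
    """Indices of the first intron base for exact donor consensus matches."""
    return [m.start() + 2 for m in DONOR.finditer(sequence)]


def acceptor_boundaries(sequence: str) -> list:
    """Indices of the first exon base for exact acceptor consensus matches."""
    return [m.end() - 1 for m in ACCEPTOR.finditer(sequence)]


# Hypothetical sequence: exon end, a short intron, then the start of the next exon.
upstream_exon_end = "CCAAG"
intron = "GTAAGT" + "ACGATTACTAACGA" + "TTTTCTTTCCTT" + "ACAG"
downstream_exon_start = "GCTGA"
sequence = upstream_exon_end + intron + downstream_exon_start

print(donor_boundaries(sequence), acceptor_boundaries(sequence))
```

Because only about 50 bases on either side of each boundary are needed to cover these motifs, the check can be applied to reads obtained from cDNA-derived primers without sequencing the whole intron.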
  • 18-24 nucleotides upstream of the 3' splice site within the intron comprises a "branch site."
  • This branch site is another consensus site necessary for the proper removal of the intron.
  • this consensus sequence is highly conserved and comprises SEQ ID NO: 2, UACUAAC.
  • other eukaryotic branch site consensus sequences are not highly conserved and comprise a sequence of 7 nucleotides in length according to Table III.
  • the adenosine residue at position 6 in the branch point sequence is required for proper intron removal.
  • the adenine at position 6 is the site at which the lariat between the 5' end of the intron and the internal portion of the intron is formed through a 5'-2' phosphodiester bond.
  • the method of the present invention also provides this sequence without having to sequence the entire intron or the entire gene.
  • the present invention also provides sequence present on the 3' side of the mature RNA of interest.
  • the method of the present invention provides at least about 50 nucleotides of sequence from each primer. Therefore, if the primer hybridizes near the boundary between an exon and non-exon, then at least about 50 nucleotides of non-exon sequence is provided. This sequence is sufficient in length to define genomic consensus sequences that are required for transcription, proper removal of introns and translation of said gene to generate functional gene product.
  • Example 1: p53 In Silico Experiment. p53 was chosen as an in silico test for the present invention.
  • the cDNA of p53 is approximately 1.3 kilobases in length.
  • Primers were designed using software for primer design (Oligo 5.0, MedProbe AS, Oslo, Norway). The parameters used for the software were: 50 mM monovalent salt, and the Tm was chosen to be between 42 and 55°C. Oligonucleotides of 20 bases in length were generated using both the coding and the non-coding strand of p53 cDNA.
  • oligos that were separated on the respective strand of cDNA by about 90 to about 195 nucleotides were chosen for further analysis (Tables IV and V).
  • Oligos were user-defined.
  • the selected primers were aligned on the genomic sequence of p53 as shown in Figure 2.
  • Each of primers 1-5, 7 and 9 from Table IV hybridized completely within an exon.
  • Primers 6 and 8 hybridized at an intron/exon boundary and are therefore not expected to result in a successful sequencing reaction. It can readily be seen by one of ordinary skill in the art that sequencing reactions using these primers and a genomic copy of p53 reveal all intron exon boundaries and all useful intronic sequence.
  • relevant sequence information from the p53 gene is extracted from about 350 bases of sequence information, including exon intron boundaries, enhancer, promoter and intron consensus sequences. When added to the sequence of the cDNA (1.3 kilobases), the complete gene structure and relevant sequence information for phenotypic allelic variations is obtained.
  • Example 2: Boundary Determination of Human Cytochrome P450 2C19. Screening and Isolation of a Bacterial Artificial Chromosome Encoding the Human Cytochrome P450 2C19 Gene (CYP450 2C19 gene).
  • Four members of the Cytochrome P450 2C subfamily are known: CYP450 2C8, '2C9, '2C18 and '2C19 (Ieiri and Higuchi, J. Toxicol. Sci. 23:129-131, (1998)).
  • the CYP450 2C19 gene is flanked by two other members of the CYP450 2C family, CYP450 2C18 and CYP450 2C9 (Gray, et al, Genomics, 28:328-332 (1995)).
  • gene specific primers were designed such that amplicons would be generated from the 5' end, the middle and the 3' end of the coding region (Table VI). For an amplicon from the putative boundary between intron 4 and exon 5, primers were taken as published in the partial gene structure (de Morais, et al, Mol. Pharmacol, 46:594-598 (1994)).
  • the primers were used for primary PCR screening of 48 human BAC DNA pools from Research Genetics (Huntsville, Alabama). PCR reactions were carried out using 100 µM dNTP, 1.5 mM units Amplitaq™ (PE BioSystems, Foster City, California) in a final volume of 14 µl. Cycling conditions were as follows: 94°C for 2 minutes, then 35 cycles of 94°C for 30 seconds, 30 seconds at the appropriate annealing temperature (Tm, Table VI) and 45 seconds at 72°C, followed by a final extension at 72°C for 7 minutes. For each primer, the positive pools from the primary screening were subjected to secondary screening. For secondary screening, each pool was split into 48 samples, which consisted of 10 plate pools, 14 row pools and 24 column pools.
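The cycling conditions above can be written down as a small data structure, which also makes it easy to estimate how long one screening run holds at temperature. This is a sketch only: the dictionary layout and function names are assumptions, the annealing temperature is left as a parameter because it depends on the primer pair (Table VI), and the value of 55 used in the example is a hypothetical placeholder. Ramp times between steps are not included.

```python
def cycling_program(anneal_temp_c: float) -> dict:
    """PCR program from the text as (temperature in °C, hold in seconds) steps."""
    return {
        "initial_denaturation": (94, 120),
        "cycles": 35,
        "cycle_steps": [(94, 30), (anneal_temp_c, 30), (72, 45)],
        "final_extension": (72, 420),
    }


def total_hold_seconds(program: dict) -> int:
    """Sum of all programmed hold times, excluding thermal ramping."""
    per_cycle = sum(seconds for _, seconds in program["cycle_steps"])
    return (program["initial_denaturation"][1]
            + program["cycles"] * per_cycle
            + program["final_extension"][1])


program = cycling_program(anneal_temp_c=55)  # 55 is a placeholder value
seconds = total_hold_seconds(program)
print(f"{seconds} s of programmed hold time ({seconds / 60:.1f} min)")
```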
  • SEQ ID NO: 21 3.
  • SEQ ID NO: 23 5 SEQ ID NO: 2 5
  • BAC-DNA was isolated on a large scale as follows.
  • a single BAC colony was picked and inoculated in a starter culture of 5 ml medium (LB medium containing 12.5 ug/ml chloramphenicol). The culture was shaken vigorously at 37°C until the OD 600 nm read between 1.0-1.5 (6-8 hrs). OD 600 should be maintained at 1.0-2.0; however, if the growth exceeds the limit less pre-culture volume per 500 ml culture can be used.
  • the resuspended bacteria were centrifuged at 4500 x g (GSA rotor at 5100 rpm) for 20 min at 4°C.
  • Steps 4 to 7 were repeated one time.
  • Each bacterial pellet was gently and completely resuspended in 50 ml of ice-cold Qiagen™ Buffer P1 (containing RNAse A (100 ug/ml) as per Qiagen instructions) (Valencia, California) and incubated for 10 min. at room temperature (e.g. 24.0°C).
  • Buffer P2 is the most critical step to keep the E. coli contamination low. Buffer P2 must be quickly and completely distributed throughout the cell suspension after its addition.
  • the bottle was incubated undisturbed at room temperature for 15 minutes. 12. 50 ml ice-cold Buffer P3 was added to each bottle, mixed immediately by gently inverting 4-6 times, and incubated on ice for 30 min.
  • the bottles were centrifuged at 20,000 x g (GSA rotor at 11,000 rpm) for 30 min. at 4°C.
  • the bottles were VERY GENTLY recovered from the centrifuge without disturbing the pellet.
  • the bottles were placed in such a way that they did not move at all while the supernatant was recovered.
  • the supernatant was removed promptly using a 25 ml pipette and transferred to a fresh 250 ml bottle.
  • the supernatant was re-centrifuged at 20,000 x g for 15 min. at 4°C. The supernatant was promptly removed and kept on ice. The total volume was about 150 ml. Note: Filter through cheesecloth to remove cell debris if necessary.
  • Buffer QF Qiagen
  • the eluted DNA (20 ml total) was transferred to a 45 ml centrifuge tube.
  • the tubes were centrifuged immediately at >15,000 x g (SA 600 rotor at 11,000 rpm) for 30 min. at 4°C. The supernatant was carefully discarded by decanting. This was done as soon as the centrifuge came to a stop. Note: The pellet will be BOTH at the bottom of the tube AND as a streak on the tube wall, so be very gentle. The pellet may become detached.
  • the tube was vortexed gently to ensure that most of the DNA was dissolved.
  • the tube was spun for 5 min. to collect the solution to the bottom of the tube.
  • the tube was left at 4°C overnight to allow for the DNA to completely dissolve.
  • the BAC-DNA was directly sequenced as follows, except that 250 nanograms of BAC-DNA template was used instead of 2.5 micrograms of genomic DNA.
  • Genomic DNA should be of high quality (e.g., OD260/280 of 1.7-1.9) and be quantitated accurately, e.g., by fluorometry and by agarose gel electrophoresis. The DNA does not need to be of a certain size.
  • the PCR tubes were capped tightly and then quickly spun to collect all the reagents.
  • In a thermocycler, the following program was run: 1) 95°C for 5 minutes
  • the sample was transferred from the collection tube into a plate format so that it was easier to load them onto the sequencing gel.
  • the samples were dried in a plate vacuum centrifuge, applying medium heat and checked after 30 minutes. High temperature was not used. The sample was completely dry before transfer to sequencing gel.
  • the plate was quickly spun in a plate centrifuge for up to 800g. 4. The plate was thoroughly vortexed with a plate shaker for 5 minutes.
  • Primers were chosen such that they were spaced approximately 150 bp apart. One set of primers was complementary to the non-coding strand and a second set was complementary to the coding strand of the CYP450 2C19 cDNA sequence (lower, L and upper, U respectively).
  • the primers were chosen with the software Oligo 5.0, using the same parameters as described in the p53 in silico experiment.
  • additional primers were chosen manually. All primers are listed in Table VIII.
  • the non-underlined sequence is provided herein for the first time and includes sequence belonging to 5' and 3' untranslated regions or intronic regions.
  • the novel sequences are assigned SEQ ID Nos. as follows.
  • Figure 5 shows the 5' and 3' intron and 5' and 3' untranslated sequences provided by the method of the present invention, assembled with the published cDNA sequence (Romkes et al.). Published sequence is in capital letters, the ATG start codon is boxed, and positions of the primers are boxed. All newly discovered sequence is in lower case and underlined. Missing intron sequence is shown as a string of underlined "n."

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Organic Chemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Biotechnology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention is drawn to a method of identifying boundaries between exon and non-exon regions of genes. In addition to the boundary between said regions, the boundary between sequence on the 5' and 3' termini of the non-exon region is determined. Furthermore, sequence within the non-exon regions, e.g., sequence at the 5' and 3' termini of the non-exon regions sequence upstream and downstream of the coding regions of the gene (i.e. within the 5' and 3' non-translated regions, respectively) is determined. Therefore, as a result of the method of the present invention, gene structure can be determined without the need to sequence the entire gene. The present invention is useful, for example, in germ line sequence variation analysis.

Description

RAPID DETERMINATION OF GENE STRUCTURE USING cDNA SEQUENCE
RELATED APPLICATIONS
This application is a Continuation of U.S. Application No.: 09/488,127, filed January 20, 2000 entitled "Rapid Determination of Gene Structure Using cDNA Sequence," the teachings of which are incorporated herein in their entirety.
BACKGROUND OF THE INVENTION
Typically, eukaryotic genes comprise sequences (exons) destined to be part of the mature RNA interrupted by sequences that are not destined to be part of the mature RNA. Such interrupting sequences are known as intervening sequences or introns. The exons comprise coding sequence and 5' regulatory sequence. The combination of coding sequence and introns is transcribed into a primary RNA transcript. Genes also comprise non-coding sequence 5' of the transcribed region; such upstream regions are known as enhancers and promoters. Thus, genomic sequence that is not present in the mature RNA product, be it mRNA, rRNA or tRNA, comprises enhancer and promoter sequences 5' of the translated region as well as introns interspersed within the translated region. The intervening sequences must be removed to generate mature RNA as a product itself or for translation into protein. For example, generation of messenger RNA (mRNA) requires processing to remove introns. Primary mRNA is processed into mature mRNA by 5' capping, removal of intervening sequences, and addition of a polyA tail on the 3' terminus of the mRNA.
The human genome, as well as those of most other mammals, is in the range of 3 x 10^9 base pairs. The average size of a gene or primary transcript is 16.6 kilobase pairs, of which 2.2 kilobase pairs is the average size of the mature mRNA. Therefore, non-coding regions make up the vast majority of the size of genes (about 87%).
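The percentage quoted above follows directly from the two average sizes. A minimal sketch of that arithmetic, using only the figures given in the text (the function and constant names are illustrative, not part of the disclosure):

```python
# Average sizes quoted in the text, in kilobase pairs.
AVERAGE_GENE_KB = 16.6   # average gene or primary transcript
AVERAGE_MRNA_KB = 2.2    # average mature mRNA


def non_coding_fraction(gene_kb: float, mrna_kb: float) -> float:
    """Fraction of the gene that is absent from the mature mRNA."""
    return (gene_kb - mrna_kb) / gene_kb


fraction = non_coding_fraction(AVERAGE_GENE_KB, AVERAGE_MRNA_KB)
print(f"non-coding share: {fraction:.1%}")  # roughly 87%
```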
The study of allelic variations comprises more than the study of variations within the exons and requires the information present in the genomic version of the gene of interest, information that does not ultimately end up in the mRNA or final RNA product. Typically, this information is available only when the complete sequence of a chromosomal copy of a gene of interest is obtained. Therefore, not all sequence pertinent to gene structure and phenotypic variation is available in cDNA or EST sequence, because these sequences are derived from mature transcribed copies of the genes where introns have been removed.
A typical method for obtaining the desired genetic information comprises cloning and sequencing the entire chromosomal copy of a gene of interest. This method is very costly and time consuming and involves sequencing many thousand kilobases of DNA in order to obtain enough sequence coverage to assemble a given gene.
SUMMARY OF THE INVENTION
The present invention is drawn to a method of determining gene structure including boundaries between exons and introns of a gene and between 5' or 3' termini of mature RNA transcripts and the adjacent genomic sequence, including intron termini 5' and 3' untranslated regions (UTR) and promoter and enhancer sequence. As used herein, the term "gene structure" refers to the order of exons and introns in the chromosomal copy of a gene as well as about 50 to about 300 nucleotides of sequence 5' and 3' of each exon terminus. As used herein, the term "non-exon regions" refers to 5' untranscribed regions of the gene, 3' untranscribed regions of the gene and introns. The present invention further provides genomic sequence 5' and 3' of the mature RNA termini, as well as sequence of 5' and 3' ends of introns. Furthermore, the sequence provided herein can be used to obtain additional sequence 5' and 3' of the mature RNA termini as well as additional intron sequence if desired, e.g. using primer walking with sequence obtained by the present method. In the method of the present invention, regions of a chromosomal copy of a gene or fragments thereof are sequenced using a set of primers. In the method of the present invention the sequence of the mature transcript is known. In one embodiment, the primers cover both strands of the cDNA, at evenly spaced or similarly spaced intervals. The present invention provides information necessary to determine gene structure and phenotypic expression without the need to sequence the entire chromosomal copy of the gene or fragment thereof. As a result of the method of the present invention, gene structure can be determined without the need to sequence the entire gene. The present invention is useful, for example, in germ line sequence variation analysis.
The method of the present invention is drawn to determining gene structure, where at least some portion of the genomic sequence of the gene of interest is unknown. The method involves sequencing the gene across exon-intron boundaries using evenly spaced primers, or tiled primers. The tiled primers comprise nucleic acids that hybridize to the known cDNA sequence of the gene at about 100 to about 300 base intervals and the gene comprises the template.
More specifically, the present invention is drawn to a method of determining boundaries between at least one exon and at least one non-exon of a gene. The method comprises the steps of conducting one or more sequencing reactions, comprising a template and a primer or set of primers. In the method of the present invention, the template comprises a gene or fragment thereof and the primer or set thereof comprises at least one oligonucleotide, wherein said oligonucleotide hybridizes to the cDNA encoded by said gene or fragment thereof and wherein said cDNA has known sequence. The set of primers of the present invention comprises oligonucleotides that hybridize to the coding and non-coding strand of said cDNA. In the method of the present invention, sequence obtained as described above is compared with the known sequence of said cDNA, thereby determining the boundaries between the sequence corresponding to exons (cDNA) and the sequence corresponding to non-exons, wherein sequence obtained as described above that is not within the sequence of the cDNA is non-exon sequence. As described herein, because much of the DNA sequence of a gene is not likely to contain gene structure information or phenotypic allelic variation information, the vast portion need not be sequenced to determine gene structure or even to determine most of sequence of the gene that can affect phenotype. Other than regulatory sequences in the 5' non-transcribed region, such as the enhancer and the promoter, 5', 3' non- translated regions and consensus sequences necessary for correct removal of introns, much of the non-coding sequence does not appear to affect gene expression phenotype. 
Therefore, all that is required for analysis of expression of a functional gene product is the sequence of each exon together with that portion of the intron that encompasses the consensus splice sequences, as well as conserved promoter and terminator sequences required for the minimal regulation of gene expression and stability of the mRNA product.
The present invention has several advantages. The present invention does not require prior knowledge of "genomic sequence" including boundaries between exon and non-exon sequence, nor knowledge of any sequence within the non-exon regions. The present invention requires much less work and therefore saves time and money than traditional methods of determining gene structure because the entire chromosomal copy of a gene need not be sequenced.
For example, using conventional techniques, to determine the gene structure of a gene contained in a 150 kb BAC clone, including a 6 fold sequencing and labor, the cost would be at least 20 times more than the method of the present invention. If the 150 kb BAC clone contains coding sequence for a 2 kb cDNA, for example, the method of the present invention could provide the gene structure from 37 sequencing reactions using 30 primers. This includes 20 primers designed for a first round of sequencing reactions where the primers hybridize at 200 base intervals on both strands of the cDNA. This estimate also includes a 25% failure rate in first round sequencing reactions such that 5 sequencing reactions must be repeated, as well as a 50% failure of primers such that 10 new primers must be synthesized and used in sequencing reactions to fill in any gaps. Thus the gene structure can be determined using the method of the present invention in two rounds of sequencing with a total of 25 primers and 25 sequencing reactions. One of ordinary skill in the art can readily determine if and when additional primers need to be designed for additional rounds of sequencing and how to design the additional primers. On the other hand, to sequence the entire 150 kb BAC clone, if each sequencing reaction yields 500 bases of sequence, a minimum of 300 sequencing reactions must be conducted with 300 primers. The time involved to sequence the entire BAC clone is also an important factor and is estimated at 2 months in contrast to the estimated two weeks required in the present invention.
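Two of the counts in this comparison can be reproduced from the stated parameters: the 20 first-round primers for a 2 kb cDNA and the minimum of 300 reactions for the whole clone. The sketch below is illustrative only. The function names are hypothetical, and the `coverage` parameter is an assumption used to show how redundant sequencing would scale the whole-clone figure; it does not attempt to model repeat reactions or gap-filling primers.

```python
import math


def first_round_primers(cdna_bases: int, interval_bases: int, strands: int = 2) -> int:
    """Primers needed to tile the cDNA at a fixed interval on each strand."""
    return strands * math.ceil(cdna_bases / interval_bases)


def whole_clone_reactions(clone_bases: int, read_bases: int, coverage: int = 1) -> int:
    """Minimum reactions to read the whole clone at a given (assumed) redundancy."""
    return coverage * math.ceil(clone_bases / read_bases)


tiled = first_round_primers(2_000, 200)      # 2 kb cDNA, 200 base intervals, both strands
full = whole_clone_reactions(150_000, 500)   # 150 kb BAC, 500 bases per reaction

print(tiled, full)
```

With these inputs the first value is 20 and the second is 300, matching the figures in the paragraph above.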
The present invention is also drawn to human cytochrome P450 2C19 sequence. More particularly, the present invention is drawn to SEQ ID NOS: 59, 61, 63, 65, 67, 71, 73, 75, 77, 79, 81, 84, 86, 89 and 91.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a schematic diagram of the present invention.
Figure 2 is a schematic diagram of the hybridization pattern of primers of Tables III and IV with the p53 cDNA, SEQ ID NO: 96.
Figure 3 is an alignment of primers on the P450 2C19 cDNA, SEQ ID NO: 58.
Figure 4 shows a sequence obtained using the P450 2C19 gene as template and the cDNA specific primers according to Example II.
Figure 5 is the gene structure of human P450 2C19 as determined by the present invention in the form of a composite sequence, SEQ ID NOS: 59 and 97, where the underlined sequence is novel sequence and the primer hybridization sites and starting ATG are boxed.
Figure 6 is a schematic diagram of the human P450 2C19 gene.
DETAILED DESCRIPTION OF THE INVENTION
As used herein, the term "gene" refers to a contiguous stretch of deoxynucleotides comprising the basic unit of heredity of an organism, encoding a given protein or RNA. As used herein, the terms "gene" and also "genomic DNA" comprise one or more exon or part thereof, one or more intron or part thereof, all or a portion of the 5' untranslated region, and all or a portion of the 3' untranslated region. As used herein, the term "gene structure" includes the coding regions or exons together with the exon-intron boundaries with at least 50 nucleotides of sequence of all intron termini as well as 5' and 3' UTR. The gene structure as determined by the present invention can also include promoter and enhancer sequences.
As used herein, the term "cDNA" refers to complementary DNA of an mRNA molecule. cDNA can represent the complete mRNA or a fragment thereof. RNA product can be mRNA, tRNA, rRNA or other structural RNA.
As used herein, the term "polymorphism" is an allelic variation in nucleic acid sequence between two or more samples. Such polymorphisms can be, for example, restriction fragment length polymorphism (RFLP), a variation in DNA sequence that alters the length of a restriction fragment (Botstein et al., Am. J. Hum. Genet. 32, 314-331 (1980)). Other polymorphisms include short tandem repeats (STRs) that include tandem di-, tri- and tetra-nucleotide repeated motifs. These tandem repeats are also referred to as variable number tandem repeat (VNTR) polymorphisms. VNTRs have been used in identity and paternity analysis (US 5,075,217; Armour et al, FEBS Lett. 307, 113-115 (1992); Horn et al, WO 91/14003; Jeffreys, EP 370,719), and in a large number of genetic mapping studies. Other polymorphisms include single nucleotide variations between individuals of the same species. Such polymorphisms are far more frequent than RFLPs, STRs and VNTRs. Some single nucleotide polymorphisms (SNP) occur in protein-coding sequences (coding sequence SNP (cSNP)), in which case, one of the polymorphic forms may give rise to the expression of a defective or otherwise variant protein and, potentially, a genetic disease. Examples of genes in which polymorphisms within coding sequences give rise to genetic disease include β-globin (sickle cell anemia), apoE4 (Alzheimer's Disease), Factor V Leiden (thrombosis), and CFTR (cystic fibrosis). cSNPs can alter the codon sequence of the gene and therefore specify an alternative amino acid. Such changes are called "missense" when another amino acid is substituted, and "nonsense" when the alternative codon specifies a stop signal in protein translation. When the cSNP does not alter the amino acid specified the cSNP is called "silent". Other single nucleotide polymorphisms occur in noncoding regions. Some of these polymorphisms may also result in defective protein expression (e.g., as a result of defective splicing). 
Other single nucleotide polymorphisms have no phenotypic effects. Genetic information is available or obtainable in several forms comprising genomic sequence (sequence of chromosomal versions of genes), or non-genomic sequence such as cDNA sequence or expressed sequence tag (EST) sequence. These types of sequences provide different levels of information regarding the structure of the gene of interest and the variations of the gene sequence that affect phenotype in the organism from which the sequence is derived. The term "variation" or "polymorphism" implies that more than one version of the gene has been sequenced for comparison. Information on allelic variation is sometimes available for chromosomal copies of genes, if more than one example of a chromosomal copy has been sequenced, though often only one version is sequenced. Information on allelic variation for cDNA (and thus only the coding portion of a gene) is also sometimes available if more than one version has been sequenced. However, this information clearly does not include any information, for example, from regions upstream or downstream of the mature RNA nor the introns and therefore does not provide complete information of the gene structure nor the expression phenotype of a gene, where expression phenotype includes both expression level and structure of the gene product. Finally, information on allelic variation is also likely to be available for EST sequences, as it is possible that more than one example of a given EST has been sequenced. However, like cDNA sequence, EST sequence does not provide information from, for example, regions upstream or downstream of the mature RNA nor the introns. Thus, the currently available sequence information does not readily provide complete information on phenotypic allelic variation or, where phenotypic variation could be available, the information is incomplete and lacks genetic structure information.
Other typical methods for determining complete or additional gene structure include generating PCR products based in part on known gene structure (Shiinoki et al, Metabolism, 48:581-584 (1999)) or sequencing PCR products wherein one primer is derived from cDNA sequence and the other primer is derived from Alu repetitive element sequence (Monani and Burgess, Genome Res. 6:1200-1206 (1996)). These methods have the disadvantage of requiring some prior knowledge of the gene structure and the additional step of PCR amplification of portions of the genomic sequence. Therefore, a rapid, cost effective method is needed to determine the useful sequence of a chromosomal copy of a gene of interest, for gene structure determination wherein prior knowledge of the gene structure and/or the sequencing of the entire nontranscribed portions of the gene are not required.
The method of the present invention provides gene structure, wherein gene structure includes the coding regions or exons together with the exon-intron boundaries (a point on a line that separates two regions) with at least about 50 nucleotides of sequence of all intron termini determined as well as boundaries between the mature transcript and 5' and 3' UTR and at least about 50 nucleotides of sequence of the 5' and 3' UTR adjacent to the mature transcript. For example, in the genomic copy of the gene of interest, two regions separated by a boundary are adjacent or contiguous in the genomic copy of the gene of interest.
The present invention is drawn to a method of determining boundaries between at least one exon and at least one non-exon (where non-exon includes introns as well as sequence 5' and 3' of that which ultimately becomes the mature RNA sequence) region of a gene. The term boundary therefore, refers to the junction between exon and non-exon sequence. The term "sequence" refers to the arrangement of specific nucleotides within the specified polynucleic acid.
As used herein, the term exon refers to a segment or region of nucleotides within a eukaryotic gene that is retained in the mature RNA transcript such as mRNA, tRNA and rRNA. As used herein, exons comprise coding sequences that encode part of the final gene product, and regulatory sequences, such as leader sequences. As used herein, "leader sequence" refers to nucleotide sequence at the 5' end of a gene of interest that is transcribed but is not part of the final gene product. The method of the present invention comprises the steps of conducting one or more sequencing reactions, comprising a template and a primer or set of primers. In the method of the present invention, the template comprises a gene or fragment thereof (e.g. genomic sequence or genomic polynucleotide). The primer, or set thereof, comprises one or more oligonucleotides, wherein said oligonucleotides hybridize to the cDNA or RNA product of interest, wherein said cDNA or RNA product has known sequence. In one embodiment, the primers of the present invention comprise one or a set of oligonucleotides that hybridize to the coding and non-coding strand of said cDNA or to said RNA product.
The primers are hybridized to the gene of interest or fragment thereof and used to prime template dependent nucleic acid polymerization. Sequence obtained is compared with the known sequence of said cDNA or RNA product. Sequence obtained that is not within the sequence of the cDNA or RNA product reveals the boundaries between said exons and said non-exon regions and reveals non-exon sequence. One of ordinary skill in the art can readily assemble the sequence and boundary information thus obtained to generate the gene structure of said gene of interest.
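The comparison step described above can be illustrated with a very small sketch. It assumes a read in the same orientation as the cDNA that begins inside an exon, anchors the start of the read on the cDNA by exact match, and reports the first base at which the read departs from the known cDNA. The function name, the `seed` parameter and the toy sequences are hypothetical; real reads contain base-calling errors and would call for a proper alignment tool. Note also that when the first bases of the non-exon sequence happen to match the following exon, the reported position can lie a few bases past the true boundary.

```python
def find_boundary(read: str, cdna: str, seed: int = 12):
    """Return (cdna_position, read_position) where the read leaves the cDNA.

    Positions are 0-based. Returns None if the whole read lies within the
    cDNA, i.e. no boundary was crossed by this read.
    """
    read, cdna = read.upper(), cdna.upper()
    start = cdna.find(read[:seed])          # anchor the read on the known cDNA
    if start < 0:
        raise ValueError("read does not begin within the known cDNA")
    i = 0
    while i < len(read) and start + i < len(cdna) and read[i] == cdna[start + i]:
        i += 1                              # extend while read and cDNA agree
    if i == len(read):
        return None
    return start + i, i


# Toy example (invented sequences): two exons joined in the cDNA,
# with an intron following the first exon in the genomic template.
exon1 = "ATGGCCAAGTTTGACCTGAAG"
exon2 = "CTTTACGATCC"
intron_start = "GTAAGTCTTT"
cdna = exon1 + exon2
genomic_read = exon1[3:] + intron_start     # read primed 3 bases into exon 1

print(find_boundary(genomic_read, cdna))
```

In this toy case the read departs from the cDNA exactly at the end of the first exon, and everything in the read from that point onward is non-exon sequence.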
Sequence of cDNA or RNA product of interest is obtained by methods well known in the art. For example, cDNA or RNA sequence of interest can be obtained from commercial or public databases, such as GenBank. In addition, cDNA or RNA can be obtained by standard laboratory protocols, such as those described in Chapters 7 and 8 in Molecular Cloning, a Laboratory Manual (Sambrook et al, Cold Spring Harbor Laboratory Press, (1989)). One of ordinary skill in the art would readily be able to either construct the necessary cDNA libraries and/or screen libraries for the desired cDNA or RNA using standard laboratory techniques. For example, libraries can be screened using antibodies specific for the encoded protein of interest or oligonucleotide probes that hybridize to the cDNA or RNA of interest as described in Chapter 12 of Sambrook et al. Furthermore, one of ordinary skill in the art can readily obtain a sequence of the cDNA or RNA from commercial sequencing companies, from commercial sequencing apparatus, or by following standard laboratory techniques, such as those provided in Chapter 13 of Sambrook et al.
Primers suitable for the methods described herein can be designed and produced using techniques well-known to those of skill in the art. As used herein, the term "primer" refers to an oligonucleotide suitable for the purpose of initiating template dependent nucleic acid synthesis. Said primer can comprise, for example, deoxyribonucleotides. In one embodiment of the present invention, the primers are about 5 to about 50 nucleotides in length. In a preferred embodiment, the primer is about 20 nucleotides in length. In another embodiment, the primers have a Tm of about 42 to about 55°C. The primers do not have to be exactly complementary to the cDNA, as long as they specifically hybridize to one location of the template to be sequenced. However, if ambiguous sequence is obtained from a particular primer, one of ordinary skill in the art would readily be able to design a more suitable primer if necessary. To design and pick the primers, "primer picking" programs can be used, such as "Oligo 5.0" (MedProbe AS, Norway). As used herein, the term "set of primers" comprises one or more primers, such that the primers hybridize to a polynucleotide strand of interest. In one embodiment, the primers hybridize to the polynucleotide of interest at discrete intervals. In another embodiment, the primers hybridize to the polynucleotide of interest at intervals of about 50 to about 500 nucleotides. In one embodiment, the primers hybridize at intervals of about 100 to about 300 nucleotides. In still another embodiment, the primers hybridize at intervals of about 100 to about 200 nucleotides. In another embodiment of the present invention, the primers hybridize to said cDNA at evenly spaced intervals. In another embodiment of the present invention, the set of primers hybridize at similarly or evenly spaced intervals on both strands of a double-stranded polynucleotide of interest. 
Primers that hybridize with said cDNA or RNA product at similarly or evenly spaced intervals are referred to herein as "even spaced" or "tiled primers". In one embodiment, the primers are designed such that sequence information generated from one primer extends at least until the 5' terminus of the next downstream primer, if there is no intervening sequence. In this way, no intervening sequences are missed by this method. For example, if the gene structure of a cDNA comprising 1000 base pairs in length is to be analyzed using the method of the present invention and a distance of 100 bases between primers and a primer length of 20 bases, then primers could be designed such that they hybridize at about nucleotides 1-20, 120-140, 240-260, 360-380, 480-500, 600-620, 720-740, 840-860 and 960-980 of one strand and bases 1000-980, 880-860, 760-740, 640-620, 520-500, 400-380, 280-260, 160-140 and 40-20 of the opposite strand. It is understood that one of ordinary skill in the art can readily determine the spacing or interval between primers based on chosen sequencing parameters, polymerase used, hybridization conditions and conditions of sequencing such as buffer concentration and salt concentration as well as primer length and hybridization temperature. Furthermore, one of ordinary skill in the art can ensure primer coverage of the cDNA by routine optimization.
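The tiling scheme in the example above can be sketched as a short routine that lists approximate primer coordinates on both strands. This is an illustration under stated assumptions rather than a primer design tool: the function name and parameters are hypothetical, coordinates are 1-based and inclusive, and sequence-dependent criteria such as Tm or specificity (which a program like Oligo 5.0 would evaluate) are ignored. The coordinates it produces agree with the "about" positions in the text to within one base.

```python
def tiled_primers(cdna_length: int, primer_length: int = 20, gap: int = 100):
    """List (start, end) coordinates of evenly spaced primers on both strands.

    Forward primers are given 5'->3' on one strand; reverse primers mirror
    them on the opposite strand, so their coordinates run from high to low.
    """
    step = primer_length + gap
    forward = [
        (start, start + primer_length - 1)
        for start in range(1, cdna_length - primer_length + 2, step)
    ]
    reverse = [
        (cdna_length - start + 1, cdna_length - end + 1)
        for start, end in forward
    ]
    return forward, reverse


forward, reverse = tiled_primers(1000)
for coords in forward:
    print("forward", coords)
for coords in reverse:
    print("reverse", coords)
```

For a 1000 base cDNA this yields nine primers per strand, the same number as in the worked example.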
One of ordinary skill in the art can optimize the number of primers necessary based on known information about the cDNA. To further reduce costs for example, the number of primers can be reduced. For example, domains of the cDNA that are thought to be typically encoded by one exon or a known number of exons in a given pattern may not require multiple internal primers. In another example, the gene structure of the cDNA may be known for another organism. Therefore, primers can be designed to hybridize near putative boundaries.
As described above, the primers are used to prime sequencing reactions of at least one template comprising all or a portion of the gene of interest. The present invention can be used with any nucleic acid sequence from eukaryotic, archaebacterial or viral sources wherein regions of said sequence have been processed, e.g. joined together by excising intervening sequences present in the original parent molecule. For example, the primers are designed using the processed molecule and the template is the original parent molecule. In one embodiment of the present invention, the eukaryotic source comprises fungal, plant, mammalian and non-mammalian sources. Using standard techniques in the art, the gene or fragment thereof to be used as template can be isolated from any tissue, fluid or extract from an organism comprising said polynucleic acid of interest. In addition, said polynucleic acid of interest can be derived from libraries in the form of artificial chromosome libraries. Such libraries contain chromosomal DNA in excess of 100 kilobases in length. For example, such libraries can be yeast artificial chromosome (YAC), bacterial artificial chromosome (BAC) or P1 artificial chromosome (PAC) libraries. BAC and PAC libraries are especially useful because these are bacterial plasmid-based vectors that can be easily isolated, manipulated and amplified. Such libraries are well known in the art and commercially available. For example, one of ordinary skill in the art can isolate the template as described in Example 2. Templates of various lengths can be used. Uncloned genomic DNA can be used as template. One of ordinary skill in the art can readily determine the optimal conditions for using uncloned genomic DNA. For example, the smaller the size of the genomic DNA or fragment thereof, the higher the signal to noise ratio and less template can be used. 
In one embodiment of the present invention, the template is about 10 to about 500 kilobases in length and about 250 nanograms to 2.5 micrograms is used.
The method of the present invention is useful to determine the boundaries between regions of nucleic acid that were separated by intervening sequence wherein said intervening sequence has been removed. For example, cDNA can be analyzed, wherein the boundaries between the exons comprising the cDNA and the introns present in the gene are determined. In addition, the method of the present invention is useful for the determination of boundaries present in genes containing group 1 type introns such as Tetrahymena rRNA, where self-splicing occurs in the presence of guanosine cofactor.
The method of the present invention provides sequence extending into the non-exon regions of the gene of interest. In one embodiment, the present invention provides sequence information of the promoter and enhancer upstream of the 5' UTR of the cDNA. In one embodiment, the present invention provides sequence upstream of the 5' most exon wherein the 5' most exon is up to about 500 base pairs before the transcription initiation site. In another embodiment of the present invention, sequence upstream of the transcription initiation site is provided, comprising the promoter of the gene of interest.
In one embodiment, about 300 bases of sequence are obtained from each sequencing reaction from each primer. However, it is not necessary to obtain 300 bases of new sequence. In another embodiment, at least about 50 bases of intron sequence at the boundary of the exon and non-exon region is provided. In still another embodiment, the sequencing reactions can be conducted simultaneously in a multiplex assay so long as the sequence information can be unambiguously assigned to a given primer. As described, the non-exon regions comprise sequence upstream and downstream of the mature RNA, as well as intron sequence. It is well known in the art that eukaryotic gene structure comprises promoter and enhancer sequences 5' to the coding sequence, followed by a terminator sequence on the 3' side of the coding sequence. For example, eukaryotic genes are transcribed into a primary RNA transcript which comprises a 5' untranslated region (5' UTR) with introns upstream of the start codon or ATG, followed by the coding sequence, interrupted by introns, followed by a stop codon such as TAA, followed by a 3' untranslated region (3' UTR) and ending in a polyA tail. Said primary RNA transcripts are also referred to herein as "pre-mRNA" and as "heterogeneous nuclear RNA or hnRNA". Introns, if present, are removed from the 5' UTR and from the coding sequence to generate a mature transcript.
The intronic sequences of a gene generally do not contain sequence useful in the removal of introns except near the 5' and 3' termini (e.g. within 50 bases of the boundary). The 5' and 3' termini of the intron sequences contain the donor and the acceptor sites for splicing or removal of the introns. These sites are known to contain consensus sequences that are required by the splicing machinery of the cell to properly excise the intron sequences in order to generate mature RNA product such as mRNA. Mutations to such consensus sequences prevent the accurate removal of introns. Therefore, not only are the sequences of the exons important, but the consensus sequences within the introns are also important. The donor consensus sequence, for example, comprises SEQ ID NO: 1, AGGTAAGT, wherein the first two nucleotides, AG, are present within the exon and the last six nucleotides, GTAAGT, are present within the intron. In particular, as shown in Table I, the first two bases of the consensus sequence in the intron (GT) are absolutely conserved.
The method of the present invention would provide this sequence information as well as sequence further into the intron, without the need to sequence the entire intron or the entire gene.
Table I*
Sequence derived from:   Exon (positions 1-2), Intron (positions 3-8)
Position    1    2    3    4    5    6    7    8
Base        A    G    G    T    A    A    G    T
Frequency   .64  .73  1    1    .62  .68  .84  .63
*Information derived from Figure 30.3 In: Genes VI, by B. Lewin, Oxford University Press (1997).
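As an illustration of how the donor consensus of Table I might be used on newly obtained sequence, the following is a minimal sketch, not part of the described method. It assumes a plain DNA string as input; the function name `donor_candidates` and the simple match count are hypothetical choices, and the Table I frequencies are not used for weighting.

```python
# Donor consensus from Table I: positions 1-2 lie in the exon, 3-8 in the intron.
DONOR_CONSENSUS = "AGGTAAGT"

def donor_candidates(seq):
    """Return (first_intron_index, matches) for every 8-base window
    whose positions 3-4 carry the invariant GT of Table I."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 7):
        window = seq[i:i + 8]
        if window[2:4] != "GT":
            continue  # the GT dinucleotide is treated as mandatory
        matches = sum(a == b for a, b in zip(window, DONOR_CONSENSUS))
        hits.append((i + 2, matches))  # i + 2 is the 0-based index of the G of GT
    return hits

print(donor_candidates("CCCAGGTAAGTCCC"))
```

For the hypothetical fragment above, the single candidate is reported at index 5 with all 8 consensus positions matching.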
Furthermore, as shown in Table II, the 3' terminus of an intron, otherwise known as the acceptor site, comprises the sequence 12Py NCAGN, wherein 12Py stands for 12 pyrimidine bases and N stands for A, G, C or T and wherein the last nucleotide is present in the exon and the remaining nucleotides are present at the 3' terminus of the intron.
Table II*
Sequence derived from:   Intron (positions 1-16), Exon (position 17)
Position    1-12   13   14   15   16   17
Base        12Py   N    C    A    G    N
Frequency   -      -    .65  1    1    -
*Information derived from Figure 30.3 In: Genes VI, by B. Lewin, Oxford University Press (1997).
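The acceptor pattern of Table II (12 pyrimidines, any base, CAG, then the first exon base) can be written as a regular expression. The sketch below is illustrative only; the function name `acceptor_sites` is hypothetical, and the strict requirement of 12 uninterrupted pyrimidines and of C at position 14 (frequency .65 in Table II) is a simplifying assumption that real acceptor sites will often violate.

```python
import re

# Lookahead form so that overlapping candidate sites are all reported.
ACCEPTOR = re.compile(r"(?=[CT]{12}[ACGT]CAG[ACGT])")

def acceptor_sites(seq):
    """Return the 0-based index of the first exon base after each
    strict match to the Table II pattern 12Py-N-C-A-G-N."""
    return [m.start() + 16 for m in ACCEPTOR.finditer(seq.upper())]

print(acceptor_sites("GATTCTCTCTTTCCACAGGATT"))
```

In the hypothetical fragment, the pyrimidine tract starts at index 2 and the exon begins at index 18.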
The method of the present invention provides this sequence without the need to sequence the entire intron or the entire gene. In addition, 18-24 nucleotides upstream of the 3' splice site within the intron comprises a "branch site." This branch site is another consensus site necessary for the proper removal of the intron. In yeast, this consensus sequence is highly conserved and comprises SEQ ID NO: 2, UACUAAC. However, other eukaryotic branch site consensus sequences are not highly conserved and comprise a sequence of 7 nucleotides in length according to Table III.
Table III*
Branch Point Sequence
Position    1    2    3    4    5    6    7
Base        Py   N    Py   Py   Pu   A    Py
Frequency   .80  -    .80  .87  .75  -    .95
*Information derived from Figure 30.3 In: Genes VI, by B. Lewin, Oxford University Press (1997). Py = pyrimidine, Pu = purine.
The adenosine residue at position 6 in the branch point sequence is required for proper intron removal. The adenine at position 6 is the site at which the lariat between the 5' end of the intron and the internal portion of the intron is formed through a 5'-2' phosphodiester bond. The method of the present invention also provides this sequence without having to sequence the entire intron or the entire gene. The present invention also provides sequence present on the 3' side of the mature RNA of interest. The method of the present invention provides at least about 50 nucleotides of sequence from each primer. Therefore, if the primer hybridizes near the boundary between an exon and non-exon, then at least about 50 nucleotides of non-exon sequence is provided. This sequence is sufficient in length to define genomic consensus sequences that are required for transcription, proper removal of introns and translation of said gene to generate functional gene product.
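A branch point search along the lines of Table III could be sketched as follows. This is an assumption-laden illustration: the function name `branch_sites` is hypothetical, the sequence is given in DNA letters, and the 18-24 nucleotide window is measured from the branch adenine (position 6) to the 3' end of the intron, which is one possible reading of the distance stated above.

```python
import re

# Py N Py Py Pu A Py, as in Table III (DNA alphabet).
BRANCH = re.compile(r"(?=([CT][ACGT][CT][CT][AG]A[CT]))")

def branch_sites(intron, min_up=18, max_up=24):
    """Return (start, motif) for each motif whose branch A lies
    min_up..max_up nucleotides upstream of the intron's 3' end."""
    intron = intron.upper()
    hits = []
    for m in BRANCH.finditer(intron):
        a_index = m.start() + 5             # position 6 of the motif
        distance = len(intron) - a_index    # assumed distance convention
        if min_up <= distance <= max_up:
            hits.append((m.start(), m.group(1)))
    return hits

intron = "GTAAGTGGGG" + "TACTAAC" + "T" * 14 + "GCAG"  # hypothetical intron
print(branch_sites(intron))
```

The hypothetical intron carries the yeast-type motif TACTAAC with its branch adenine 20 nucleotides from the 3' end, so it falls inside the window.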
Example 1: p53 In Silico Experiment
p53 was chosen as an in silico test for the present invention. The cDNA of p53 is approximately 1.3 kilobases in length. Primers were designed using software for primer design (Oligo 5.0, MedProbe AS, Oslo, Norway). The parameters used for the software were: 50 mM monovalent salt and a Tm between 42 and 55°C. Oligonucleotides of 20 bases in length were generated using both the coding and the non-coding strand of p53 cDNA.
Of the complete set of oligos described above and without regard to the genomic structure of p53 (a blind selection based only on hybridization parameters and distance between primers), oligos that were separated on the respective strand of cDNA by about 90 to about 195 nucleotides were chosen for further analysis (Tables IV and V).
Table IV: Primers for Upper Strand
Figure imgf000017_0001
Table V: Primers for Lower Strand
Figure imgf000018_0001
Oligos were user-defined.
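The selection of evenly spaced primers on both strands of a cDNA can be sketched as below. This is a simplified stand-in, not the Oligo 5.0 procedure: it places 20-base sites at a fixed spacing and ignores the Tm and salt parameters. The function names and the default spacing of 150 bases are assumptions chosen from within the about 90 to about 195 nucleotide range described above.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def spaced_primers(cdna, length=20, spacing=150):
    """Return (start, sense_primer, antisense_primer) tuples at fixed
    intervals along the cDNA; start is a 0-based cDNA coordinate."""
    cdna = cdna.upper()
    primers = []
    for start in range(0, len(cdna) - length + 1, spacing):
        site = cdna[start:start + length]
        primers.append((start, site, reverse_complement(site)))
    return primers

# Hypothetical 400-base cDNA; yields sites at 0, 150 and 300.
for start, sense, antisense in spaced_primers("ACGT" * 100):
    print(start, sense, antisense)
```

In practice each candidate site would still need to pass the hybridization criteria, which is why some positions are shifted or replaced in a second round.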
The selected primers were aligned on the genomic sequence of p53 as shown in Figure 2. Each of primers 1-5, 7 and 9 from Table IV hybridized completely within an exon. Primers 6 and 8 hybridized at an intron/exon boundary and are therefore not expected to result in a successful sequencing reaction. It can readily be seen by one of ordinary skill in the art that sequencing reactions using these primers and a genomic copy of p53 reveal all intron-exon boundaries and all useful intronic sequence. Thus, relevant sequence information from the p53 gene is extracted from about 350 bases of sequence information, including exon-intron boundaries, enhancer, promoter and intron consensus sequences. When added to the sequence of the cDNA (1.3 kilobases), the complete gene structure and relevant sequence information for phenotypic allelic variations is obtained. This complete information is obtained from about 1650 bases of sequence rather than the entire sequence of the p53 gene (over 20 kilobases).
Example 2: Boundary Determination of Human Cytochrome P450 2C19
Screening and Isolation of a Bacterial Artificial Chromosome encoding the Human Cytochrome P450 2C19 Gene (CYP450 2C19 gene).
Four members of the Cytochrome P450 2C subfamily are known: CYP450 2C8, 2C9, 2C18 and 2C19 (Ieiri and Higuchi, J. Toxicol. Sci. 23:129-131, (1998)). The CYP450 2C19 gene is flanked by two other members of the CYP450 2C family, CYP450 2C18 and CYP450 2C9 (Gray, et al, Genomics, 28:328-332 (1995)). In order to ensure that a BAC containing the entire CYP450 2C19 gene was isolated, gene specific primers were designed such that amplicons would be generated from the 5' end, the middle and the 3' end of the coding region (Table VI). For an amplicon from the putative boundary between intron 4 and exon 5, primers were taken as published in the partial gene structure (de Morais, et al, Mol. Pharmacol, 46:594-598 (1994)).
The primers were used for primary PCR screening of 48 human BAC DNA pools from Research Genetics (Huntsville, Alabama). PCR reactions were carried out using 100 μM dNTP, 1.5 mM units Amplitaq™ (PE BioSystems, Foster City, California) in a final volume of 14 μl. Cycling conditions were as follows: 94°C for 2 minutes, then 35 cycles of 94°C for 30 seconds, 30 seconds at the appropriate annealing temperature (Tm, Table VI) and 45 seconds at 72°C, followed by a final extension at 72°C for 7 minutes. For each primer, the positive pools from the primary screening were subjected to secondary screening. For a secondary screening, each pool was split into 48 samples, which consisted of 10 plate pools, 14 row pools and 24 column pools. Based on the secondary screen, eight clones were identified as potentially containing the human CYP450 2C19 gene. These clones were streaked on agar plates containing 12.5 μg/ml chloramphenicol and incubated at 37°C for 48 hours.
Table VI: Primer pairs used to screen for a BAC Clone harboring the full length CYP450 2C19 Gene
Figure imgf000020_0001
1. SEQ ID NO: 21 3. SEQ ID NO: 23 5. SEQ ID NO: 25
2. SEQ ID NO: 22 4. SEQ ID NO: 24 6. SEQ ID NO: 26
For each clone, two colonies were picked using a sterile pipet tip and suspended in 50 μl sterile double-distilled water. Suspended colonies (5 μl/reaction) were screened for 3 PCR amplicons as described above. Only 3 colonies were positive for all 3 amplicons. Of these, the clone with the plate address 421-N-11 was used as template for "rapid gene structure determination" as described below.
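The secondary screen, with its plate, row and column pools, identifies a clone by the intersection of the positive pools in each dimension. The sketch below illustrates that bookkeeping only. The plate-row-column address format is an assumption modeled on the clone address given in this example, and the function name `candidate_clones` is hypothetical.

```python
from itertools import product

def candidate_clones(plates, rows, columns):
    """All plate-row-column addresses consistent with the positive
    plate, row and column pools of one secondary screen."""
    return [f"{p}-{r}-{c}" for p, r, c in product(plates, rows, columns)]

# One positive pool per dimension gives a unique address.
print(candidate_clones([421], ["N"], [11]))
# Two positive plate pools leave two candidates to be re-tested.
print(candidate_clones([421, 425], ["N"], [11]))
```

When more than one pool is positive in a dimension, the candidates are ambiguous, which is consistent with re-testing individual colonies by PCR.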
Rapid Determination of CYP450 2C19 Gene Structure
BAC-DNA was isolated on a large scale as follows.
1. A single BAC colony was picked and inoculated in a starter culture of 5 ml medium (LB medium containing 12.5 ug/ml chloramphenicol). The culture was shaken vigorously at 37°C until the OD600nm read between 1.0-1.5 (6-8 hrs). OD600 should be maintained at 1.0-2.0; however, if the growth exceeds the limit less pre-culture volume per 500 ml culture can be used.
2. 2.5-5.0 ml of the starter culture was inoculated into 500 ml of LB-chloramphenicol and grown at 37°C for 14-16 hrs with vigorous shaking (horizontal gyrator platform at ~250 rpm).
3. The 500 ml culture was poured into a 1 liter centrifuge bottle, and harvested by centrifugation at 4500 x g (GSA rotor at 5100 rpm) for 20 min. at 4°C.
4. The supernatant was discarded by decanting it into a waste beaker.
5. The bottle was kept inverted and drained on a paper towel.
6. Each bacterial pellet was gently and completely resuspended in 100 ml of ice-cold 10 mM Tris (pH 8.0) and transferred to a 250 ml centrifuge bottle.
7. The resuspended bacteria were centrifuged at 4500 x g (GSA rotor at 5100 rpm) for 20 min at 4°C.
8. Steps 4 to 7 were repeated one time.
9. Each bacterial pellet was gently and completely resuspended in 50 ml of ice-cold Qiagen™ Buffer P1 (containing RNAse A (100 ug/ml) as per Qiagen instructions) (Valencia, California) and incubated for 10 min. at room temperature (e.g. 24.0°C).
Note: Addition of Buffer P2 is the most critical step to keep the E. coli contamination low. Buffer P2 must be quickly and completely distributed throughout the cell suspension after its addition.
10. 50 ml of Buffer P2 was added and the bottle capped and very slowly inverted 4-6 times, NOT vortexed or shaken vigorously.
11. The bottle was incubated undisturbed at room temperature for 15 minutes.
12. 50 ml ice-cold Buffer P3 was added to each bottle, mixed immediately by gently inverting 4-6 times, and incubated on ice for 30 min.
13. The bottles were centrifuged at 20,000 x g (GSA rotor at 11,000 rpm) for 30 min. at 4°C.
14. The bottles were VERY GENTLY recovered from the centrifuge without disturbing the pellet. The bottles were placed in such a way that they did not move at all while the supernatant was recovered. The supernatant was removed promptly using a 25 ml pipette and transferred to a fresh 250 ml bottle.
15. The supernatant was re-centrifuged at 20,000 x g for 15 min. at 4°C. The supernatant was promptly removed and kept on ice. The total volume was about 150 ml. Note: Filter through cheesecloth to remove cell debris if necessary.
16. Two QIAGEN-tip 500™ columns were equilibrated by applying 20 ml Buffer QBT, (see manufacturer's instructions), and allowed to empty by gravity.
17. 75 ml of supernatant (half of the supernatant from step 14) was applied to each QIAGEN-tip 500 column and allowed to enter the resin by gravity flow.
18. Each QIAGEN-tip 500 column was washed with the 3 x 25 ml Buffer QC (see manufacturer's instructions).
19. Buffer QF (Qiagen) was pre-warmed in a water bath at 65°C and the DNA was eluted from the Qiagen-tip with 4 successive aliquots of 5 ml of pre-warmed Buffer QF for a total volume of 20 ml per tip.
20. The eluted DNA (20 ml total) was transferred to a 45 ml centrifuge tube.
21. 14 ml of room temperature isopropanol was added to each 20 ml DNA solution. The contents were mixed by inverting gently several times.
22. The tubes were centrifuged immediately at >15,000 x g (SA 600 rotor at 11,000 rpm) for 30 min. at 4°C. The supernatant was carefully discarded by decanting. This was done as soon as the centrifuge came to a stop. Note: The pellet will be BOTH at the bottom of the tube AND as a streak on the tube wall, so be very gentle. The pellet may become detached.
23. The excess ethanol was quickly drained by inverting over blotting paper for 5 min. The outside of the tube was marked to indicate the location of the pellet.
24. To each DNA pellet was added 2 ml of -20°C 70% ethanol. The tube was gently rotated such that the ethanol washed the bottom and the walls of the tube. The tubes were allowed to stand undisturbed for 5 min. at RT, then centrifuged at 10,000 x g for 10 min.
25. The supernatant was promptly (i.e. as soon as the centrifuge comes to a stop) discarded by decanting it from the tube. The tube was allowed to stand inverted on blotting paper for 5 min. to drain.
26. The pellet was air-dried for 10 min. by laying the open tube on its side.
27. The DNA was dissolved by adding 150 μl of Tris EDTA (TE) (pH 8.0). The tube was vortexed gently to ensure that most of the DNA was dissolved. The tube was spun for 5 min. to collect the solution to the bottom of the tube. The tube was left at 4°C overnight to allow for the DNA to completely dissolve.
The BAC-DNA was directly sequenced as follows, except that 250 nanograms of BAC-DNA template was used instead of 2.5 micrograms of genomic DNA.
DIRECT GENOMIC DNA SEQUENCING
Setting up sequencing reactions.
Note: Table VII below allows for DNA concentrations of 0.31-2.5 μg/μl. DNA with a lower concentration may be used if the primer is used at a higher concentration.
1. Sequencing mix (1 rxn) was prepared for each primer reaction as follows:
TABLE VII
Figure imgf000023_0001
1 Genomic DNA should be of high quality (e.g., OD260/280 ~1.7-1.9) and be quantitated accurately, e.g., by fluorometry and by agarose gel electrophoresis. The DNA does not need to be of a certain size.
2 BigDye™ Mix: Perkin Elmer/ABI BigDye™ Terminator Ready Reaction Mix with AmpliTaq® FS, Part number 4303151 (for 5000 rxn kit).
3 5X CSA buffer, Perkin Elmer Part number 361058C.
4 Sequencing primers, non-vector: Conc. = 3.2 uM. Primer picking programs may be used with the following requirements:
Tm = 50°C
%GC = 50%
Oligo Length 18-22 bp
Avoid designing A or T at the first two bases at either the 3' or the 5' ends.
Avoid more than two consecutive G or C at either the 3' or the 5' ends.
2. The sequencing cocktail was vortexed to ensure it was well-mixed.
3. The PCR tubes were capped tightly and then quickly spun to collect all the reagents.
4. In a thermocycler, the following program was run:
1) 95°C for 5 minutes
2) 95°C for 30 seconds
3) 55°C for 20 seconds
4) 65°C for 4 minutes
5) Go to 2 for an additional 99 times
6) 4°C hold
7) End
5. At the end of the sequencing reaction, the plate was taken out of the thermocycler and quickly spun to collect the contents.
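The length of the cycle sequencing program in step 4 follows directly from its step times. The short calculation below is only a planning aid under the assumption that temperature ramp times are ignored; the constant and function names are hypothetical.

```python
INITIAL_DENATURATION_S = 5 * 60        # 95°C for 5 minutes
CYCLE_STEPS_S = [30, 20, 4 * 60]       # 95°C, 55°C and 65°C steps
CYCLES = 1 + 99                        # first pass plus 99 repeats

def program_seconds():
    """Total programmed hold time in seconds, excluding ramping."""
    return INITIAL_DENATURATION_S + CYCLES * sum(CYCLE_STEPS_S)

total = program_seconds()
print(total, "seconds, about", round(total / 3600, 1), "hours")
```

With 100 cycles of 290 seconds plus the initial 5 minutes, the programmed time is 29,300 seconds, a little over 8 hours, which fits running the reactions overnight while the cleanup columns hydrate.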
Post-Sequencing Cleanup. One column (Centri-Sep™, Princeton Separations #CS901) was used per sequencing reaction, and the columns were non-reusable.
Column hydration (Step 1, below) was performed on the same day as the sequencing reactions. The next day, the sequencing reactions were processed through the pre-hydrated columns.
Column Hydration
1. The column was tapped gently to ensure that the dry gel had settled to the bottom of the spin column.
2. The top column cap was removed and the column reconstituted by adding 0.8 ml of nuclease-free water. The column end stopper was left in place so that column could stand up by itself.
3. The column cap was replaced and the gel was hydrated. It is important to hydrate all of the dry gel. Very effective mixing and hydration was accomplished by vigorous agitation on a vortex mixer. The column was incubated at least 30 minutes at room temperature before being stored in a refrigerator overnight.
4. Next day, the column was removed from the refrigerator and allowed to warm to room temperature before continuing with this procedure.
Removal of Interstitial Fluid
1. Trapped air bubbles were removed from the column gel by vigorously tapping the column, allowing the air bubbles to rise to the surface. Particular attention was paid to the bottom of the column. The gel was allowed to settle. After the gel had settled and was free of bubbles, the column cap was removed, and then the column end stopper was removed from the bottom. Note: The order of removing caps prevents further bubble formation.
2. Excess column liquid was allowed to drain by gravity into a 2-ml wash tube while in the microtube rack. If the liquid does not begin to flow through the filtered end of the column in a reasonable time (about 1 minute), give it a quick spin in a tabletop microcentrifuge at up to 750g. After the column stopped draining, the water was discarded and the column was put back into the same wash tube.
Note: While using the tabletop microcentrifuge, it is important to be aware of the orientation of the columns through all subsequent steps in this procedure. Place an orientation mark on the top rim of the spin column and keep it pointing to the outside of the rotor at all times.
3. The column and its wash tube were placed in the centrifuge. The columns and wash tubes were spun in the microcentrifuge at 750g for 2 minutes.
4. After the centrifugation, drops of water at the end of the column were blotted dry. The wash tubes and the interstitial fluid were discarded. The gel material was not allowed to dry excessively. The samples were processed within the next 2-3 minutes.
Sample processing
1. The entire reaction mixture was transferred to the top of the gel. The sample was carefully dispensed directly onto the center of the gel bed at the top of the column, without disturbing the gel surface. Note: Do not contact the sides of the column with the reaction mixture or the sample pipet tip (because the sample may bypass going into the pores of the gel).
2. The marked columns were placed into the labeled 1.5-ml sample collection tube. Proper column orientation was maintained. The column and collection tube were spun in the tabletop microcentrifuge at 750g for 2 minutes.
3. The samples were transferred from the collection tubes into a plate format so that it was easier to load them onto the sequencing gel. The samples were dried in a plate vacuum centrifuge, applying medium heat, and checked after 30 minutes. High temperature was not used. Each sample was completely dry before transfer to the sequencing gel.
LOADING SEQUENCING GELS
1. Each well was resuspended with 1.5 μl of loading dye.
2. The plate was sealed.
3. The plate was quickly spun in a plate centrifuge at up to 800g.
4. The plate was thoroughly vortexed with a plate shaker for 5 minutes.
5. The plate was quickly spun again.
6. The samples were denatured in a thermocycler or heat block for 2 minutes at 90°C.
7. The samples were vortexed AGAIN for 1-2 minutes to ensure resuspension.
8. The plate was quickly spun again.
9. The samples were stored on ice until they were ready to be loaded.
10. 1.3 μl of the sample was loaded onto an ABI (P.E. BioSystems, Foster City, California) sequencer, and sequence data were collected following the manufacturer's instructions.
Primers were chosen such that they were spaced approximately 150 bp apart. One set of primers was complementary to the non-coding strand and a second set was complementary to the coding strand of the CYP450 2C19 cDNA sequence (lower, L and upper, U respectively). For the first round of sequencing, the primers were chosen with the software Oligo 5.0, using the same parameters as described in the p53 in silico experiment. For the second sequencing round, where primers chosen by Oligo 5.0 failed and where cDNA sequence lacked coverage from the first round of sequencing, additional primers were chosen manually. All primers are listed in Table VIII. Hybridization of the primers is shown in Figure 3, which shows CYP450 2C19 cDNA sequence (AC #M61854, Romkes et al. Biochemistry 32:1390 (1993), SEQ ID NO: 58). The ATG start codon is underlined. In Figure 3, primers used for direct sequencing using the BAC-DNA template described above are shown above and below the cDNA depending on whether they hybridize the coding or non-coding strand, respectively. Primers with an asterisk yielded usable sequence. The coding region is shown in capital letters. SEQ ID NOS: of the primers are provided in Table VIII.
Table VII
Figure imgf000027_0001
Figure imgf000028_0001
Table VIII: Names and sequence of all primers used for direct BAC-DNA sequencing. Primers for the first round have a suffix of 1-22 and 24 in parentheses, whereas primers for the second round have a suffix of 2° 2-2° 9 in parentheses. The second round primers were positioned slightly off those priming positions which yielded no readable sequence during the first round of sequencing. Failed sequencing was mostly due to primer design (*) in purine-rich regions of the cDNA sequence. However, the primers 947L and 950U hybridized to an exon-intron boundary (#) and thus did not yield any sequence. The primers 948U and 948L of the second round solved that problem.
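The primer-picking requirements listed with Table VII, together with the purine-rich failure mode noted here, lend themselves to a simple pre-check before primers are ordered. The following is a hedged sketch: the function name `primer_report`, the 10% tolerance around 50% GC, and the reading of the two end rules (no A or T in the two terminal bases, no run of three G or C at either end) are assumptions, and Tm is not computed.

```python
def primer_report(primer, gc_tolerance=0.1):
    """Check a primer against the length, %GC and end-base guidelines
    and report its purine (A+G) fraction."""
    p = primer.upper()
    gc = sum(b in "GC" for b in p) / len(p)
    return {
        "length_ok": 18 <= len(p) <= 22,
        "gc_ok": abs(gc - 0.5) <= gc_tolerance,
        # assumed reading: neither of the two terminal bases is A or T
        "no_at_at_ends": not any(b in "AT" for b in p[:2] + p[-2:]),
        # assumed reading: no three consecutive G/C at either end
        "no_gc_run_at_ends": not any(
            all(b in "GC" for b in end) for end in (p[:3], p[-3:])
        ),
        "purine_fraction": sum(b in "AG" for b in p) / len(p),
    }

print(primer_report("GCATGACTGATCAGTCATGC"))  # hypothetical 20-mer
```

A high purine fraction would flag a candidate for repositioning, in the same spirit as the second-round primers that were shifted slightly off the failed priming positions.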
RESULTS
Of 23 primers designed and used during the first round of direct sequencing using BAC DNA template, 15 yielded good sequence. Using the information of the first round direct sequencing, 8 primers were designed for a second round. Six of these primers gave excellent sequence information. Overall 31 primers were used to decipher the gene structure with a total success rate of 65%.
Sequence Alignments and Gene Structure Building:
Sequences obtained using the direct primers from both rounds of sequencing were compared to the published CYP450 2C19 cDNA sequence. This was done by using the program "Bestfit" in the GCG-analysis program software package (Genetics Computer Group, Madison WI 53711 USA) and refined by manual editing.
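The comparison of a genomic read with the cDNA can be illustrated with a much simpler stand-in for the Bestfit alignment: locate the start of the read in the cDNA, extend the exact match, and treat the first divergence as the exon boundary. The sketch assumes an error-free read on the coding strand that begins inside an exon; the function name `split_at_boundary` and the 12-base seed are hypothetical.

```python
def split_at_boundary(read, cdna, seed=12):
    """Split a genomic read into (cDNA-matching part, exon-adjacent part).
    Returns None if the first `seed` bases are not found in the cDNA."""
    read, cdna = read.upper(), cdna.upper()
    start = cdna.find(read[:seed])
    if start < 0:
        return None
    n = 0
    while n < len(read) and start + n < len(cdna) and read[n] == cdna[start + n]:
        n += 1
    return read[:n], read[n:]

# Hypothetical data: two exons joined in the cDNA, and a genomic read
# running from exon 1 into the intron that follows it.
exon1, exon2 = "ATGGCCATTGTAATGGGCAG", "CTGAAAGGGTGCCCGATAGT"
read = exon1[4:] + "GTAAGTTTCC"
print(split_at_boundary(read, exon1 + exon2))
```

Because the first intron bases can match the next exon by chance, an exact extension may overshoot the true boundary by a base or two, so checking the split against the donor and acceptor consensus sequences, and manual editing as described above, remain necessary.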
Using the method of the present invention, all exon-intron boundaries of the CYP450 2C19 gene encoded on BAC 421-N-11 were resolved. As the primers were chosen to cover the entire cDNA, it is reasonably certain that no introns were missed. An area between primers 617L (09) and 680U (10) was not covered by direct sequencing. However, de Morais et al. (1994) have published the adjacent intron boundaries 5' of exon 5. The published sequences were added to the gene structure of human CYP450 2C19 to give SEQ ID NO: 59 (see Figure 5). The sequence obtained from 19 of the primers is shown in Figure 4, SEQ ID NOS. 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 83, 85, 87, 88, 90, 92 and 94, respectively. The non-underlined sequence is provided herein for the first time and includes sequence belonging to 5' and 3' untranslated regions or intronic regions. The novel sequences are assigned SEQ ID Nos. as follows.
Table IX
Figure imgf000029_0001
Figure imgf000030_0001
Using the method of the present invention, at least 200 bp of previously unknown intron sequences adjacent to each exon end are provided, as shown in Figure 4. Figure 5 shows the 5' and 3' intron and 5' and 3' untranslated sequences provided by the method of the present invention, assembled with the published cDNA sequence (Romkes et al.). Published sequence is in capital letters, the ATG start codon is boxed, and positions of the primers are boxed. All newly discovered sequences are in lower case and underlined. Missing intron sequence is shown as a string of underlined "n."
A combination of all newly derived 5' and 3' sequence and intron sequence with the already published cDNA sequence is shown below (Figure 6). In summary, the coding sequence of human CYP450 2C19 is disrupted by 8 introns. A total of about 6,700 bp of previously unknown sequence was added to the 1,746 bp of published cDNA using the method of the present invention, including more than 600 bp of previously unknown 5' sequence, which should harbor the transcriptional control elements. Thus, all intron-exon boundaries and consensus splice sequences, relevant gene structure information, is provided by the method of the present invention without having to sequence the entire gene.
While this invention has been particularly shown and described with references to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention encompassed by the appended claims.

Claims

What is claimed is:
1. A method of determining gene structure when genomic sequence is unknown comprising sequencing the gene across exon-intron boundaries using evenly spaced primers, wherein the primers comprise nucleic acids that hybridize to cDNA sequence of the gene at about 100 to about 300 base intervals and the gene comprises the template, wherein cDNA sequence of said gene is known.
2. The method of Claim 1 further comprising sequencing the gene beyond known 5' and 3' regions to determine 5' and 3' untranslated region sequence.
3. The method of Claim 1, wherein the primers cover both strands of the gene.
4. A method of determining structure of a gene from corresponding cDNA where the cDNA sequence is known, comprising sequencing the gene wherein the genomic DNA is the template and the primers comprise evenly spaced primers obtained from the cDNA sequence.
5. The method of Claim 4, wherein the primers comprise nucleic acid about 5 to about 50 nucleotides long and wherein said primers hybridize to the cDNA at intervals of about 100 to about 300 nucleotides.
6. The method of Claim 5, wherein the primers cover both strands of the cDNA.
7. A method of identifying boundaries between at least one exon and at least one non-exon region of a gene, comprising the steps of: a) conducting a plurality of sequencing reactions wherein each reaction comprises a template and primer, wherein the template comprises a gene or fragment thereof and wherein the primers hybridize to coding and non-coding strands of cDNA of said gene, wherein said cDNA has known sequence; and b) determining the boundaries between said exons and said non-exon regions.
8. The method of Claim 7, wherein said primers hybridize to said cDNA at about 300 base intervals.
9. The method of Claim 8, wherein said primers hybridize to a coding and non- coding strand of said cDNA at about 200 base intervals.
10. The method of Claim 9, wherein said primers hybridize to a coding and non- coding strand of said cDNA at about 100 base intervals.
11. The method of Claim 7, wherein the gene is from eukaryotic, archaebacterial or viral sources.
12. The method of Claim 9, wherein said eukaryotic source includes, fungal, plant, mammalian and non-mammalian sources.
13. The method of Claim 7, wherein step a) is repeated with additional primers.
14. A method of determining exon adjacent sequence, comprising: a) contacting a genomic sequence of a cDNA of interest with primers, wherein the cDNA has known sequence and wherein the primers hybridize to coding and non-coding strands of said cDNA, under conditions suitable for said primers to hybridize to said genomic sequence; b) conducting template dependent sequencing of said genomic sequence of interest using said hybridized primers; c) comparing the sequence obtained in b) with the sequence of said cDNA, wherein sequence of b) that is not found in the sequence of said cDNA is exon adjacent sequence.
15. The method of Claim 14, wherein steps a), b) and c) are repeated with additional primers.
16. The method of Claim 14, wherein the primers hybridize said cDNA at about 300 base intervals.
17. The method of Claim 16, wherein said primers hybridize to a coding and non- coding strand of said cDNA at about 200 base intervals.
18. The method of Claim 17, wherein said primers hybridize to a coding and non- coding strand of said cDNA at about 100 base intervals.
19. A method of identifying the exon-intron boundaries and 5' and 3' untranslated regions of a gene wherein all or a portion of the genomic sequence is unknown and the cDNA sequence is known, comprising the steps of: a) designing primers based on the corresponding cDNA sequence of the gene, wherein the primers comprise about 5 to about 50 nucleotides and wherein said primers hybridize the cDNA at evenly spaced intervals of about 100 to about 300 nucleotides; b) sequencing the gene using the gene as a template and the evenly spaced primers designed in step a); c) analyzing the sequences obtained in step b) using a sequence alignment program to determine newly obtained sequences and to further determine regions where the cDNA and the newly obtained sequences differ, wherein the exon-intron boundaries and 5' and 3' untranslated regions comprise regions where the cDNA and newly obtained sequences differ.
20. The method of Claim 19, wherein step a) the primers are evenly spaced along both strands of the cDNA.
21. The method of Claim 19, wherein steps a), b) and c) are repeated with additional primers.
22. A composition comprising isolated polynucleic acid selected from the group consisting of: SEQ ID NO: 61, SEQ ID NO: 63, SEQ ID NO: 65, SEQ ID NO: 67, SEQ ID NO: 69, SEQ ID NO: 71, SEQ ID NO: 73, SEQ ID NO: 75, SEQ ID NO: 77, SEQ ID NO: 79, SEQ ID NO: 81, SEQ ID NO: 84, SEQ ID NO: 86, SEQ ID NO: 89 and SEQ ID NO: 91.
23. A composition comprising isolated polynucleic acid comprising SEQ ID NO: 59.
PCT/US2001/001461 2000-01-20 2001-01-17 RAPID DETERMINATION OF GENE STRUCTURE USING cDNA SEQUENCE Ceased WO2001053529A2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
AU29532/01A AU2953201A (en) 2000-01-20 2001-01-17 Rapid determination of gene structure using cdna sequence
EP01942674A EP1294943A2 (en) 2000-01-20 2001-01-17 RAPID DETERMINATION OF GENE STRUCTURE USING cDNA SEQUENCE
CA002398683A CA2398683A1 (en) 2000-01-20 2001-01-17 Rapid determination of gene structure using cdna sequence

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US48812700A 2000-01-20 2000-01-20
US09/488,127 2000-01-20

Publications (3)

Publication Number Publication Date
WO2001053529A2 (en) 2001-07-26
WO2001053529A9 (en) 2002-10-24
WO2001053529A3 (en) 2003-01-16

Family

ID=23938426

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/001461 Ceased WO2001053529A2 (en) 2000-01-20 2001-01-17 RAPID DETERMINATION OF GENE STRUCTURE USING cDNA SEQUENCE

Country Status (4)

Country Link
EP (1) EP1294943A2 (en)
AU (1) AU2953201A (en)
CA (1) CA2398683A1 (en)
WO (1) WO2001053529A2 (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2091102C (en) * 1992-03-06 2009-05-26 John R. Ii Wetterau Microsomal triglyceride transfer protein
US5707863A (en) * 1993-02-25 1998-01-13 General Hospital Corporation Tumor suppressor gene merlin
US5578493A (en) * 1993-09-01 1996-11-26 The Trustees Of Columbia University In The City Of New York Wilson's disease gene
US5858661A (en) * 1995-05-16 1999-01-12 Ramot-University Authority For Applied Research And Industrial Development Ataxia-telangiectasia gene and its genomic organization
WO1999009147A1 (en) * 1997-08-13 1999-02-25 Icos Corporation Truncated platelet-activating factor acetylhydrolase
WO2000001816A1 (en) * 1998-07-02 2000-01-13 Imperial Cancer Research Technology Limited TUMOUR SUPPRESSOR GENE DBCCR1 AT 9q32-33

Also Published As

Publication number Publication date
AU2953201A (en) 2001-07-31
WO2001053529A3 (en) 2003-01-16
CA2398683A1 (en) 2001-07-26
WO2001053529A2 (en) 2001-07-26
EP1294943A2 (en) 2003-03-26

Similar Documents

Publication Publication Date Title
US8501459B2 (en) Test probes, common oligonucleotide chips, nucleic acid detection method, and their uses
EP1448793B1 (en) Annealing control primer and its uses
JP2006528482A (en) Method for reverse transcription and / or amplification of nucleic acid
US20030175749A1 (en) Annealing control primer and its uses
KR100649165B1 (en) Annealing Control Primer and Its Uses
Men et al. Sanger DNA sequencing
Arcot et al. High-resolution cartography of recently integrated human chromosome 19-specific Alu fossils
Osanai et al. Essential motifs in the 3′ untranslated region required for retrotransposition and the precise start of reverse transcription in non-long-terminal-repeat retrotransposon SART1
US6312913B1 (en) Method for isolating and characterizing nucleic acid sequences
Miller et al. Whole blood RNA offers a rapid, comprehensive approach to genetic diagnosis of cardiovascular diseases
JP2001512694A (en) Method and kit for determining HLA class I type of DNA
WO2001053529A9 (en) RAPID DETERMINATION OF GENE STRUCTURE USING cDNA SEQUENCE
Gonen et al. High throughput fluorescent CE-SSCP SNP genotyping
US8110357B2 (en) Method for detecting an individual who is afflicted with or a carrier for Van Buchem's disease
CN101343667A (en) A kind of aquatic animal SNP marker screening method
WO2001062966A2 (en) Methods for characterizing polymorphisms
CN109554462B (en) PCR primer group, kit, amplification system and detection method of gene CYP11B1 exon
US20070190535A1 (en) Size fractionation of nucleic acid samples
JP2008079604A (en) Primer set, probe set, method and kit for predicting alcohol resolution and drunkenness tolerance
CN1809637A (en) Method of isolating nucleic acid and, for nucleic acid isolation, kit and apparatus
CN108753990B (en) Whole-genome microsatellite marker of Charybdis feriatus, screening method and application
EP1896608A1 (en) Il10 snp associated with acute rejection
AU2007201538A1 (en) Methods for identification of Alport Syndrome
Sellas et al. Isolation and characterization of 10 tetranucleotide microsatellite loci in an enigmatic East African bird, the spot-throat (Modulatrix stictigula).
WO2004065573A2 (en) Novel high throughput method of generating and purifying labeled crna targets for gene expression analysis

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 EP: the EPO has been informed by WIPO that EP was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101)
WWE WIPO information: entry into national phase

Ref document number: 2398683

Country of ref document: CA

WWE WIPO information: entry into national phase

Ref document number: 29532/01

Country of ref document: AU

WWE WIPO information: entry into national phase

Ref document number: 520370

Country of ref document: NZ

WWE WIPO information: entry into national phase

Ref document number: 2001942674

Country of ref document: EP

AK Designated states

Kind code of ref document: C2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: C2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

COP Corrected version of pamphlet

Free format text: PAGES 1/18-18/18, DRAWINGS, REPLACED BY NEW PAGES 1/19-19/19; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWP WIPO information: published in national office

Ref document number: 2001942674

Country of ref document: EP

WWW WIPO information: withdrawn in national office

Ref document number: 2001942674

Country of ref document: EP

NENP Non-entry into the national phase in:

Ref country code: JP