AU2953201A

AU2953201A - Rapid determination of gene structure using cdna sequence

Info

Publication number: AU2953201A
Application number: AU29532/01A
Authority: AU
Inventors: Michael S. Fitzgerald; Hans-Ulrich Thomann
Original assignee: Genome Therapeutics Corp
Current assignee: Oscient Pharmaceuticals Corp
Priority date: 2000-01-20
Filing date: 2001-01-17
Publication date: 2001-07-31
Also published as: CA2398683A1; EP1294943A2; WO2001053529A2; WO2001053529A9; WO2001053529A3

Description

WO 01/53529 PCT/US01/01461

RAPID DETERMINATION OF GENE STRUCTURE USING cDNA SEQUENCE

RELATED APPLICATIONS

This application is a continuation of U.S. Application No. 09/488,127, filed January 20, 2000, entitled "Rapid Determination of Gene Structure Using cDNA Sequence," the teachings of which are incorporated herein in their entirety.

BACKGROUND OF THE INVENTION

Typically, eukaryotic genes comprise sequences (exons) destined to be part of the mature RNA interrupted by sequences that are not destined to be part of the mature RNA. Such interrupting sequences are known as intervening sequences or introns. The exons comprise coding sequence and 5' regulatory sequence. The combination of coding sequence and introns is transcribed into a primary RNA transcript. Genes also comprise non-coding sequence 5' of the transcribed region; such upstream regions are known as enhancers and promoters. Thus, genomic sequence that is not present in the mature RNA product, be it mRNA, rRNA or tRNA, comprises enhancer and promoter sequences 5' of the translated region as well as introns interspersed within the translated region. The intervening sequences must be removed to generate mature RNA as a product itself or for translation into protein. For example, generation of messenger RNA (mRNA) requires processing to remove introns. Primary mRNA is processed into mature mRNA by 5' capping, removal of intervening sequences, and addition of a polyA tail on the 3' terminus of the mRNA.

The human genome, as well as those of most other mammals, is in the range of 3 x 10^9 base pairs. The average size of a gene or primary transcript is 16.6 kilobase pairs, of which 2.2 kilobase pairs is the average size of the mature mRNA.

Therefore, non-coding regions make up the vast majority of the size of genes (about 87%, since only 2.2 of the 16.6 kilobase pairs remain in the mature mRNA).

The study of allelic variations comprises more than the study of variations within the exons and requires the information present in the genomic version of the gene of interest, information that does not ultimately end up in the mRNA or final RNA product. Typically, this information is available only when the complete sequence of a chromosomal copy of a gene of interest is obtained. Therefore, not all sequence pertinent to gene structure and phenotypic variation is available in cDNA or EST sequence, because these sequences are derived from mature transcribed copies of the genes where introns have been removed.

A typical method for obtaining the desired genetic information comprises cloning and sequencing the entire chromosomal copy of a gene of interest. This method is very costly and time consuming and involves sequencing many thousand kilobases of DNA in order to obtain enough sequence coverage to assemble a given gene.

SUMMARY OF THE INVENTION

The present invention is drawn to a method of determining gene structure, including boundaries between exons and introns of a gene and between 5' or 3' termini of mature RNA transcripts and the adjacent genomic sequence, including intron termini, 5' and 3' untranslated regions (UTR), and promoter and enhancer sequence. As used herein, the term "gene structure" refers to the order of exons and introns in the chromosomal copy of a gene as well as about 50 to about 300 nucleotides of sequence 5' and 3' of each exon terminus. As used herein, the term "non-exon regions" refers to 5' untranscribed regions of the gene, 3' untranscribed regions of the gene and introns. The present invention further provides genomic sequence 5' and 3' of the mature RNA termini, as well as sequence of 5' and 3' ends of introns.
Furthermore, the sequence provided herein can be used to obtain additional sequence 5' and 3' of the mature RNA termini, as well as additional intron sequence if desired, e.g., using primer walking with sequence obtained by the present method.

In the method of the present invention, regions of a chromosomal copy of a gene or fragments thereof are sequenced using a set of primers. In the method of the present invention the sequence of the mature transcript is known. In one embodiment, the primers cover both strands of the cDNA, at evenly spaced or similarly spaced intervals. The present invention provides information necessary to determine gene structure and phenotypic expression without the need to sequence the entire chromosomal copy of the gene or fragment thereof. As a result of the method of the present invention, gene structure can be determined without the need to sequence the entire gene. The present invention is useful, for example, in germ line sequence variation analysis.

The method of the present invention is drawn to determining gene structure, where at least some portion of the genomic sequence of the gene of interest is unknown. The method involves sequencing the gene across exon-intron boundaries using evenly spaced primers, or tiled primers. The tiled primers comprise nucleic acids that hybridize to the known cDNA sequence of the gene at about 100 to about 300 base intervals, and the gene comprises the template.

More specifically, the present invention is drawn to a method of determining boundaries between at least one exon and at least one non-exon of a gene. The method comprises the steps of conducting one or more sequencing reactions, comprising a template and a primer or set of primers. In the method of the present invention, the template comprises a gene or fragment thereof and the primer or set thereof comprises at least one oligonucleotide, wherein said oligonucleotide hybridizes to the cDNA encoded by said gene or fragment thereof and wherein said cDNA has known sequence. The set of primers of the present invention comprises oligonucleotides that hybridize to the coding and non-coding strand of said cDNA.
In the method of the present invention, sequence obtained as described above is compared with the known sequence of said cDNA, thereby determining the boundaries between the sequence corresponding to exons (cDNA) and the sequence corresponding to non-exons, wherein sequence obtained as described above that is not within the sequence of the cDNA is non-exon sequence.
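The comparison step described above can be illustrated with a minimal Python sketch. It is only a sketch under simplifying assumptions: the read is error-free, it starts immediately downstream of the primer, and the boundary is taken as the first base at which the read stops matching the known cDNA. The function name `find_divergence` and the toy sequences are hypothetical and are not part of the described method.

```python
from typing import Optional


def find_divergence(cdna: str, primer_end: int, read: str) -> Optional[int]:
    """Return how many read bases match the cDNA before the read diverges.

    cdna       -- known cDNA sequence, 5'->3'
    primer_end -- 0-based index in cdna of the first base after the primer
    read       -- sequence obtained from the genomic template, starting
                  immediately downstream of the primer (assumed error-free)

    Returns None if the read matches the cDNA over its whole length.
    """
    expected = cdna[primer_end:primer_end + len(read)]
    for offset, (read_base, cdna_base) in enumerate(zip(read, expected)):
        if read_base != cdna_base:
            return offset
    if len(read) > len(expected):
        # The read runs past the 3' end of the cDNA into flanking sequence.
        return len(expected)
    return None


# Toy data (hypothetical): two exons joined in the cDNA, split by an intron
# in the genomic template.
exon1, exon2 = "ATGGCCAAG", "TTTGGCACC"
intron = "GTAAGTCCCCTTTTTTCTCTCAG"
cdna = exon1 + exon2
genomic = exon1 + intron + exon2

primer_end = 4                                   # primer covers cdna[0:4]
read = genomic[primer_end:primer_end + 13]       # read from the genomic template
offset = find_divergence(cdna, primer_end, read)
exon_end = primer_end + offset                   # first non-exon base in cDNA coordinates
```

Because the first intron base can coincide with the first base of the next exon, a first-mismatch rule can place the boundary a base or two too far downstream; real reads would also call for an alignment that tolerates sequencing errors.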

As described herein, because much of the DNA sequence of a gene is not likely to contain gene structure information or phenotypic allelic variation information, the vast portion need not be sequenced to determine gene structure or even to determine most of the sequence of the gene that can affect phenotype. Other than regulatory sequences in the 5' non-transcribed region, such as the enhancer and the promoter, the 5' and 3' non-translated regions, and consensus sequences necessary for correct removal of introns, much of the non-coding sequence does not appear to affect gene expression phenotype. Therefore, all that is required for analysis of expression of a functional gene product is the sequence of each exon together with that portion of the intron that encompasses the consensus splice sequences, as well as conserved promoter and terminator sequences required for the minimal regulation of gene expression and stability of the mRNA product.

The present invention has several advantages. The present invention does not require prior knowledge of "genomic sequence," including boundaries between exon and non-exon sequence, nor knowledge of any sequence within the non-exon regions. The present invention requires much less work, and therefore saves time and money compared with traditional methods of determining gene structure, because the entire chromosomal copy of a gene need not be sequenced.

For example, using conventional techniques to determine the gene structure of a gene contained in a 150 kb BAC clone, including 6-fold sequencing coverage and labor, the cost would be at least 20 times more than the method of the present invention. If the 150 kb BAC clone contains coding sequence for a 2 kb cDNA, for example, the method of the present invention could provide the gene structure from 37 sequencing reactions using 30 primers.
This includes 20 primers designed for a first round of sequencing reactions where the primers hybridize at 200 base intervals on both strands of the cDNA. This estimate also includes a 25% failure rate in first round sequencing reactions, such that 5 sequencing reactions must be repeated, as well as a 50% failure of primers, such that 10 new primers must be synthesized and used in sequencing reactions to fill in any gaps. Thus the gene structure can be determined using the method of the present invention in two rounds of sequencing with a total of 25 primers and 25 sequencing reactions. One of ordinary skill in the art can readily determine if and when additional primers need to be designed for additional rounds of sequencing and how to design the additional primers. On the other hand, to sequence the entire 150 kb BAC clone, if each sequencing reaction yields 500 bases of sequence, a minimum of 300 sequencing reactions must be conducted with 300 primers. The time involved to sequence the entire BAC clone is also an important factor and is estimated at 2 months, in contrast to the estimated two weeks required in the present invention.

The present invention is also drawn to human cytochrome P450 2C19 sequence. More particularly, the present invention is drawn to SEQ ID NOS: 59, 61, 63, 65, 67, 71, 73, 75, 77, 79, 81, 84, 86, 89 and 91.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a schematic diagram of the present invention.
Figure 2 is a schematic diagram of the hybridization pattern of primers of Tables IV and V with the p53 cDNA, SEQ ID NO: 96.
Figure 3 is an alignment of primers on the P450 2C19 cDNA, SEQ ID NO: 58.
Figure 4 shows a sequence obtained using the P450 2C19 gene as template and the cDNA specific primers according to Example 2.
Figure 5 is the gene structure of human P450 2C19 as determined by the present invention in the form of a composite sequence, SEQ ID NOS: 59 and 97, where the underlined sequence is novel sequence and the primer hybridization sites and starting ATG are boxed.
Figure 6 is a schematic diagram of the human P450 2C19 gene.

DETAILED DESCRIPTION OF THE INVENTION

As used herein, the term "gene" refers to a contiguous stretch of deoxynucleotides comprising the basic unit of heredity of an organism, encoding a given protein or RNA. As used herein, the terms "gene" and also "genomic DNA" comprise one or more exons or parts thereof, one or more introns or parts thereof, all or a portion of the 5' untranslated region, and all or a portion of the 3' untranslated region.

As used herein, the term "gene structure" includes the coding regions or exons together with the exon-intron boundaries with at least 50 nucleotides of sequence of all intron termini as well as 5' and 3' UTR. The gene structure as determined by the present invention can also include promoter and enhancer sequences.

As used herein, the term "cDNA" refers to complementary DNA of an mRNA molecule. cDNA can represent the complete mRNA or a fragment thereof. RNA product can be mRNA, tRNA, rRNA or other structural RNA.

As used herein, the term "polymorphism" is an allelic variation in nucleic acid sequence between two or more samples. Such polymorphisms can be, for example, restriction fragment length polymorphism (RFLP), a variation in DNA sequence that alters the length of a restriction fragment (Botstein et al., Am. J. Hum. Genet. 32, 314-331 (1980)). Other polymorphisms include short tandem repeats (STRs) that include tandem di-, tri- and tetra-nucleotide repeated motifs. These tandem repeats are also referred to as variable number tandem repeat (VNTR) polymorphisms. VNTRs have been used in identity and paternity analysis (US 5,075,217; Armour et al., FEBS Lett. 307, 113-115 (1992); Horn et al., WO 91/14003; Jeffreys, EP 370,719), and in a large number of genetic mapping studies.

Other polymorphisms include single nucleotide variations between individuals of the same species. Such polymorphisms are far more frequent than RFLPs, STRs and VNTRs. Some single nucleotide polymorphisms (SNPs) occur in protein-coding sequences (coding sequence SNPs (cSNPs)), in which case one of the polymorphic forms may give rise to the expression of a defective or otherwise variant protein and, potentially, a genetic disease. Examples of genes in which polymorphisms within coding sequences give rise to genetic disease include β-globin (sickle cell anemia), apoE4 (Alzheimer's Disease), Factor V Leiden (thrombosis), and CFTR (cystic fibrosis).
cSNPs can alter the codon sequence of the gene and therefore specify an alternative amino acid. Such changes are called "missense" when another amino acid is substituted, and "nonsense" when the alternative codon specifies a stop signal in protein translation. When the cSNP does not alter the amino acid specified, the cSNP is called "silent". Other single nucleotide polymorphisms occur in noncoding regions. Some of these polymorphisms may also result in defective protein expression (e.g., as a result of defective splicing). Other single nucleotide polymorphisms have no phenotypic effects.

Genetic information is available or obtainable in several forms comprising genomic sequence (sequence of chromosomal versions of genes), or non-genomic sequence such as cDNA sequence or expressed sequence tag (EST) sequence. These types of sequences provide different levels of information regarding the structure of the gene of interest and the variations of the gene sequence that affect phenotype in the organism from which the sequence is derived.

The term "variation" or "polymorphism" implies that more than one version of the gene has been sequenced for comparison. Information on allelic variation is sometimes available for chromosomal copies of genes, if more than one example of a chromosomal copy has been sequenced, though often only one version is sequenced. Information on allelic variation for cDNA (and thus only the coding portion of a gene) is also sometimes available if more than one version has been sequenced. However, this information clearly does not include any information from, for example, regions upstream or downstream of the mature RNA or from the introns, and therefore does not provide complete information on the gene structure or the expression phenotype of a gene, where expression phenotype includes both expression level and structure of the gene product.
Finally, information on allelic variation is also likely to be available for EST sequences, as it is possible that more than one example of a given EST has been sequenced. However, like cDNA sequence, EST sequence does not provide information from, for example, regions upstream or downstream of the mature RNA or from the introns. Thus, the currently available sequence information does not readily provide complete information on phenotypic allelic variation or, where phenotypic variation could be available, the information is incomplete and lacks genetic structure information.

Other typical methods for determining complete or additional gene structure include generating PCR products based in part on known gene structure (Shiinoki et al., Metabolism, 48:581-584 (1999)) or sequencing PCR products wherein one primer is derived from cDNA sequence and the other primer is derived from Alu repetitive element sequence (Monani and Burgess, Genome Res. 6:1200-1206 (1996)). These methods have the disadvantage of requiring some prior knowledge of the gene structure and the additional step of PCR amplification of portions of the genomic sequence. Therefore, a rapid, cost effective method is needed to determine the useful sequence of a chromosomal copy of a gene of interest, for gene structure determination wherein prior knowledge of the gene structure and/or the sequencing of the entire nontranscribed portions of the gene are not required.

The method of the present invention provides gene structure, wherein gene structure includes the coding regions or exons together with the exon-intron boundaries (a point on a line that separates two regions) with at least about 50 nucleotides of sequence of all intron termini determined, as well as boundaries between the mature transcript and 5' and 3' UTR and at least about 50 nucleotides of sequence of the 5' and 3' UTR adjacent to the mature transcript.
For example, in the genomic copy of the gene of interest, two regions separated by a boundary are adjacent or contiguous in the genomic copy of the gene of interest.

The present invention is drawn to a method of determining boundaries between at least one exon and at least one non-exon region of a gene (where non-exon includes introns as well as sequence 5' and 3' of that which ultimately becomes the mature RNA sequence). The term "boundary," therefore, refers to the junction between exon and non-exon sequence. The term "sequence" refers to the arrangement of specific nucleotides within the specified polynucleic acid.

As used herein, the term "exon" refers to a segment or region of nucleotides within a eukaryotic gene that is retained in the mature RNA transcript, such as mRNA, tRNA and rRNA. As used herein, exons comprise coding sequences that encode part of the final gene product, and regulatory sequences, such as leader sequences. As used herein, "leader sequence" refers to nucleotide sequence at the 5' end of a gene of interest that is transcribed but is not part of the final gene product.

The method of the present invention comprises the steps of conducting one or more sequencing reactions, comprising a template and a primer or set of primers. In the method of the present invention, the template comprises a gene or fragment thereof (e.g., genomic sequence or genomic polynucleotide). The primer, or set thereof, comprises one or more oligonucleotides, wherein said oligonucleotides hybridize to the cDNA or RNA product of interest, wherein said cDNA or RNA product has known sequence. In one embodiment, the primers of the present invention comprise one or a set of oligonucleotides that hybridize to the coding and non-coding strand of said cDNA or to said RNA product. The primers are hybridized to the gene of interest or fragment thereof and used to prime template dependent nucleic acid polymerization.
Sequence obtained is compared with the known sequence of said cDNA or RNA product. Sequence obtained that is not within the sequence of the cDNA or RNA product reveals the boundaries between said exons and said non-exon regions and reveals non-exon sequence. One of ordinary skill in the art can readily assemble the sequence and boundary information thus obtained to generate the gene structure of said gene of interest.

Sequence of cDNA or RNA product of interest is obtained by methods well known in the art. For example, cDNA or RNA sequence of interest can be obtained from commercial or public databases, such as GenBank. In addition, cDNA or RNA can be obtained by standard laboratory protocols, such as those described in Chapters 7 and 8 in Molecular Cloning, a Laboratory Manual (Sambrook et al., Cold Spring Harbor Laboratory Press, (1989)). One of ordinary skill in the art would readily be able to either construct the necessary cDNA libraries and/or screen libraries for the desired cDNA or RNA using standard laboratory techniques. For example, libraries can be screened using antibodies specific for the encoded protein of interest or oligonucleotide probes that hybridize to the cDNA or RNA of interest as described in Chapter 12 of Sambrook et al. Furthermore, one of ordinary skill in the art can readily obtain a sequence of the cDNA or RNA from commercial sequencing companies, with commercial sequencing apparatus, or by following standard laboratory techniques, such as those provided in Chapter 13 of Sambrook et al.

Primers suitable for the methods described herein can be designed and produced using techniques well-known to those of skill in the art. As used herein, the term "primer" refers to an oligonucleotide suitable for the purpose of initiating template dependent nucleic acid synthesis. Said primer can comprise, for example, deoxyribonucleotides.
In one embodiment of the present invention, the primers are about 5 to about 50 nucleotides in length. In a preferred embodiment, the primer is about 20 nucleotides in length. In another embodiment, the primers have a Tm of about 42 to about 55°C. The primers do not have to be exactly complementary to the cDNA, as long as they specifically hybridize to one location of the template to be sequenced. However, if ambiguous sequence is obtained from a particular primer, one of ordinary skill in the art would readily be able to design a more suitable primer if necessary. To design and pick the primers, "primer picking" programs can be used, such as "Oligo 5.0" (MedProbe AS, Norway).

As used herein, the term "set of primers" comprises one or more primers, such that the primers hybridize to a polynucleotide strand of interest. In one embodiment, the primers hybridize to the polynucleotide of interest at discrete intervals. In another embodiment, the primers hybridize to the polynucleotide of interest at intervals of about 50 to about 500 nucleotides. In one embodiment, the primers hybridize at intervals of about 100 to about 300 nucleotides. In still another embodiment, the primers hybridize at intervals of about 100 to about 200 nucleotides. In another embodiment of the present invention, the primers hybridize to said cDNA at evenly spaced intervals. In another embodiment of the present invention, the set of primers hybridizes at similarly or evenly spaced intervals on both strands of a double-stranded polynucleotide of interest. Primers that hybridize with said cDNA or RNA product at similarly or evenly spaced intervals are referred to herein as "even spaced" or "tiled primers". In one embodiment, the primers are designed such that sequence information generated from one primer extends at least until the 5' terminus of the next downstream primer, if there is no intervening sequence. In this way, no intervening sequences are missed by this method.
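The tiling of primers along both strands of a cDNA, as described above, can be sketched in Python. This is an illustrative sketch only: the function name `tile_primers`, the 1-based inclusive coordinates, and the strictly even spacing are assumptions made for the illustration. In practice, primer sites would also be chosen on Tm and specificity (e.g., with primer-picking software), so actual spacing is only approximately even.

```python
from typing import List, Tuple

Site = Tuple[int, int]  # (5' end, 3' end) of a primer site in cDNA coordinates


def tile_primers(cdna_len: int, primer_len: int = 20, gap: int = 100,
                 read_len: int = 300) -> Tuple[List[Site], List[Site]]:
    """Place evenly spaced primer sites on both strands of a cDNA.

    Coordinates are 1-based and inclusive, numbered along the upper strand.
    Lower-strand sites are written high-to-low, i.e. in the 5'->3' direction
    of the lower-strand primer.
    """
    if read_len < gap:
        # A read must reach the 5' terminus of the next downstream primer,
        # otherwise an intervening sequence in the gap could be missed.
        raise ValueError("read length is shorter than the gap between primers")

    stride = primer_len + gap
    upper: List[Site] = []
    start = 1
    while start + primer_len - 1 <= cdna_len:
        upper.append((start, start + primer_len - 1))
        start += stride

    # Mirror the upper-strand sites onto the lower strand.
    lower = [(cdna_len - s + 1, cdna_len - e + 1) for s, e in upper]
    return upper, lower


upper_sites, lower_sites = tile_primers(cdna_len=1000)
```

Calling `tile_primers` with a different `gap` or `primer_len` shows how the number of primers, and hence the number of sequencing reactions, scales with the chosen interval.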
For example, if the gene structure of a cDNA comprising 1000 base pairs in length is to be analyzed using the method of the present invention and a distance of 100 bases between primers and a primer length of 20 bases, then primers could be designed such that they hybridize at about nucleotides 1-20, 120-140, 240-260, 360-380, 480-500, 600-620, 720-740, 840-860 and 960-980 of one strand and bases 1000-980, 880-860, 760-740, 640-620, 520-500, 400-380, 280-260, 160-140 and 40-20 of the opposite strand. It is understood that one of ordinary skill in the art can readily determine the spacing or interval between primers based on chosen sequencing parameters, polymerase used, hybridization conditions and conditions of sequencing such as buffer concentration and salt concentration, as well as primer length and hybridization temperature. Furthermore, one of ordinary skill in the art can ensure primer coverage of the cDNA by routine optimization.

One of ordinary skill in the art can optimize the number of primers necessary based on known information about the cDNA. To further reduce costs, for example, the number of primers can be reduced. For example, domains of the cDNA that are thought to be typically encoded by one exon or a known number of exons in a given pattern may not require multiple internal primers. In another example, the gene structure of the cDNA may be known for another organism. Therefore, primers can be designed to hybridize near putative boundaries.

As described above, the primers are used to prime sequencing reactions of at least one template comprising all or a portion of the gene of interest. The present invention can be used with any nucleic acid sequence from eukaryotic, archaebacterial or viral sources wherein regions of said sequence have been processed, e.g., joined together by excising intervening sequences present in the original parent molecule.
For example, the primers are designed using the processed molecule and the template is the original parent molecule. In one embodiment of the present invention, the eukaryotic source comprises fungal, plant, mammalian and non-mammalian sources.

Using standard techniques in the art, the gene or fragment thereof to be used as template can be isolated from any tissue, fluid or extract from an organism comprising said polynucleic acid of interest. In addition, said polynucleic acid of interest can be derived from libraries in the form of artificial chromosome libraries. Such libraries contain chromosomal DNA in excess of 100 kilobases in length. For example, such libraries can be yeast artificial chromosome (YAC), bacterial artificial chromosome (BAC) or P1 artificial chromosome (PAC) libraries. BAC and PAC libraries are especially useful because these are bacterial plasmid based vectors that can be easily isolated, manipulated and amplified. Such libraries are well known in the art and commercially available. For example, one of ordinary skill in the art can isolate the template as described in Example 2. Templates of various lengths can be used. Uncloned genomic DNA can be used as template. One of ordinary skill in the art can readily determine the optimal conditions for using uncloned genomic DNA. For example, the smaller the size of the genomic DNA or fragment thereof, the higher the signal to noise ratio and the less template can be used. In one embodiment of the present invention, the template is about 10 to about 500 kilobases in length and about 250 nanograms to 2.5 micrograms is used.

The method of the present invention is useful to determine the boundaries between regions of nucleic acid that were separated by intervening sequence wherein said intervening sequence has been removed.
For example, cDNA can be analyzed, wherein the boundaries between the exons comprising the cDNA and the introns present in the gene are determined. In addition, the method of the present invention is useful for the determination of boundaries present in genes containing group I type introns such as Tetrahymena rRNA, where self-splicing occurs in the presence of guanosine cofactor.

The method of the present invention provides sequence extending into the non-exon regions of the gene of interest. In one embodiment, the present invention provides sequence information of the promoter and enhancer upstream of the 5' UTR of the cDNA. In one embodiment, the present invention provides sequence upstream of the 5'-most exon, wherein the 5'-most exon is up to about 500 base pairs before the transcription initiation site. In another embodiment of the present invention, sequence upstream of the transcription initiation site is provided, comprising the promoter of the gene of interest.

In one embodiment, about 300 bases of sequence are obtained from each sequencing reaction from each primer. However, it is not necessary to obtain 300 bases of new sequence. In another embodiment, at least about 50 bases of intron sequence at the boundary of the exon and non-exon region is provided. In still another embodiment, the sequencing reactions can be conducted simultaneously in a multiplex assay so long as the sequence information can be unambiguously assigned to a given primer.

As described, the non-exon regions comprise sequence upstream and downstream of the mature RNA, as well as intron sequence. It is well known in the art that eukaryotic gene structure comprises promoter and enhancer sequences 5' to the coding sequence, followed by a terminator sequence on the 3' side of the coding sequence. For example, eukaryotic genes are transcribed into a primary RNA transcript which comprises a 5' untranslated region (5' UTR) with introns upstream of the start codon or ATG, followed by the coding sequence, interrupted by introns, followed by a stop codon such as TAA, followed by a 3' untranslated region (3' UTR) and ending in a polyA tail. Said primary RNA transcripts are also referred to herein as "pre-mRNA" and as "heterogeneous nuclear RNA" or "hnRNA". Introns, if present, are removed from the 5' UTR and from the coding sequence to generate a mature transcript.

The intronic sequences of a gene generally do not contain sequence useful in the removal of introns except near the 5' and 3' termini (e.g., within 50 bases of the boundary). The 5' and 3' termini of the intron sequences contain the donor and the acceptor sites for splicing or removal of the introns. These sites are known to contain consensus sequences that are required by the splicing machinery of the cell to properly excise the intron sequences in order to generate mature RNA product such as mRNA. Mutations to such consensus sequences prevent the accurate removal of introns. Therefore, not only are the sequences of the exons important, but the consensus sequences within the introns are also important. Donor consensus sequence, for example, comprises SEQ ID NO: 1, AGGTAAGT, wherein the first two nucleotides, AG, are present within the exon and the last six nucleotides, GTAAGT, are present within the intron.
In particular, as shown in Table I, the first two bases of the consensus sequence in the intron (GT) are absolutely conserved. The method of the present invention would provide this sequence information as well as sequence further into the intron, without the need to sequence the entire intron or the entire gene.
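As an illustration of how a candidate exon/intron boundary might be checked against the donor consensus, the following Python sketch compares the eight bases spanning the boundary with AGGTAAGT (SEQ ID NO: 1) and tests for the conserved GT. The function name `check_donor_site` and the toy sequences are hypothetical; the sketch is not a splice-site predictor.

```python
from typing import Tuple

# Donor consensus (SEQ ID NO: 1): the first two bases lie in the exon,
# the last six in the intron.
DONOR_CONSENSUS = "AGGTAAGT"


def check_donor_site(genomic: str, intron_start: int) -> Tuple[bool, int]:
    """Compare the bases around a candidate donor boundary with the consensus.

    genomic      -- genomic sequence read across the boundary, 5'->3'
    intron_start -- 0-based index of the first intron base in `genomic`

    Returns (has_gt, matches): whether the first two intron bases are GT,
    and how many of the 8 consensus positions match.
    """
    if intron_start < 2 or intron_start + 6 > len(genomic):
        raise ValueError("need 2 exon bases and 6 intron bases around the boundary")
    window = genomic[intron_start - 2:intron_start + 6].upper()
    has_gt = window[2:4] == "GT"
    matches = sum(a == b for a, b in zip(window, DONOR_CONSENSUS))
    return has_gt, matches


# Toy sequences (hypothetical), both with the boundary at index 9.
perfect = check_donor_site("ATGGCCAAG" + "GTAAGTCCCC", 9)
variant = check_donor_site("ATGGCCAAC" + "GCAAGTCC", 9)
```

A boundary lacking the GT dinucleotide is a cue to re-examine the alignment, for instance by shifting the candidate boundary by a base or two.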

Table I*

Sequence derived from:   Exon          Intron
Position                 1     2       3     4     5     6     7     8
Base                     A     G       G     T     A     A     G     T
Frequency                .64   .73     1     1     .62   .68   .84   .63

*Information derived from Figure 30.3 in: Genes VI, by B. Lewin, Oxford University Press (1997).

Furthermore, as shown in Table II, the 3' terminus of an intron, otherwise known as the acceptor site, comprises the sequence 12Py NCAGN, wherein 12Py stands for 12 pyrimidine bases and N stands for A, G, C or T, and wherein the last nucleotide is present in the exon and the remaining nucleotides are present at the 3' terminus of the intron.

Table II*

Sequence derived from:   Intron                               Exon
Position                 1-12    13    14    15    16         17
Base                     12Py    N     C     A     G          N
Frequency                -       -     .65   1     1

*Information derived from Figure 30.3 in: Genes VI, by B. Lewin, Oxford University Press (1997).

The method of the present invention provides this sequence without the need to sequence the entire intron or the entire gene. In addition, 18-24 nucleotides upstream of the 3' splice site within the intron comprises a "branch site." This branch site is another consensus site necessary for the proper removal of the intron. In yeast, this consensus sequence is highly conserved and comprises SEQ ID NO: 2, UACUAAC. However, other eukaryotic branch site consensus sequences are not highly conserved and comprise a sequence of 7 nucleotides in length having a sequence according to Table III.

Table III*

Branch Point Sequence
Position     1     2     3     4     5     6     7
Base         Py    N     Py    Py    Pu    A     Py
Frequency    .80   -     .80   .87   .75         .95

*Information derived from Figure 30.3 in: Genes VI, by B. Lewin, Oxford University Press (1997). Py = pyrimidine, Pu = purine.

The adenosine residue at position 6 in the branch point sequence is required for proper intron removal. The adenine at position 6 is the site at which the lariat between the 5' end of the intron and the internal portion of the intron is formed through a 5'-2' phosphodiester bond. The method of the present invention also provides this sequence without having to sequence the entire intron or the entire gene. The present invention also provides sequence present on the 3' side of the mature RNA of interest.

The method of the present invention provides at least about 50 nucleotides of sequence from each primer. Therefore, if the primer hybridizes near the boundary between an exon and non-exon, then at least about 50 nucleotides of non-exon sequence is provided. This sequence is sufficient in length to define genomic consensus sequences that are required for transcription, proper removal of introns and translation of said gene to generate functional gene product.

Example 1: p53 In Silico Experiment

p53 was chosen as an in silico test for the present invention. The cDNA of p53 is approximately 1.3 kilobases in length. Primers were designed using primer design software (Oligo 5.0, MedProbe AS, Oslo, Norway). The parameters used for the software were: 50 mM monovalent salt, and the Tm was chosen to be between 42 and 55°C. Oligonucleotides of 20 bases in length were generated using both the coding and the non-coding strand of p53 cDNA.
Of the complete set of oligos described above and without regard to the genomic structure of p53 (a blind selection based only on hybridization parameters and distance between primers), oligos that were separated on the respective strand of cDNA by about 90 to about 195 nucleotides were chosen for further analysis (Tables IV and V).

Table IV: Primers for Upper Strand

Oligo    Position   Distance between   Tm      Sequence                SEQ ID
Number   in cDNA    primers            (°C)                            NO
1        47         -                  54.7    ACACTTTGCGTTCGGGCTGG    3
2        137        90                 51.8    TGGAGGAGCCGCAGTAGAT     4
3        321        184                50.7    AGCTCCCAGAATGCCAGAGG    5
4        516        195                48.8    CCCTGCCCTCAACAAGATGT    6
5        615        99                 44.3    GGCCATCTACAAGCAGTCAC    7
6        792        177                49.8    CTATGAGCCGCCTGAGGTTG    8
7        933        141                49.2    ACGGAACAGCTTTGAGGTGC    9
8        1124       191                53.0    TTCAGATCCGTGGGCGTGAG    10
9        1257       133                47.1    TCAGTCTACCTCCCGCCATA    11

Table V: Primers for Lower Strand

Oligo    Position   Distance between   Tm      Sequence                SEQ ID
Number   in cDNA    primers            (°C)                            NO
1        46         -                  52.9    CAGCCCGAACGCAAAGTGTC    12
2*       194        148                -       AAGTAGTTTCCATAGGTCTG    13
3*       325        131                -       GCAGCCTCTGGCATTCTGGG    14
4        516        191                48.8    ACATCTTGTTGAGGGCAGGG    15
5        640        124                52.1    CGCCTCACAACCTCCGTCAT    16
6        792        152                49.8    CAACCTCAGGCGGCTCATAG    17
7        963        171                53.0    CCGGTCTCTCCCAGGACAGG    18
8        1082       119                49.3    TGGTTTCTTCTTTGGCTGGG    19
9        1256       174                47.9    ATGGCGGGAGGTAGACTGAC    20

* Oligos were user-defined.

The selected primers were aligned on the genomic sequence of p53 as shown in Figure 2. Each of primers 1-5, 7 and 9 from Table IV hybridized completely within an exon. Primers 6 and 8 hybridized at an intron/exon boundary and are therefore not expected to result in a successful sequencing reaction. It can readily be seen by one of ordinary skill in the art that sequencing reactions using these primers and a genomic copy of p53 reveal all intron-exon boundaries and all useful intronic sequence.
Thus, relevant sequence information from the p53 gene is extracted from about 350 bases of sequence information, including exon/intron boundaries, enhancer, promoter and intron consensus sequences. When added to the sequence of the cDNA (1.3 kilobases), the complete gene structure and relevant sequence information for phenotypic allelic variations is obtained. This complete information is obtained from about 1650 bases of sequence rather than the entire sequence of the p53 gene (over 20 kilobases).
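The spacing constraint and the sequence budget in Example 1 reduce to simple arithmetic. The sketch below checks them against the primer positions listed in Tables IV and V. The variable and function names are illustrative, and the 20,000-base gene length is only the lower bound implied by "over 20 kilobases."

```python
# Primer positions on the p53 cDNA, taken from Tables IV and V.
UPPER_POSITIONS = [47, 137, 321, 516, 615, 792, 933, 1124, 1257]
LOWER_POSITIONS = [46, 194, 325, 516, 640, 792, 963, 1082, 1256]


def spacings(positions):
    """Distance between each primer and the previous one on the same strand."""
    return [b - a for a, b in zip(positions, positions[1:])]


def within_window(positions, low=90, high=195):
    """True if every neighbouring pair is separated by low..high nucleotides."""
    return all(low <= d <= high for d in spacings(positions))


# Sequence budget quoted in the example.
CDNA_BASES = 1300          # p53 cDNA, approximately 1.3 kilobases
NEW_GENOMIC_BASES = 350    # boundary and consensus sequence obtained
GENE_BASES_MIN = 20000     # "over 20 kilobases": a lower bound, not a measurement

total_bases = CDNA_BASES + NEW_GENOMIC_BASES
fraction_of_gene = total_bases / GENE_BASES_MIN

print(spacings(UPPER_POSITIONS))
print(within_window(UPPER_POSITIONS), within_window(LOWER_POSITIONS))
print(total_bases, fraction_of_gene)
```

With these figures, the information needed amounts to 1650 bases, which is at most roughly 8% of the length of the gene.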

Example 2: Boundary Determination of Human Cytochrome P450 2C19

Screening and Isolation of a Bacterial Artificial Chromosome Encoding the Human Cytochrome P450 2C19 Gene (CYP450 2C19 gene)

Four members of the Cytochrome P450 2C subfamily are known: CYP450 2C8, 2C9, 2C18 and 2C19 (Ieiri and Higuchi, J. Toxicol. Sci. 23:129-131 (1998)). The CYP450 2C19 gene is flanked by two other members of the CYP450 2C family, CYP450 2C18 and CYP450 2C9 (Gray, et al., Genomics 28:328-332 (1995)). In order to ensure that a BAC containing the entire CYP450 2C19 gene was isolated, gene-specific primers were designed such that amplicons would be generated from the 5' end, the middle and the 3' end of the coding region (Table VI). For an amplicon from the putative boundary between intron 4 and exon 5, primers were taken as published in the partial gene structure (de Morais, et al., Mol. Pharmacol. 46:594-598 (1994)).

The primers were used for primary PCR screening of 48 human BAC DNA pools from Research Genetics (Huntsville, Alabama). PCR reactions were carried out using 100 µM dNTP, 1.5 mM units Amplitaq™ (PE BioSystems, Foster City, California) in a final volume of 14 µl. Cycling conditions were as follows: 94°C for 2 minutes, then 35 cycles of 94°C for 30 seconds, 30 seconds at the appropriate annealing temperature (Tm, Table VI) and 45 seconds at 72°C, followed by a final extension at 72°C for 7 minutes. For each primer, the positive pools from the primary screening were subjected to secondary screening. For a secondary screening, each pool was split into 48 samples, which consisted of 10 plate pools, 14 row pools and 24 column pools. Based on the secondary screen, eight clones were identified as potentially containing the human CYP450 2C19 gene. These clones were streaked on agar plates containing 12.5 µg/ml chloramphenicol and incubated at 37°C for 48 hours.
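One way to read the secondary screen is as a three-dimensional pooling scheme in which a clone is located by the plate pool, row pool and column pool that test positive. The sketch below illustrates that reading only. The pooling logic, the "plate-row-column" address format and the example labels are assumptions, not details confirmed by the supplier's protocol.

```python
from itertools import product

# Composition of the 48 secondary-screen samples described above.
PLATE_POOLS, ROW_POOLS, COLUMN_POOLS = 10, 14, 24


def candidate_addresses(positive_plates, positive_rows, positive_columns):
    """Return every plate-row-column address consistent with the positive pools.

    With exactly one positive pool per dimension the address is unique;
    several positives in a dimension leave several candidates, each of
    which has to be confirmed by colony PCR.
    """
    return [f"{plate}-{row}-{column}"
            for plate, row, column in product(positive_plates,
                                              positive_rows,
                                              positive_columns)]


# Hypothetical read-outs, for illustration only.
print(candidate_addresses(["421"], ["N"], [11]))
print(candidate_addresses(["421"], ["N", "C"], [11]))
```

Ambiguous read-outs of the second kind are one reason several candidate clones would then be streaked and re-tested colony by colony.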

Table VI: Primer pairs used to screen for a BAC clone harboring the full-length CYP450 2C19 gene

Amplicon   Primer Set      Primer Sequences                    SEQ ID   Tm     Predicted Product
#          Name                                                NO       (°C)   Size (bp)
1          2C19_1_1F/1R    F: 5'-gatccttttgtggtcttgt-3'        21       55     152
                           R: 5'-ttgctgacatccttaatatct-3'      22
2          2C19_4_1F/1R    F: 5'-aattacaaccagagcttggc-3'       23       60     168
                           R: 5'-tatcactttccataaaagcaag-3'     24
3          2C19_9_1F/1R    F: 5'-ctgctcctgtgctgtc-3'           25       60     151
                           R: 5'-atatttgcacagtgaaacttt-3'      26

For each clone, two colonies were picked using a sterile pipet tip and suspended in 50 µl sterile double-distilled water. Suspended colonies (5 µl/reaction) were screened for 3 PCR amplicons as described above. Only 3 colonies were positive for all 3 amplicons. Of these, the clone with the plate address 421-N-11 was used as template for "rapid gene structure determination" as described below.

Rapid Determination of CYP450 2C19 Gene Structure

BAC-DNA was isolated on a large scale as follows.

1. A single BAC colony was picked and inoculated in a starter culture of 5 ml medium (LB medium containing 12.5 µg/ml chloramphenicol). The culture was shaken vigorously at 37°C until the OD 600 nm read between 1.0-1.5 (6-8 hrs). OD 600 should be maintained at 1.0-2.0; however, if the growth exceeds this limit, less pre-culture volume per 500 ml culture can be used.
2. 2.5-5.0 ml of the starter culture was inoculated into 500 ml of LB chloramphenicol and grown at 37°C for 14-16 hrs with vigorous shaking (horizontal gyrator platform at ~250 rpm).

3. The 500 ml culture was poured into a 1 liter centrifuge bottle, and harvested by centrifugation at 4500 x g (GSA rotor at 5100 rpm) for 20 min. at 4°C.
4. The supernatant was discarded by decanting it into a waste beaker.
5. The bottle was kept inverted and drained on a paper towel.
6. Each bacterial pellet was gently and completely resuspended in 100 ml of ice-cold 10 mM Tris (pH 8.0) and transferred to a 250 ml centrifuge bottle.
7. The resuspended bacteria were centrifuged at 4500 x g (GSA rotor at 5100 rpm) for 20 min. at 4°C.
8. Steps 4 to 7 were repeated one time.
9. Each bacterial pellet was gently and completely resuspended in 50 ml of ice-cold Qiagen™ Buffer P1 (containing RNAse A (100 µg/ml) as per Qiagen instructions) (Valencia, California) and incubated for 10 min. at room temperature (e.g. 24.0°C).
Note: Addition of Buffer P2 is the most critical step to keep the E. coli contamination low. Buffer P2 must be quickly and completely distributed throughout the cell suspension after its addition.
10. 50 ml of Buffer P2 was added and the bottle capped and very slowly inverted 4-6 times, NOT vortexed or shaken vigorously.
11. The bottle was incubated undisturbed at room temperature for 15 minutes.
12. 50 ml ice-cold Buffer P3 was added to each bottle, mixed immediately by gently inverting 4-6 times, and incubated on ice for 30 min.
13. The bottles were centrifuged at 20,000 x g (GSA rotor at 11,000 rpm) for 30 min. at 4°C.
14. The bottles were VERY GENTLY recovered from the centrifuge without disturbing the pellet. The bottles were placed in such a way that they did not move at all while the supernatant was recovered. The supernatant was removed promptly using a 25 ml pipette and transferred to a fresh 250 ml bottle.
15. The supernatant was re-centrifuged at 20,000 x g for 15 min. at 4°C. The supernatant was promptly removed and kept on ice. The total volume was about 150 ml.

Note: Filter through cheesecloth to remove cell debris if necessary.
16. Two QIAGEN-tip 500™ columns were equilibrated by applying 20 ml Buffer QBT (see manufacturer's instructions), and allowed to empty by gravity.
17. 75 ml of supernatant (half of the supernatant from step 14) was applied to each QIAGEN-tip 500 column and allowed to enter the resin by gravity flow.
18. Each QIAGEN-tip 500 column was washed with 3 x 25 ml Buffer QC (see manufacturer's instructions).
19. Buffer QF (Qiagen) was pre-warmed in a water bath at 65°C and the DNA was eluted from the Qiagen-tip with 4 successive aliquots of 5 ml of pre-warmed Buffer QF for a total volume of 20 ml per tip.
20. The eluted DNA (20 ml total) was transferred to a 45 ml centrifuge tube.
21. 14 ml of room temperature isopropanol was added to each 20 ml DNA solution. The contents were mixed by inverting gently several times.
22. The tubes were centrifuged immediately at >15,000 x g (SA 600 rotor at 11,000 rpm) for 30 min. at 4°C. The supernatant was carefully discarded by decanting. This was done as soon as the centrifuge came to a stop.
Note: The pellet will be BOTH at the bottom of the tube AND as a streak on the tube wall, so be very gentle. The pellet may become detached.
23. The excess ethanol was quickly drained by inverting over blotting paper for 5 min. The outside of the tube was marked to indicate the location of the pellet.
24. To each DNA pellet was added 2 ml of -20°C 70% ethanol. The tube was gently rotated such that the ethanol washed the bottom and the walls of the tube. The tubes were allowed to stand undisturbed for 5 min. at RT, then centrifuged at 10,000 x g for 10 min.
25. The supernatant was promptly (i.e. as soon as the centrifuge came to a stop) discarded by decanting it from the tube. The tube was allowed to stand inverted on blotting paper for 5 min. to drain.
26. The pellet was air-dried for 10 min. by laying the open tube on its side.

27. The DNA was dissolved by adding 150 µl of Tris EDTA (TE) (pH 8.0). The tube was vortexed gently to ensure that most of the DNA was dissolved. The tube was spun for 5 min. to collect the solution to the bottom of the tube. The tube was left at 4°C overnight to allow for the DNA to completely dissolve.

The BAC-DNA was directly sequenced as follows, except that 250 nanograms of BAC-DNA template was used instead of 2.5 micrograms of genomic DNA.

DIRECT GENOMIC DNA SEQUENCING

Setting up sequencing reactions.

Note: Table VII below allows for DNA concentrations of 0.31-2.5 µg/µl. DNA with a lower concentration may be used if the primer is used at a higher concentration.

1. Sequencing mix (1 rxn) was prepared for each primer reaction as follows:

TABLE VII

                              Volume (µl)
Genomic DNA (2.5 µg) (1)      X
BigDye Sequence Mix (2)       16
5x CSA buffer (3)             8
Primer (3.2 µM) (4)           8
H2O (nuclease free)           8-X
Total Volume                  40 µl

(1) Genomic DNA should be of high quality (e.g., OD 260/280 = 1.7-1.9) and be quantitated accurately, e.g., by fluorometry and by agarose gel electrophoresis. The DNA does not need to be of a certain size.
(2) BigDye™ Mix: Perkin Elmer/ABI BigDye™ Terminator Ready Reaction Mix with AmpliTaq® FS, Part number 4303151 (for 5000 rxn kit).
(3) 5X CSA buffer, Perkin Elmer Part number 361058C.
(4) Sequencing primers: non-vector, Conc. = 3.2 µM. Primer picking programs may be used with the following requirements:
    Tm = 50°C
    %GC = 50%
    Oligo length 18-22 bp
    Avoid designing A or T at the first two bases at either the 3' or the 5' ends
    Avoid more than two consecutive G or C at either the 3' or the 5' ends

2. The sequencing cocktail was vortexed to ensure it was well-mixed.
3. The PCR tubes were capped tightly and then quickly spun to collect all the reagents.
4. In a thermocycler, the following program was run:
    1) 95°C for 5 minutes
    2) 95°C for 30 seconds
    3) 55°C for 20 seconds
    4) 65°C for 4 minutes
    5) Go to step 2 an additional 99 times
    6) 4°C hold
    7) End
5. At the end of the sequencing reaction, the plate was taken out of the thermocycler and quickly spun to collect the contents.

Post-Sequencing Cleanup.

One column (Centri-Sep™, Princeton Separations #CS901) was used per sequencing reaction, and the columns were non-reusable. Column hydration (Step 1, below) was performed on the same day as the sequencing reactions. The next day, the sequencing reactions were processed through the pre-hydrated columns.

Column Hydration

1. The column was tapped gently to ensure that the dry gel had settled to the bottom of the spin column.
2. The top column cap was removed and the column reconstituted by adding 0.8 ml of nuclease-free water. The column end stopper was left in place so that the column could stand up by itself.
3. The column cap was replaced and the gel was hydrated. It is important to hydrate all of the dry gel. Very effective mixing and hydration was accomplished by vigorous agitation on a vortex mixer. The column was incubated at least 30 minutes at room temperature before being stored in a refrigerator overnight.

4. Next day, the column was removed from the refrigerator and allowed to warm to room temperature before continuing with this procedure.

Removal of Interstitial Fluid

1. Trapped air bubbles were removed from the column gel by vigorously tapping the column, allowing the air bubbles to rise to the surface. Particular attention was paid to the bottom of the column. The gel was allowed to settle. After the gel had settled and was free of bubbles, the column cap was removed, and then the column end stopper was removed from the bottom.
Note: The order of removing caps prevents further bubble formation.
2. Excess column liquid was allowed to drain by gravity into a 2-ml wash tube while in the microtube rack. If the liquid does not begin to flow through the filtered end of the column in a reasonable time (about 1 minute), give it a quick spin in a tabletop microcentrifuge at up to 750g. After the column stopped draining, the water was discarded and the column was put back into the same wash tube.
Note: While using the tabletop microcentrifuge, it is important to be aware of the orientation of the columns through all subsequent steps in this procedure. Place an orientation mark on the top rim of the spin column and keep it pointing to the outside of the rotor at all times.
3. The column and its wash tube were placed in the centrifuge. The columns and wash tubes were spun in the microcentrifuge at 750g for 2 minutes.
4. After the centrifugation, drops of water at the end of the column were blotted dry. The wash tubes and the interstitial fluid were discarded. The gel material was not allowed to dry excessively. The samples were processed within the next 2-3 minutes.

Sample processing

1. The entire reaction mixture was transferred to the top of the gel. The sample was carefully dispensed directly onto the center of the gel bed at the top of the column, without disturbing the gel surface.

Note: Do not contact the sides of the column with the reaction mixture or the sample pipet tip (because the sample may bypass going into the pores of the gel).
2. The marked columns were placed into the labeled 1.5-ml sample collection tube. Proper column orientation was maintained. The column and collection tube were spun in the tabletop microcentrifuge at 750g for 2 minutes.
3. The sample was transferred from the collection tube into a plate format so that it was easier to load the samples onto the sequencing gel. The samples were dried in a plate vacuum centrifuge, applying medium heat, and checked after 30 minutes. High temperature was not used. The sample was completely dry before transfer to the sequencing gel.

LOADING SEQUENCING GELS

1. Each well was resuspended with 1.5 µl of loading dye.
2. The plate was sealed.
3. The plate was quickly spun in a plate centrifuge at up to 800g.
4. The plate was thoroughly vortexed with a plate shaker for 5 minutes.
5. The plate was quickly spun again.
6. The samples were denatured in a thermocycler or heat block for 2 minutes at 90°C.
7. The samples were vortexed AGAIN for 1-2 minutes to ensure resuspension.
8. The plate was quickly spun again.
9. The samples were stored on ice until they were ready to be loaded.
10. 1.3 µl of the sample was loaded onto an ABI (P.E. BioSystems, Foster City, California) sequencer, and sequence data were collected following the manufacturer's instructions.

Primers were chosen such that they were spaced approximately 150 bp apart. One set of primers was complementary to the non-coding strand and a second set was complementary to the coding strand of the CYP450 2C19 cDNA sequence (lower, L, and upper, U, respectively). For the first round of sequencing, the primers were chosen with the software Oligo 5.0, using the same parameters as described in the p53 in silico experiment.
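The reaction set-up of Table VII and the cycling program given in the sequencing procedure above can be checked with a few lines of arithmetic. The sketch below is illustrative only. The function names are hypothetical, the component labels are shorthand for the rows of Table VII, and ramping time between temperatures is ignored.

```python
def sequencing_mix(dna_conc_ug_per_ul, dna_mass_ug=2.5):
    """Volumes (in µl) for one 40 µl reaction laid out as in Table VII.

    The template volume X is the DNA mass divided by its concentration, and
    water makes up the remaining 8 - X µl. 2.5 µg in 8 µl is 0.3125 µg/µl,
    which corresponds to the lower bound quoted as 0.31 µg/µl.
    """
    template = dna_mass_ug / dna_conc_ug_per_ul
    water = 8.0 - template
    if water < 0:
        raise ValueError("DNA too dilute: template volume would exceed 8 µl")
    return {
        "template": template,
        "bigdye_mix": 16.0,
        "csa_buffer_5x": 8.0,
        "primer_3_2_uM": 8.0,
        "water": water,
    }


def cycling_seconds(cycles=100):
    """Programmed hold time: 5 min at 95°C, then `cycles` rounds of
    30 s / 20 s / 4 min. One pass plus 99 repeats gives 100 cycles."""
    initial = 5 * 60
    per_cycle = 30 + 20 + 4 * 60
    return initial + cycles * per_cycle


mix = sequencing_mix(2.5)
print(mix, sum(mix.values()))
print(cycling_seconds() / 3600.0)
```

For DNA at 2.5 µg/µl this gives 1 µl of template and 7 µl of water, and the programmed holds add up to 29,300 seconds, a little over 8 hours per run. For the BAC template, `dna_mass_ug=0.25` reflects the 250 nanograms stated above.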
For the second sequencing round, where primers chosen by Oligo 5.0 failed and where cDNA sequence lacked coverage from the first round of sequencing, additional primers were chosen manually. All primers are listed in Table VIII. Hybridization of the primers is shown in Figure 3, which shows CYP450 2C19 cDNA sequence (AC #M61854, Romkes et al., Biochemistry 32:1390 (1993), SEQ ID NO: 58). The ATG start codon is underlined. In Figure 3, primers used for direct sequencing using the BAC-cDNA template described above are shown above and below the cDNA depending on whether they hybridize to the coding or non-coding strand, respectively. Primers with an asterisk yielded usable sequence. The coding region is shown in capital letters. SEQ ID NOS of the primers are provided in Table VIII.

Table VIII

Primer name    Sequence                  SEQ ID NO.   Successful sequence, comment
40L (1)        TGAAAGGAGAAGCAAACATGAG    27           NO, *
66U (2)        GGAGACAGAGCTCTGGGAGA      28           NO, *
192L (3)       GCCCTGTGTTCACTCTGTATTT    29           YES
235U (4)       GCTGCATGGATATGAAGTGG      30           YES
349L (5)       CTTCCATCTCTTTCCATTGCTG    31           YES
367U (6)       GAAGGAGATCCGGCGTTTCT      32           YES
506L (7)       GAGCACAGCCCAGGATGAAAGT    33           YES
508U (8)       TTTCATCCTGGGCTGTGCTC      34           NO, *
617L (9)       GGGTGCTTACAATCCTGATGTT    35           YES
680U (10)      TATTTCCCGGGAACCCATAA      36           YES
790L (11)      CAGGAAGCAATCAATAAAGTCC    37           NO, *
784U (12)      CCCTCGGGACTTTATTGATT      38           NO, *
947L (13)      CTGTGACCTCTGGGTGCTTCAG    39           NO, #
950U (14)      AAGCACCCAGAGGTCACAGC      40           NO, #
1063L (15)     GATGTATCTCTGGACCTCGTGC    41           YES
1108U (16)     CCATGCAGTGACCTGTGACG      42           YES
1231U (17)     CCCTCGTCACTTTCTGGATG      43           YES
1392L (18)     ACAGAAGCAAATCCATTGACAA    44           YES
1396U (19)     TGTTGTCAATGGATTTGCTT      45           YES
1522L (20)     GGGTCAGAAGAAGCATCACAGA    46           YES
1519U (21)     CTATCTGTGATGCTTCTTCT      47           YES
1690L (22)     TAATAATATGTTAATAACTC      48           NO, *
1238L (24)     TTCCACCTTCATCCAGAAAGTG    49           NO, *
43U (2°2)      ATGTTTGCTTCTCCTT          50           NO, *
110L (2°3)     CAATCACTGGAAGAGG          51           NO, *
126L (2°4)     ATCTGTAGGATATTTC          52           YES
948U (2°5)     TGAAGCACCCAGAGG           53           YES
948L (2°6)     CCTCTGGGTGTCTCA           54           YES
1064U (2°7)    CACGAGGTCCAGAGAT          55           YES
1106L (2°8)    AGGTCACTGCATGGGG          56           YES
1215U (2°9)    ACCCAGAGATGTTTGA          57           YES

Table VIII. Names and sequences of all primers used for direct BAC-DNA sequencing. Primers for the first round have a suffix of 1-22 and 24 in parentheses, whereas primers for the second round have a suffix of 2°2-2°9 in parentheses. The second round primers were positioned slightly off those priming positions which yielded no readable sequence during the first round of sequencing. Failed sequencing was mostly due to primer design (*) in purine-rich regions of the cDNA sequence. However, the primers 947L and 950U hybridized to an exon-intron boundary (#) and thus did not yield any sequence. The primers 948U and 948L of the second round solved that problem.

RESULTS

Of 23 primers designed and used during the first round of direct sequencing using BAC DNA template, 15 yielded good sequence. Using the information of the first round of direct sequencing, 8 primers were designed for a second round. Six of these primers gave excellent sequence information. Overall, 31 primers were used to decipher the gene structure with a total success rate of 65%.
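The overall success rate can be recomputed from the outcomes listed in Table VIII. The sketch below simply tallies the YES/NO column as it reads in this text, and the names are illustrative. Read this way, the table shows 14 usable first-round primers, so the tally is 20 of 31 (about 65%), in line with the overall rate stated above. The first-round count of 15 quoted in the Results would instead give 21 of 31 (about 68%).

```python
# Outcome of each primer as listed in Table VIII (True = usable sequence).
FIRST_ROUND = {
    "40L": False, "66U": False, "192L": True, "235U": True, "349L": True,
    "367U": True, "506L": True, "508U": False, "617L": True, "680U": True,
    "790L": False, "784U": False, "947L": False, "950U": False, "1063L": True,
    "1108U": True, "1231U": True, "1392L": True, "1396U": True, "1522L": True,
    "1519U": True, "1690L": False, "1238L": False,
}
SECOND_ROUND = {
    "43U": False, "110L": False, "126L": True, "948U": True,
    "948L": True, "1064U": True, "1106L": True, "1215U": True,
}


def tally(outcomes):
    """Return (number of successful primers, number of primers used)."""
    return sum(outcomes.values()), len(outcomes)


def success_rate(*rounds):
    """Fraction of successful primers pooled over all rounds given."""
    good = sum(tally(r)[0] for r in rounds)
    used = sum(tally(r)[1] for r in rounds)
    return good / used


print(tally(FIRST_ROUND), tally(SECOND_ROUND))
print(round(100 * success_rate(FIRST_ROUND, SECOND_ROUND)))
```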

Sequence Alignments and Gene Structure Building:

Sequences obtained using the direct primers from both rounds of sequencing were compared to the published CYP450 2C19 cDNA sequence. This was done by using the program "Bestfit" in the GCG-analysis program software package (Genetics Computer Group, Madison WI 53711 USA) and refined by manual editing.

Using the method of the present invention, all exon-intron boundaries of the CYP450 2C19 gene encoded on BAC 421-N-11 were resolved. As the primers were chosen to cover the entire cDNA, it is reasonably certain that no introns were missed. An area between primers 617L (9) and 680U (10) was not covered by direct sequencing. However, de Morais et al. (1994) have published the adjacent intron boundaries 5' of exon 5. The published sequences were added to the gene structure of human CYP450 2C19 to give SEQ ID NO: 59 (see Figure 5). The sequence obtained from 19 of the primers is shown in Figure 4, SEQ ID NOS. 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 83, 85, 87, 88, 90, 92 and 94, respectively. The non-underlined sequence is provided herein for the first time and includes sequence belonging to 5' and 3' untranslated regions or intronic regions. The novel sequences are assigned SEQ ID NOS. as follows.

Table IX

SEQ ID NO.   DESCRIPTION
61           Nucleotides 12 through 572 of SEQ ID NO. 60
63           Nucleotides 77 through 245 of SEQ ID NO. 62
65           Nucleotides 7 through 173 of SEQ ID NO. 64
67           Nucleotides 100 through 570 of SEQ ID NO. 66
69           Nucleotides 100 through 644 of SEQ ID NO. 68
71           Nucleotides 124 through 548 of SEQ ID NO. 70
73           Nucleotides 124 through 648 of SEQ ID NO. 72
75           Nucleotides 21 through 399 of SEQ ID NO. 74
77           Nucleotides 48 through 587 of SEQ ID NO. 76
79           Nucleotides 25 through 308 of SEQ ID NO. 78
81           Nucleotides 305 through 582 of SEQ ID NO. 80
84           Nucleotides 168 through 468 of SEQ ID NO. 83
86           Nucleotides 102 through 795 of SEQ ID NO. 85
89           Nucleotides 97 through 700 of SEQ ID NO. 88
91           Nucleotides 66 through 700 of SEQ ID NO. 90
93           Nucleotides 116 through 565 of SEQ ID NO. 92
95           Nucleotides 34 through 550 of SEQ ID NO. 94

Using the method of the present invention, at least 200 bp of previously unknown intron sequences adjacent to each exon end are provided, as shown in Figure 4. Figure 5 shows the 5' and 3' intron and 5' and 3' untranslated sequences provided by the method of the present invention, assembled with the published cDNA sequence (Romkes et al.). Published sequence is in capital letters, the ATG start codon is boxed, and positions of the primers are boxed. All newly discovered sequences are in lower case and underlined. Missing intron sequence is shown as a string of underlined "n."

A combination of all newly derived 5' and 3' sequence and intron sequence with the already published cDNA sequence is shown below (Figure 6). In summary, the coding sequence of human CYP450 2C19 is disrupted by 8 introns. A total of about 6,700 bp of previously unknown sequence was added to the 1,746 bp of published cDNA using the method of the present invention, including more than 600 bp of previously unknown 5' sequence, which should harbor the transcriptional control elements. Thus, relevant gene structure information, including all intron-exon boundaries and consensus splice sequences, is provided by the method of the present invention without having to sequence the entire gene.

While this invention has been particularly shown and described with references to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention encompassed by the appended claims.
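The comparison step used in Example 2, in which each genomic read is lined up against the cDNA and the point where the two diverge is taken as an exon boundary, can be sketched as follows. This is a deliberately simplified stand-in for the Bestfit alignment and manual editing described above: it uses exact matching from a seed, tolerates no sequencing errors, and the function name and seed length are assumptions.

```python
def exon_adjacent(cdna, read, seed=12):
    """Split a genomic read into its exon part and its non-exon part.

    The first `seed` bases of the read are located in the cDNA by exact
    match, and the match is extended base by base. Returns
    (cdna_boundary, exon_part, non_exon_part), where cdna_boundary is the
    0-based cDNA index just past the last matching base, or None when the
    seed is not found.

    Caveat: when the first intron base happens to equal the first base of
    the next exon, the match runs on by chance and the apparent boundary
    shifts by a base or more. Checking for the GT donor consensus helps
    place it correctly.
    """
    cdna, read = cdna.upper(), read.upper()
    start = cdna.find(read[:seed])
    if start < 0:
        return None
    n = 0
    while n < len(read) and start + n < len(cdna) and read[n] == cdna[start + n]:
        n += 1
    return start + n, read[:n], read[n:]


# Made-up toy sequences, for illustration only.
exon1 = "ATGGCCAAGCTTGACCTGAAG"
intron = "GTAAGTCCTTTCTTTTCTTGCAG"
exon2 = "CTGGATCCGTTAACC"
toy_cdna = exon1 + exon2
toy_read = (exon1 + intron)[3:]   # read primed inside exon 1, running into the intron

print(exon_adjacent(toy_cdna, toy_read))
```

In the toy case the read matches the cDNA up to index 21, the end of the first exon, and the remainder of the read begins with GT, as expected for a donor site.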

Claims

1. A method of determining gene structure when genomic sequence is unknown comprising sequencing the gene across exon-intron boundaries using evenly spaced primers, wherein the primers comprise nucleic acids that hybridize to cDNA sequence of the gene at about 100 to about 300 base intervals and the gene comprises the template, wherein cDNA sequence of said gene is known.

2. The method of Claim 1 further comprising sequencing the gene beyond known 5' and 3' regions to determine 5' and 3' untranslated region sequence.

3. The method of Claim 1, wherein the primers cover both strands of the gene.

4. A method of determining structure of a gene from corresponding cDNA where the cDNA sequence is known, comprising sequencing the gene wherein the genomic DNA is the template and the primers comprise evenly spaced primers obtained from the cDNA sequence.

5. The method of Claim 4, wherein the primers comprise nucleic acid about 5 to about 50 nucleotides long and wherein said primers hybridize to the cDNA at intervals of about 100 to about 300 nucleotides.

6. The method of Claim 5, wherein the primers cover both strands of the cDNA.

7. A method of identifying boundaries between at least one exon and at least one non-exon region of a gene, comprising the steps of:
a) conducting a plurality of sequencing reactions wherein each reaction comprises a template and primer, wherein the template comprises a gene or fragment thereof and wherein the primers hybridize to coding and non-coding strands of cDNA of said gene, wherein said cDNA has known sequence; and
b) determining the boundaries between said exons and said non-exon regions.

8. The method of Claim 7, wherein said primers hybridize to said cDNA at about 300 base intervals.

9. The method of Claim 8, wherein said primers hybridize to a coding and non-coding strand of said cDNA at about 200 base intervals.

10. The method of Claim 9, wherein said primers hybridize to a coding and non-coding strand of said cDNA at about 100 base intervals.

11. The method of Claim 7, wherein the gene is from eukaryotic, archaebacterial or viral sources.

12. The method of Claim 9, wherein said eukaryotic source includes fungal, plant, mammalian and non-mammalian sources.

13. The method of Claim 7, wherein step a) is repeated with additional primers.

14. A method of determining exon adjacent sequence, comprising:
a) contacting a genomic sequence of a cDNA of interest with primers, wherein the cDNA has known sequence and wherein the primers hybridize to coding and non-coding strands of said cDNA, under conditions suitable for said primers to hybridize to said genomic sequence;
b) conducting template dependent sequencing of said genomic sequence of interest using said hybridized primers;
c) comparing the sequence obtained in b) with the sequence of said cDNA, wherein sequence of b) that is not found in the sequence of said cDNA is exon adjacent sequence.

15. The method of Claim 14, wherein steps a), b) and c) are repeated with additional primers.

16. The method of Claim 14, wherein the primers hybridize to said cDNA at about 300 base intervals.

17. The method of Claim 16, wherein said primers hybridize to a coding and non-coding strand of said cDNA at about 200 base intervals.

18. The method of Claim 17, wherein said primers hybridize to a coding and non-coding strand of said cDNA at about 100 base intervals.

19. A method of identifying the exon-intron boundaries and 5' and 3' untranslated regions of a gene wherein all or a portion of the genomic sequence is unknown and the cDNA sequence is known, comprising the steps of:
a) designing primers based on the corresponding cDNA sequence of the gene, wherein the primers comprise about 5 to about 50 nucleotides and wherein said primers hybridize to the cDNA at evenly spaced intervals of about 100 to about 300 nucleotides;
b) sequencing the gene using the gene as a template and the evenly spaced primers designed in step a);
c) analyzing the sequences obtained in step b) using a sequence alignment program to determine newly obtained sequences and to further determine regions where the cDNA and the newly obtained sequences differ, wherein the exon-intron boundaries and 5' and 3' untranslated regions comprise regions where the cDNA and newly obtained sequences differ.

20. The method of Claim 19, wherein step a) the primers are evenly spaced along both strands of the cDNA.

21. The method of Claim 19, wherein steps a), b) and c) are repeated with additional primers.

22. A composition comprising isolated polynucleic acid selected from the group consisting of: SEQ ID NO: 61, SEQ ID NO: 63, SEQ ID NO: 65, SEQ ID NO: 67, SEQ ID NO: 69, SEQ ID NO: 71, SEQ ID NO: 73, SEQ ID NO: 75, SEQ ID NO: 77, SEQ ID NO: 79, SEQ ID NO: 81, SEQ ID NO: 84, SEQ ID NO: 86, SEQ ID NO: 89 and SEQ ID NO: 91.

23. A composition comprising isolated polynucleic acid comprising SEQ ID NO: 59.