
WO2004005547A2 - Method for identifying hypersensitive site consensus sequences - Google Patents

Method for identifying hypersensitive site consensus sequences

Info

Publication number
WO2004005547A2
Authority
WO
WIPO (PCT)
Prior art keywords
sequences
sequence
consensus sequences
chromatin
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/GB2003/002895
Other languages
French (fr)
Other versions
WO2004005547A3 (en)
Inventor
Robert Otto Johannes Weinzierl
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ip2ipo Innovations Ltd
Original Assignee
Imperial College Innovations Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GB0215547A external-priority patent/GB0215547D0/en
Priority claimed from PCT/GB2002/003080 external-priority patent/WO2003004702A2/en
Application filed by Imperial College Innovations Ltd filed Critical Imperial College Innovations Ltd
Priority to AU2003281288A priority Critical patent/AU2003281288A1/en
Publication of WO2004005547A2 publication Critical patent/WO2004005547A2/en
Publication of WO2004005547A3 publication Critical patent/WO2004005547A3/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current


Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6813: Hybridisation assays
    • C12Q1/6827: Hybridisation assays for detection of mutation or polymorphism
    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61P: SPECIFIC THERAPEUTIC ACTIVITY OF CHEMICAL COMPOUNDS OR MEDICINAL PREPARATIONS
    • A61P43/00: Drugs for specific purposes, not provided for in groups A61P1/00-A61P41/00
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • the present invention relates to methods, their uses and products obtained therefrom.
  • the present invention relates to Hypersensitive Sites (HSs) and methods for identifying HS consensus sequences, HS sequences and HS core sequences.
  • HSs Hypersensitive Sites
  • the large amount of DNA present in eukaryotic cells needs to be efficiently stored into a small space, the cell nucleus. This is achieved by packaging DNA molecules into chromatin, which involves looping DNA molecules around histones to create nucleosomal DNA-protein complexes. Subsequent coiling of these nucleosomal complexes into solenoid and higher order structures increases the packaging density further.
  • nuclease hypersensitive sites are genomic regions that are up to two orders of magnitude more accessible to nuclease digestion in purified nuclei preparations in comparison to bulk nuclear DNA (Nedospasov and Georgiev, 1980; Wu, 1980).
  • HSs are highly specific localized DNA access points for a variety of factors involved in transcription, replication, repair, recombination and attachment to the nuclear matrix.
  • Some HSs are permanently present ('constitutive' HSs), whereas other HSs are only formed in response to specific endogenous or exogenous stimuli ('regulated' HSs). With the advent of extensive sequence data from a variety of eukaryotic genome projects there is renewed interest in bioinformatic tools suitable for identifying functional regulatory elements on the DNA sequence level.
  • the present invention relates to novel and useful aspects concerning HSs.
  • the present invention is based upon the surprising finding that HS consensus sequences can be identified from a plurality of HSs.
  • the present invention relates to HS consensus sequences derived from a plurality of HS sequences - such as HS core sequences.
  • HS consensus sequences may be used to allow the prediction of other HS sequences using bioinformatic tools rather than using exclusively experimental tools.
  • the availability of large portions of the human genome sequence presents an opportunity for identifying, mapping and analysing HSs on a large scale using computational approaches.
  • the present invention relates to a method for identifying one or more HS consensus sequences comprising the steps of: (a) providing a plurality of HS core sequences; (b) using a search algorithm to search for a plurality of motifs that are shared by the HS core sequences; and (c) returning one or more HS consensus sequences comprising a plurality of motifs identified in step (b).
  • the method according to the first aspect may be implemented in a variety of ways.
  • the principle is that the availability of a plurality of HS core sequences, which may have been identified conventionally using known experimental tools (or even previous bioinformatic tools), can be used to generate large numbers of HS core sequences (the numbers of which will increase in the future) allowing the definition and extraction of sequence-based rules that can be used to identify other sites in genomes that also fulfil these rules and are therefore candidate HSs.
  • the search algorithm includes a statistical model such as a Gibbs-statistical model, a Markov-statistical model, a Gaussian-statistical model, a Poisson-statistical model and a Monte Carlo-statistical model.
  • the search algorithm comprises a word counting method or a probabilistic method.
  • the HS consensus sequences are returned as a regular expression or a sequence logo. More preferably, the HS consensus sequences are returned as a weight matrix. Most preferably, the weight matrix is a position specific scoring matrix (PSSM).
  • PSSM position specific scoring matrix
  • returning the HS consensus sequences as a PSSM comprises the step of computing a score for finding a matching sequence in the plurality of HS consensus sequences.
  • the plurality of HS core sequences are identified using Global Analysis of Chromatin Topology or Hypergenomic Display.
  • the present invention relates to a method for identifying one or more HS sequences comprising the steps of: (a) identifying a plurality of HS core sequences; (b) using a search algorithm to search for a plurality of motifs that are shared by the plurality of HS core sequences; (c) returning one or more HS consensus sequences comprising a plurality of the motifs identified in step (b); and (d) searching for one or more HS sequences comprising one or more HS consensus sequences.
  • step (c) returns the HS consensus sequences as a PSSM comprising the steps of: (i) providing a plurality of HS consensus sequences; (ii) computing the score for finding a matching sequence in the plurality of HS consensus sequences; and (iii) identifying HS sequences in one or more DNA sequences that were not part of the plurality of HS core sequences using the PSSMs identified.
  • the search algorithm is used in a word counting method or a probabilistic method.
  • the search algorithm includes a statistical model such as a Gibbs-statistical model, a Markov-statistical model, a Gaussian-statistical model, a Poisson-statistical model and a Monte Carlo-statistical model.
  • the HS consensus sequences are returned as a regular expression or a sequence logo. More preferably, the HS consensus sequences are returned as a weight matrix. Most preferably, the weight matrix is a position specific scoring matrix (PSSM).
  • returning the HS consensus sequences as a PSSM comprises the step of computing a score for finding a matching sequence in the plurality of HS consensus sequences.
  • the plurality of HS core sequences are identified using Global Analysis of Chromatin Topology or Hypergenomic Display.
  • the DNA sequences are from a database of DNA sequences.
  • one or more HS sequences comprising the HS consensus sequences are searched by searching for clusters of cis-elements.
  • the most probable arrangement of cis-elements in the cluster is integrated using the Viterbi algorithm.
  • a forward-backward algorithm to consider the sum of all paths through a hidden Markov model is used.
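  • The Viterbi step mentioned above can be illustrated with a minimal sketch. The two-state model ("bg" for background, "cluster" for a cis-element cluster), the observation symbols ("hit", "none") and all probabilities below are hypothetical values chosen for illustration; they are not taken from the present disclosure or from any particular program.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its log-probability."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

# Hypothetical model: windows with a motif hit are more likely inside a cluster
states = ("bg", "cluster")
start_p = {"bg": 0.9, "cluster": 0.1}
trans_p = {"bg": {"bg": 0.9, "cluster": 0.1},
           "cluster": {"bg": 0.2, "cluster": 0.8}}
emit_p = {"bg": {"hit": 0.1, "none": 0.9},
          "cluster": {"hit": 0.7, "none": 0.3}}

obs = ["none", "none", "hit", "hit", "hit", "none", "none"]
path, log_prob = viterbi(obs, states, start_p, trans_p, emit_p)
```

  • A forward-backward computation would replace the `max` over predecessor states with a sum, giving the total probability over all paths rather than the single best path.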
  • the plurality of HS core sequences are identified using Global Analysis of Chromatin Topology or Hypergenomic Display.
  • the present invention relates to a method for identifying an HS core sequence comprising the steps of: (a) providing a DNA sequence in the sense or antisense orientation that is not part of the plurality of HS core sequences; (b) providing an HS sequence; and (c) searching the DNA sequence for the presence of a hypersensitive restriction site.
  • the HS sequence is between about 50 nucleotides to about 200 nucleotides in length.
  • the DNA sequence is 1 kb in length.
  • the method for identifying an HS core sequence comprises the additional step of using the identified HS consensus sequences or HS sequences to prepare a nucleic acid construct.
  • the methods according to the present invention comprise the additional step of using the identified HS consensus sequences or HS sequences in an assay (or assay development program) and/or a pharmaceutical (or in the preparation of or development of a pharmaceutical).
  • the present invention relates to a method of treating a disease associated with chromatin structure in a subject, the method comprising administering to the subject an effective amount of a chromatin modulating (e.g. modifying) agent capable of modulating (e.g. modifying) the chromatin structure to a non-diseased form.
  • the present invention relates to a pharmaceutical composition comprising a chromatin modulating agent and a pharmaceutically acceptable carrier, diluent, excipient or adjuvant or any combination thereof.
  • the present invention relates to a method of preventing and/or treating a disorder comprising administering a chromatin modulating agent wherein said chromatin modulating agent is capable of modulating an HS to cause a beneficial preventative and/or therapeutic effect.
  • the present invention relates to the use of a chromatin modulating agent in the preparation of a pharmaceutical composition for the treatment of an HS related disorder.
  • the present invention relates to one or more HS consensus sequences identifiable, preferably identified using the methods of the present invention or a variant, derivative, or homologue thereof.
  • the present invention relates to an HS sequence identifiable, preferably identified using the methods of the present invention or a variant, derivative, or homologue thereof.
  • the present invention relates to weight matrices identifiable, preferably identified using the methods of the present invention. More preferably, the weight matrices are PSSMs.
  • the present invention relates to a recording medium bearing machine readable instructions for implementing the first to the third aspects of the invention.
  • the present invention relates to a computer system loaded with machine readable instructions for implementing the first to the third aspects of the invention
  • Figure 1 is a diagrammatic representation of a HS core sequence comprising 100 nucleotides of genomic sequence immediately adjacent to a hypersensitive Mbo I target site (204 bp in total).
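  • As a rough illustration of the geometry described for Figure 1, the sketch below cuts out a 4 bp site together with 100 nucleotides on each side, giving 204 bp. Reading "204 bp in total" as 100 + 4 + 100 is an assumption, and the function names `find_sites` and `core_around_site` are hypothetical. Which GATC sites are actually hypersensitive has to come from experimental mapping; the code only handles the sequence bookkeeping.

```python
def find_sites(genome, site="GATC"):
    """Start positions of every (possibly overlapping) occurrence of `site`."""
    positions = []
    start = genome.find(site)
    while start != -1:
        positions.append(start)
        start = genome.find(site, start + 1)
    return positions

def core_around_site(genome, site_start, site_length=4, flank=100):
    """Return the site plus `flank` bases on each side, or None near the ends."""
    left = site_start - flank
    right = site_start + site_length + flank
    if left < 0 or right > len(genome):
        return None  # not enough flanking sequence available
    return genome[left:right]

# Toy sequence with a single GATC site at position 120
genome = "A" * 120 + "GATC" + "C" * 120
core = core_around_site(genome, find_sites(genome)[0])
```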
  • Figure 2 is a diagrammatic representation of the identification of a HS sequence in Human β-globin constitutive HS5 (Genbank Accession No. AF064190). A single strong signal, as indicated by a distinct peak of predicted HS potential centered around position 6,200 in the nucleotide sequence, is detected which coincides precisely with the experimentally mapped constitutive HS5 (Dhar et al., 1990).
  • Figure 3 is a diagrammatic representation of the identification of HS sequences in Mouse mammary tumour virus 3' long terminal repeat (MMTV-3' LTR; Genbank Accession No. MMTPRO). Both experimentally mapped HSs, including a constitutive and a glucocorticoid-inducible site, are reliably detected as indicated by distinct peaks of predicted HS potential centered around positions 200 and around 1300 in the nucleotide sequence (Zaret and Yamamoto, 1984).
  • Figure 4 is a diagrammatic representation of the identification of HS sequences in Human vascular endothelial growth factor A promoter (Genbank Accession No. AF005785). A strong signal is detected at position 2600. Experimental data shows the presence of two HSs in this area (Liu et al, 2001). The presence of a single broad peak suggests that in some cases the clustering algorithm of CISTER causes artifactual merging of motif clusters from adjacent HSs. Also, two other experimentally mapped sites are only weakly detected.
  • Figure 5 is a diagrammatic representation of the identification of HS sequences in Human erythropoietin (embedded in 13 kb of human genome sequence). Strong signals are detected from an experimentally mapped regulated 5' located HS and from two HSs located at the 3' end of the gene (Zhang et al, 2000). Some merging of the predicted HS signals from the two separate 3' HSs is observed in the computer prediction due to the CISTER algorithm. The program also shows a HS signal within the transcribed region, which is compatible with the experimentally observed emergence of hypersensitivity of the gene at the onset of active expression.
  • Figure 6 is a diagrammatic representation of the identification of HS sequences in human c-Myc (embedded in 55 kb of human genome sequence). This is the largest region yet analysed. HSs surrounding the 5' and 3' end of the c-Myc gene (Mautner et al, 1995) are reliably detected. There are some additional strong signals in the surrounding regions for which no experimental data is currently available. This indicates the high signal/noise ratio achievable with the current set-up.
  • Figure 7 is a diagrammatic representation of the result of a computer-based experiment.
  • the DNA sequence tested consists of a continuous string of the 59 HS core sequences (shown in blue or light grey and dark grey), preceded by the same sequence randomised (shown in red or medium grey). This procedure therefore creates a test sequence (TestSeq) which contains two halves: the first half is random (and thus should lack HS-specific motifs), whereas the other half is 'packed' with all the HS sequences.
  • PSSMs were compiled from two non-overlapping subsets of the HS core sequences: motifs were separately derived from 'collection A' (shown in light blue or light grey) and 'collection B' (shown in dark blue or dark grey).
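  • The construction of such a test sequence can be sketched as follows. The source does not say how the randomisation or the split into collections A and B was performed, so shuffling individual bases and alternating assignment are assumptions, and `build_test_sequence` and `split_collections` are hypothetical names.

```python
import random

def build_test_sequence(core_sequences, seed=0):
    """Concatenate the cores and prepend a shuffled copy of the same string."""
    packed = "".join(core_sequences)
    letters = list(packed)
    # Shuffling keeps the base composition but destroys motif order
    random.Random(seed).shuffle(letters)
    randomised = "".join(letters)
    return randomised + packed

def split_collections(core_sequences):
    """Split the cores into two non-overlapping subsets (alternating order)."""
    return core_sequences[0::2], core_sequences[1::2]

cores = ["GATCGATC", "AACCGGTT", "TTGACTCA"]
test_seq = build_test_sequence(cores)
collection_a, collection_b = split_collections(cores)
```

  • Motifs derived from one collection can then be scored against the whole test sequence; a useful motif set should give signal in the second half only, including over cores that belong to the other collection.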
  • FIG. 8 schematically illustrates a general purpose computer (132) of the type that may be used to perform the methods in accordance with the present invention.
  • the computer (132) includes a central processing unit (134), a read only memory (136), a random access memory (138), a hard disk drive (140), a display driver (142) and display (144), and a user input/output circuit (146) with a keyboard (148) and mouse (150).
  • the central processing unit (134) may execute program instructions stored within the ROM (136), the RAM (138) or the hard disk drive (140) to carry out processing of signal values that may be stored within the RAM (138) or the hard disk drive (140).
  • the program may be written in a wide variety of different programming languages.
  • the computer program itself may be stored and distributed on a recording medium, such as a compact disc, or may be downloaded over a network link (not illustrated).
  • the general purpose computer (132) when operating under control of an appropriate computer program effectively forms an apparatus for performing aspects of the present invention - such as identifying one or more HS consensus sequences, HS sequences and HS core sequences.
  • Figure 9 is a Table listing SEQ ID No. 3 to 55.
  • Figure 10 is a Table listing HS consensus sequences.
  • Figure 11 is a Table listing HS consensus sequences, which are shown in bold. Instead of a more precise PSSM they are written in code to indicate redundancies in certain positions.
  • Figure 12 is a Table listing YEBIS PSSMs.
  • Figure 13 is a Table listing YEBIS-MATRIX PSSMs.
  • HYPERSENSITIVE SITES
  • HSs Nuclease Hypersensitive Sites
  • HSs are genomic sites that are highly susceptible to nuclease attack under experimental conditions - typically by approximately two orders of magnitude as compared to bulk chromatin (see Stalder et al, 1980; Wu, 1980). All available data suggests that HSs are mostly free of nucleosomes, but contain a number of transcription factor complexes that are bound to specific sequence motifs present in the genomic DNA.
  • HSs can be viewed as the gateways to the genome for the vast majority of molecules involved in regulating gene expression and many other important genomic functions, such as DNA replication, repair, recombination and insertion of retroviral genomes (reviewed in Gross and Garrard, 1988). They expose or hide gene regulatory signals and therefore constitute one of the most important epigenetic regulatory layers that are superimposed on the genome to control and direct its expression (Bonifer 2000).
  • HSs can be present in a number of forms - such as constitutive HSs, developmentally regulated HSs, tissue-specific HSs and cell type-specific HSs.
  • Such constitutive HSs could, in addition to regulating the expression of adjacent genes, serve as border elements to define functional chromatin domains, or could facilitate the precise folding patterns of individual chromatin fibres (Filipsky et al, 1990).
  • the continuous reconfiguration of chromatin architecture is an essential prerequisite for directing the changing gene expression patterns during embryonic development and cell type-specific differentiation.
  • Many HSs - such as developmentally regulated HSs - are created near defined subsets of genes in a tissue- and stage-specific manner (see e.g. Gross and Garrard, 1988) due to the local activity of transcription factors and chromatin remodeling machineries (reviewed in Wolffe and Hayes, 1999).
  • HSs near genes are one of several steps in the pathway that prepares a regulatory sequence to become functionally active in chromatin.
  • One of the best-understood model systems is the chicken lysozyme gene, where the HS configuration on its promoter has been shown to be highly dynamic.
  • Several distinct HSs appear and disappear over different promoter elements as cells progress through haemopoietic development (Huber et al., 1995; see Kontaraki et al., 2000). In many cases, a direct correlation between the appearance and disappearance of HSs and known biological functions has been shown.
  • HS consensus sequence refers to a plurality of motifs, i.e. nucleotides that are common, although not necessarily identical, to other nucleotides in an HS sequence.
  • an HS consensus sequence is an idealised sequence that represents the most likely motif to occur at each position within an HS sequence.
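  • One simple way to picture such an idealised sequence is to take the most frequent base in each column of a set of aligned motif instances. This is a minimal sketch; the example sites are hypothetical and the majority rule is only one of several possible ways to derive a consensus.

```python
from collections import Counter

def majority_consensus(aligned_sites):
    """Return the most frequent base in each column of equal-length sites."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    consensus = []
    for column in zip(*aligned_sites):
        # most_common(1) gives the (base, count) pair with the highest count
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)

# Hypothetical aligned motif instances
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
consensus = majority_consensus(sites)
```

  • A plain consensus of this kind discards how variable each position is, which is the information a weight matrix retains.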
  • a plurality of motifs refers to at least about 2 to 200 or more motifs; more preferably, at least about 2 to 100 or more motifs; more preferably, at least about 2 to 50 or more motifs; more preferably, at least about 2 to 20 or more motifs; more preferably, at least about 6, 7, 8, 9, 10, 11, 12, 13, 14 or 15 motifs; most preferably, 7, 8, 12, 13 or 15 motifs; or any suitable combination of start or end points, for example, at least about 6 to 50 or more motifs.
  • the present invention demonstrates that it is possible to identify and extract HS consensus sequences comprising motifs that are shared by different HS sequences. There may be different functional types of HSs that may contain different sets of shared HS consensus sequences. It is possible to identify HS consensus sequences in other DNA sequence as a bioinformatic tool for the in silico prediction of HSs in these sequences in the absence of experimentally derived information.
  • HS core sequence refers to motifs, i.e. nucleotides that are typically within about 100 to 200 base pairs of a hypersensitive target site. Motifs may be fragmented at hypersensitive target sites by various entities including chemical or physical agents such as bleomycin, bromoacetaldehyde, chloracetaldehyde, cobalt chiral complex, copper phenanthroline, diethyl pyrocarbonate, dimethyl sulfate, iron(II)-EDTA, methidiumpropyl-EDTA, neocarzinostatin, psoralen and ultraviolet light.
  • the entity may be an enzyme such as a sequence specific nuclease, a non-sequence specific nuclease, Bal-31, DNase I, DNase II, an endogenous nuclease, exonuclease III, lambda exonuclease, micrococcal nuclease, mung bean nuclease, Neurospora crassa nuclease, a restriction enzyme including type I, II and III restriction enzymes, S1 nuclease or a topoisomerase such as topoisomerase I or II.
  • the entity is an enzyme. More preferably, the entity is a restriction enzyme.
  • the restriction enzyme recognises at least a 4 base pair (bp) target sequence.
  • the restriction enzyme is selected from the group consisting of DpnII, MboI (Figure 1), NlaIII, SauIIIA and Tsp509I.
  • the methods of the present invention may involve the use of one or more such entities.
  • two different entities may be used - such as a restriction enzyme that recognises a 4 bp target sequence and a restriction enzyme that recognises a 6 bp target sequence.
  • a "plurality of HS core sequences" refers to at least about 3 HS core sequences and preferably is selected from the group comprising: about at least 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 HS core sequences; about at least 21-100 HS core sequences; about at least 101-1000 HS core sequences; about at least 1001 to 5000 HS core sequences; about at least 5001 to 10000 HS core sequences; about at least 10001 to 50000 HS core sequences; and about at least 50001 to 100,000 HS core sequences; or any suitable combination of start or end points, for example, at least about 15 to about 100,000 HS core sequences.
  • the HS core sequence may comprise 200 or more nucleotides.
  • the HS core sequence may comprise 199 or less nucleotides. Whilst the reduction in the size definition of the HS core sequence may not have any effect on the validity of the approach described here it may reduce the amount of 'background noise' in the motif-extraction step.
  • the size of the HS core sequence may be optimised depending on the size of the HS data set and the amount of background noise that is detected.
  • the HS core sequence may even comprise 150 or less nucleotides, 100 or less nucleotides, 50 or less nucleotides or even 25 or less nucleotides.
  • the present invention relates to a method for identifying one or more HS consensus sequences comprising the steps of: (a) providing a plurality of HS core sequences; (b) using a search algorithm to search for a plurality of motifs that are shared by the HS core sequences or subsets thereof; and (c) returning one or more HS consensus sequences comprising a plurality of motifs identified in step (b).
  • Hypertag Display
  • The principle of Hypertag Display is as follows: DNA present in HSs is selectively cut with the restriction enzyme Mbo I, recognizing the 4 bp target sequence 5'GATC3'. Ligation of a compatible BamH I adapter molecule to the cleaved ends results in the selective tagging of each cleaved Mbo I site with a fragment of predetermined and known sequence. The tagged fragment is subsequently amplified by PCR using an oligonucleotide complementary to the adapter molecule and a second oligonucleotide (the 'Hypertag' primer) that is complementary to a sequence located next to a previously mapped HS. The required local sequence information can be derived from data obtained through the HS library approach.
  • This step covalently joins the DNA sequences adjacent to the MboI cleavage site to the plasmid DNA, but does not yet result in the formation of a functional recombinant DNA molecule; the other end of the genomic fragment will usually be tens of kilobases away, may also be randomly sheared during the genomic DNA extraction step and will thus not be suitable for specifically joining the other BamHI site in the linearised plasmid.
  • the ligation mixture is cut to completion with EcoRI. This enzyme cuts at a defined site within the polylinker of the plasmid vector and also cuts every target site present in the genomic DNA.
  • this step creates the condition for specifically joining the other end of the construct through intramolecular ligation. Due to the random orientation of the plasmid vector relative to the MboI fragment during the first ligation approximately 50% of the clones are lost at this stage, but the other 50% of ligation products will contain specifically cloned MboI-EcoRI genomic fragments. Transfection of the ligation products results in the creation of a library of genomic DNA fragments derived from a large variety of HSs. Determination of the insert sequence adjacent to the BamHI-MboI junction of each clone establishes, after a search against human genome databases, the precise genomic location of the MboI site and thus allows the positioning of a specific HS surrounding it.
  • Various search algorithms may be used to search a plurality of motifs that are shared by HS core sequences.
  • Several methods to search for over-represented motifs in the upstream region of a set of coregulated genes have been developed and tested as described by Ohler & Niemann (2001) Trends Genet. 17, 56-60. These methods can be divided into two different types: (1) methods based on word counting and (2) methods based on probabilistic sequence models.
  • Word counting methods are based on the frequency analysis of oligonucleotides in sequences and overrepresentation is measured by comparing the counted number of occurrences of a word to the expected number of occurrences. A common motif is then compiled by grouping similar words. Word counting methods have been described by Jensen & Knudsen (2001) Bioinformatics 16, 326-333 and Van Helden et al. (1998) J. Mol. Biol. 281, 827-842.
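  • The word counting idea can be sketched as below. The expected count uses an independent-base background estimated from the input itself, and the ratio threshold `min_ratio` is an arbitrary illustrative choice; published word counting methods use more careful statistics, and the grouping of similar words into a common motif is not shown.

```python
from collections import Counter

def count_words(sequences, k):
    """Count every overlapping word of length k in the sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def expected_count(word, sequences, base_freq):
    """Expected occurrences under an independent-base (order-0) background."""
    p = 1.0
    for base in word:
        p *= base_freq[base]
    positions = sum(max(len(seq) - len(word) + 1, 0) for seq in sequences)
    return p * positions

def overrepresented(sequences, k, min_ratio=2.0):
    """Words whose observed/expected ratio is at least min_ratio, best first."""
    totals = Counter("".join(sequences))
    n = sum(totals.values())
    base_freq = {base: totals[base] / n for base in totals}
    result = []
    for word, observed in count_words(sequences, k).items():
        expected = expected_count(word, sequences, base_freq)
        if observed / expected >= min_ratio:
            result.append((word, observed, expected))
    return sorted(result, key=lambda r: r[1] / r[2], reverse=True)

hits = overrepresented(["GATCGATCGATC"], k=4)
```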
  • the motif model is represented as a position probability matrix and the motif is assumed to be hidden in a noisy background sequence.
  • maximum likelihood estimation is used.
  • the most frequent methods used for this are Expectation Maximisation (a maximum likelihood algorithm for estimating the parameters of a probabilistic model) and Gibbs Sampling (a stochastic equivalent of Expectation Maximisation).
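  • A bare-bones Gibbs sampler for a single motif, with one occurrence assumed per sequence and a uniform background, is sketched below to show the principle only. It is not the algorithm of any particular program; the function names, pseudocount and iteration count are hypothetical, and at least two sequences containing only A, C, G and T are assumed.

```python
import random

BASES = "ACGT"

def profile_from(sites, pseudocount=1.0):
    """Column-wise base probabilities (with pseudocounts) for aligned sites."""
    profile = []
    for j in range(len(sites[0])):
        column = [site[j] for site in sites]
        total = len(column) + 4 * pseudocount
        profile.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return profile

def site_probability(site, profile):
    """Probability of a candidate site under the position probability matrix."""
    p = 1.0
    for j, base in enumerate(site):
        p *= profile[j][base]
    return p

def gibbs_motif_search(sequences, width, iterations=500, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        i = rng.randrange(len(sequences))  # sequence whose site is resampled
        others = [s[p:p + width]
                  for k, (s, p) in enumerate(zip(sequences, positions)) if k != i]
        profile = profile_from(others)
        seq = sequences[i]
        weights = [site_probability(seq[p:p + width], profile)
                   for p in range(len(seq) - width + 1)]
        # Sample the new start in proportion to its probability under the profile
        positions[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions

seqs = ["ACGTTGACTCAGGA", "TTTGACTCATCCGA", "GGCATGAGTCAAAC", "CATGACTCAGTTTT"]
starts = gibbs_motif_search(seqs, width=7)
```

  • Expectation Maximisation would instead update the matrix deterministically from the weighted average over all candidate positions, rather than sampling one position at a time.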
  • the search algorithm may include a statistical model such as a Gibbs-statistical model, a Markov-statistical model, a Gaussian-statistical model, a Poisson-statistical model or a Monte Carlo-statistical model.
  • HS consensus sequences may be identified using YEBIS or MOTIFSAMPLER.
  • YEBIS (Yada et al, 1998) is available at www-scc.jst.go.jp/YEBIS/MotifExtraction. This program is capable of extracting a set of sequence motifs without any a priori knowledge from a number of related DNA sequences.
  • YEBIS uses an algorithm based upon a Markov statistical model and may be applied to a large number of unaligned sequences.
  • MOTIFSAMPLER (Thijs et al, 2001), www.esat.kuleuven.ac.be/~thijs/Work/MotifSampler.html. This software package tries to find over-represented motifs in the upstream region of a set of co-regulated genes.
  • This motif finding algorithm uses Gibbs sampling to find the position probability matrix that represents the motif. Higher-order background models are used to improve the robustness of the motif finding.
  • the Motif Sampler comes with background models for several organisms but is also suitable for other organisms since the background model can also be calculated from the input sequences. This programme differs from YEBIS because the length of the motifs and number of detected motifs can be entered as part of the search criteria.
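  • To illustrate what calculating a background model from the input sequences can mean, the sketch below estimates an order-k Markov model, i.e. the probability of each base given the preceding k bases. The function name, the pseudocount and the dictionary layout are hypothetical and do not reproduce MotifSampler's own file format or estimation procedure. Sequences are assumed to contain only A, C, G and T.

```python
from collections import defaultdict

def markov_background(sequences, order=1, pseudocount=1.0, alphabet="ACGT"):
    """Estimate P(next base | previous `order` bases) from the sequences."""
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 0.0))
    for seq in sequences:
        for i in range(order, len(seq)):
            context, nxt = seq[i - order:i], seq[i]
            counts[context][nxt] += 1
    model = {}
    for context, row in counts.items():
        total = sum(row.values()) + pseudocount * len(alphabet)
        model[context] = {b: (row[b] + pseudocount) / total for b in alphabet}
    return model

background = markov_background(["ACGTACGTTGCA", "TTGACTCAGGAC"], order=2)
```

  • Only contexts that occur in the input appear in the returned model; a higher order captures more local sequence composition but needs more data to estimate reliably.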
  • MotifSampler requires four search parameters, including motif length and copy number.
  • the HS core sequences were analysed by specifying the expected lengths of motifs as 8, 12 and 15 in three independent runs. Consensus sequences shared by different members of the HS core sequences were successfully identified. For some aspects of the present invention, MOTIFSAMPLER motifs of length 12 are preferred.
  • both YEBIS and MotifSampler are applications for searching for motifs, such as those that may be characteristic of HS sequences.
  • the motifs are extracted and can be sorted into groups.
  • An algorithm including a statistical model is applied and a matrix is returned that is derived by scoring the motifs at each position; this matrix can be used to define the variability between motifs.
  • the plurality of HS core sequences may be aligned prior to searching such that correspondences are assigned to preserve the order of the residues within the HS core sequences by identifying a start point, and if necessary introducing gaps.
  • Step (c): returning one or more HS consensus sequences comprising a plurality of motifs identified in step (b).
  • the HS consensus sequences are returned as a regular expression, or a sequence logo, i.e. a graphic method of illustrating consensus information comprising coloured letters of different sizes, where the letters indicate different proportions of motifs.
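  • A consensus written with IUPAC redundancy codes can be turned into a regular expression and searched directly, as in the following sketch. The IUPAC code table is standard; the consensus `TGASTCA` and the test sequence are hypothetical examples, and `find_matches` is a hypothetical helper name.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def find_matches(consensus, sequence):
    """Return (start, matched text) for every match, including overlaps."""
    body = "".join(IUPAC[code] for code in consensus.upper())
    # A lookahead with a capture group reports overlapping matches too
    pattern = re.compile("(?=(" + body + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]

matches = find_matches("TGASTCA", "AATGACTCATTGAGTCAG")
```

  • A regular expression gives a yes/no answer for each position, so a sequence that differs at one redundant position is either fully accepted or fully rejected.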
  • the HS consensus sequences are returned as a weight matrix.
  • Background teachings on weight matrices have been presented by Frech et al. (1997). The following information concerning weight matrices has been extracted from that source:
  • a weight matrix uses the complete composition of nucleotides for each position of an alignment to achieve a more differentiated rating of a matching sequence. For example, a single position of an alignment of 12 sequences containing TTTTTTTAAACC (each letter representing one sequence at this position) would be assigned T in the IUPAC consensi. A new sequence with a T at this position would be considered a match while an A at the same place would cause the whole sequence to be dismissed as no match. Even a simple nucleotide distribution matrix would assign a weight score (in this case proportional to the percentage of the nucleotide) of 0.58 to the T and still 0.25 to an A. Thus, weight matrices represent the similarity of the tested sequence to all of the sequences in the alignment much better than
  • IUPAC consensi Most weight matrix-based methods add some more weighting by comparison of the actual nucleotide distribution with random values or by other statistical measures eg. information content.
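The arithmetic of the single-position example above can be reproduced with a short sketch. The function name `column_weights` is an assumption made for illustration; the weights are simply the fraction of sequences carrying each nucleotide at that position.

```python
from collections import Counter


def column_weights(column):
    """Return the fraction of each nucleotide in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return {base: counts.get(base, 0) / total for base in "ACGT"}


# The 12-sequence column used in the example above.
weights = column_weights("TTTTTTTAAACC")
```

Here T receives 7/12 (about 0.58) and A receives 3/12 (0.25), matching the figures quoted from the source.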
  • Comput. Appl. Biosci. 11, 563-566 uses a matrix library containing more than 200 matrices.
  • MatInspector (Ghosh (1993) Nucleic Acid Res. 21, 3117-3118) allows testing of individually selected matrices. ConsInspector (Kondrakhin et al. (1995) Comput. Appl. Biosci. 11, 477-
  • Weight matrices are advantageously used in the present invention because they are much less sensitive to sequence selection and provide a quantitative score. Even a single mismatch at a critical position will reduce the score of the match.
  • the weight matrix is a position specific scoring matrix (PSSM).
  • PSSMs as described by Frech et al. (1997) use the complete composition of nucleotides for each position of the alignment to achieve a more differentiated rating of a matching sequence.
  • the HS consensus sequences are returned as a PSSM comprising the steps of: (a) computing a score for finding a matching sequence in the plurality of HS consensus sequences; and (b) returning the variability in the plurality of HS consensus sequences as a PSSM.
  • HS consensus sequences may be returned as a PSSM using various methods known in the art such as E-matrix maker (Thomas et al. (1999) Journal of Computational Biology 6: 219-235, 1999; Thomas et al. Bioinformatics 16: 233-244, 2000) which is available at http://motif.stanford.edu/ematrix-maker
  • the present invention relates to a method for identifying an HS sequence comprising the steps of: (a) identifying a plurality of HS core sequences; (b) using a search algorithm to search for a plurality of motifs that are shared by the plurality of HS core sequences; (c) returning HS consensus sequences comprising a plurality of the motifs identified in step (b); and (d) searching for an HS sequence comprising one or more HS consensus sequences.
  • HS sequences comprising HS consensus sequences may be searched using various methods known in the art.
  • DNA sequences that are not part of the plurality of HS core sequences are searched for the presence of HS consensus sequences by searching for clusters of cis-elements.
  • the most probable arrangement of cis-elements in the cluster is integrated using the Viterbi algorithm.
  • Cluster Of Motifs E-value Tool (COMET; http://zlab.bu.edu/~mfrith/comet/form.html)
  • COMET assigns a positive score to each motif using the standard method of log likelihood ratios, and subtracts a 'gap penalty' linearly proportional to the distances between motifs.
  • each motif cluster receives a score, which is higher if the individual motifs are stronger, but lower if they are further apart.
  • the scoring scheme corresponds to a log likelihood ratio of explaining the data given a cluster model versus a background model.
  • the cluster model is for cis-elements to occur in a uniform distribution, with some intensity, whereas the background model consists of random nucleotides.
  • the gap penalty corresponds in a one-to-one fashion with the intensity parameter of the cluster model.
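The scoring scheme described above (positive motif scores minus a gap penalty that grows linearly with the distance between motifs) can be sketched as follows. This is not COMET's implementation: the function name `cluster_score`, the half-open `(start, end, score)` hit format and the example numbers are all assumptions made for illustration, and the motif scores are taken as already computed log likelihood ratios.

```python
def cluster_score(hits, gap_penalty):
    """Score a cluster of non-overlapping motif hits.

    hits: list of (start, end, score) tuples, with `end` exclusive.
    gap_penalty: penalty per base pair separating adjacent motifs.
    """
    ordered = sorted(hits)
    total = sum(score for _, _, score in ordered)
    # Subtract a penalty proportional to each gap between neighbouring hits.
    for (_, prev_end, _), (next_start, _, _) in zip(ordered, ordered[1:]):
        total -= gap_penalty * max(0, next_start - prev_end)
    return total


# Two hypothetical hits 12 bp apart, then the same hits 112 bp apart.
close_cluster = cluster_score([(10, 18, 6.0), (30, 38, 5.0)], gap_penalty=0.25)
distant_cluster = cluster_score([(10, 18, 6.0), (130, 138, 5.0)], gap_penalty=0.25)
```

With identical motif scores, the closely spaced pair scores higher than the widely spaced pair, which is the behaviour described for the cluster score.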
  • a forward-backward algorithm to consider the sum of all paths through a hidden Markov model is used.
  • CISTER (Frith et al., 2001) detects cis-element clusters by using a statistical model (a hidden Markov model) of what it expects these clusters to look like.
  • the parameters allow the user to vary some aspects of the model, and it is quite possible that different model parameters are suitable for different types of motif cluster.
  • Parameters include (i) the distance between neighbouring cis-elements within a cluster is assumed to be geometrically distributed with mean a; (ii) the number of cis-elements in a cluster is assumed to be geometrically distributed with mean b; and (iii) the distance between regulatory cis-element clusters is assumed to be geometrically distributed with mean g.
  • the background states are programmed to represent the local abundances of the 4 bases in the query sequence. Examining local abundances accounts for the biological reality of heterogeneous base composition, and prevents, for example, many spurious GC-rich motifs being detected in a part of the sequence that happens to be generally GC-rich. Cister uses the technique of posterior decoding, with this hidden Markov model.
  • a reformatting program may be used to convert the PSSMs into a format that a program subsequently used to process the PSSMs - such as CISTER - can understand.
  • various search algorithms may be used to search a plurality of motifs that are shared by HS core sequences.
  • Some programs - such as YEBIS - provide a fractional value for each of the four nucleotides for each position of the PSSM.
  • the reformatting program typically extracts data that show the aligned motifs by providing the actual numbers of occurrences of each nucleotide in each position of the PSSM. The results that are returned are actual numbers rather than fractions.
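A reformatting step of this kind might look like the sketch below, which counts the occurrences of each nucleotide per position from aligned motifs and, alternatively, converts a row of fractional values back into counts. The function names, the A/C/G/T column order and the example data are assumptions; the actual input format expected by a downstream program such as CISTER is not reproduced here.

```python
BASES = "ACGT"


def count_matrix(aligned_motifs):
    """Return one [A, C, G, T] row of occurrence counts per motif position."""
    length = len(aligned_motifs[0])
    if any(len(m) != length for m in aligned_motifs):
        raise ValueError("aligned motifs must all have the same length")
    return [[column.count(base) for base in BASES]
            for column in zip(*aligned_motifs)]


def fractions_to_counts(fraction_row, n_sites):
    """Convert fractional values for one position into whole-number counts."""
    return [round(fraction * n_sites) for fraction in fraction_row]


# Hypothetical aligned motifs and a hypothetical fractional row.
counts = count_matrix(["ACGT", "ACGA", "TCGT"])
recovered = fractions_to_counts([0.67, 0.0, 0.0, 0.33], n_sites=3)
```

Converting fractions back to counts requires knowing how many aligned sites contributed to the matrix, which is why extracting the aligned motifs directly is the more reliable route.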
  • the methods of the present invention may be used to define high-resolution stochastic models to optimise the recognition rate of HSs. Accordingly, it may be possible to determine whether certain consensus sequences occur in a particular combination, whether the spacing between certain motifs is important, the relative frequency of each motif and whether some of the consensus sequences are more diagnostic than others.
  • step (c) returns the HS consensus sequences as a PSSM comprising the steps of: (i) providing a plurality of HS consensus sequences; (ii) computing the log-odds score for finding a matching sequence in the plurality of HS consensus sequences; (iii) returning the variability in the plurality of HS consensus sequences as a PSSM; and (iv) identifying HS sequences in one or more DNA sequences that were not part of the plurality of HS core sequences using the PSSMs identified.
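Steps (ii) to (iv) can be illustrated with a minimal sketch that builds a log-odds PSSM from aligned sequences and scans a new DNA sequence with it. The uniform background, the pseudocount, the score threshold, the function names and the example sequences are all assumptions chosen for illustration; they are not values prescribed by the method.

```python
import math

BASES = "ACGT"


def log_odds_pssm(aligned_motifs, background=None, pseudocount=1.0):
    """Return a list of {base: log2(observed / background)} rows."""
    background = background or {base: 0.25 for base in BASES}
    n = len(aligned_motifs)
    pssm = []
    for column in zip(*aligned_motifs):
        row = {}
        for base in BASES:
            # The pseudocount keeps unseen bases from giving log(0).
            freq = (column.count(base) + pseudocount * background[base]) / (n + pseudocount)
            row[base] = math.log2(freq / background[base])
        pssm.append(row)
    return pssm


def scan(sequence, pssm, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(ch not in BASES for ch in window):
            continue  # skip windows containing ambiguous characters
        score = sum(pssm[i][ch] for i, ch in enumerate(window))
        if score >= threshold:
            hits.append((start, round(score, 2)))
    return hits


# Hypothetical consensus sequences and a hypothetical query sequence.
pssm = log_odds_pssm(["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"])
hits = scan("CCGGTGACTCAGGCC", pssm, threshold=5.0)
```

In this toy example only the window starting at position 4 exceeds the threshold; in practice the threshold would need to be calibrated against sequences of known status.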
  • HS consensus sequences may be returned as a PSSM using various methods known in the art such as E-matrix maker (Thomas et al. (1999) Journal of Computational Biology 6: 219-235, 1999; Thomas et al. Bioinformatics 16: 233-244, 2000) which is available at http://motif.stanford.edu/ematrix-maker.
  • the present invention provides a method for identifying an HS core sequence comprising the steps of: (a) providing a DNA sequence in the sense or antisense orientation that is not part of the plurality of HS core sequences; (b) providing an HS sequence; and (c) searching the DNA sequence for the presence of a hypersensitive restriction site.
  • the DNA sequence may be from a database of DNA sequences.
  • the DNA sequence is about 10 kb in length.
  • the sequence may be converted to capitals and the ">" characters and whitespace may be removed. If the alignment in the sense orientation fails then the DNA sequence may be converted to an antisense orientation using routine methods known in the art.
  • An HS sequence is provided which may be converted to capitals and the ">" characters and whitespace may be removed.
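The clean-up and orientation steps described above can be sketched as follows. Exact substring matching stands in for the alignment step, which is a simplification; the function names are hypothetical, and only the ">" characters themselves are stripped, as the text states, so any header text following a ">" would need to be removed separately.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def clean_sequence(raw):
    """Convert to capitals and remove '>' characters and all whitespace."""
    return "".join(raw.upper().replace(">", "").split())


def reverse_complement(sequence):
    """Return the antisense orientation of an upper-case DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def find_in_either_orientation(dna, hs):
    """Look for hs in dna, trying the sense orientation first."""
    dna, hs = clean_sequence(dna), clean_sequence(hs)
    position = dna.find(hs)
    if position != -1:
        return "sense", position
    position = reverse_complement(dna).find(hs)
    if position != -1:
        return "antisense", position
    return None


# Hypothetical input containing a '>' character, spaces and a line break.
cleaned = clean_sequence("> acg tTA\nggc")
result = find_in_either_orientation("ACGTTAGGC", "cctaa")
```

The antisense position is reported relative to the reverse-complemented sequence, so it would need to be converted if coordinates on the original strand are required.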
  • the HS sequence is between about 50 nucleotides to about 200 nucleotides in length, for example the HS sequence may be about 50 nucleotides in length.
  • the DNA sequence is then searched for the presence of a hypersensitive site - such as a hypersensitive enzyme site - in the sense or antisense orientation.
  • the enzyme may be a sequence specific nuclease or a restriction enzyme including type I, II and III restriction enzymes.
  • the enzyme is a restriction enzyme - such as a restriction enzyme that recognises at least a 4 base pair (bp) target sequence. More preferably, the restriction enzyme is selected from the group consisting of DpnII, MboI, NlaIII, Sau3A and Tsp509I.
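A search for the target sequences of the enzymes listed above might be sketched as below. The recognition sequences in `RECOGNITION_SITES` are not given in the specification; they are the commonly documented 4 bp sites for these enzymes and should be checked against the supplier's data before use. Because all of them are palindromic, a site found in the sense orientation is also present at the same position on the antisense strand. The function name and example sequence are hypothetical.

```python
import re

# Assumed recognition sequences (not stated in the source text).
RECOGNITION_SITES = {
    "DpnII": "GATC",
    "MboI": "GATC",
    "NlaIII": "CATG",
    "Sau3A": "GATC",
    "Tsp509I": "AATT",
}


def find_sites(sequence, enzyme):
    """Return the 0-based start of every recognition site for one enzyme."""
    site = RECOGNITION_SITES[enzyme]
    # A lookahead is used so that overlapping occurrences are also reported.
    pattern = "(?=" + re.escape(site) + ")"
    return [m.start() for m in re.finditer(pattern, sequence.upper())]


# Hypothetical query sequence.
query = "TTGATCAATTCATGGATC"
site_map = {enzyme: find_sites(query, enzyme) for enzyme in RECOGNITION_SITES}
```

Enzymes sharing a recognition sequence (here DpnII, MboI and Sau3A) return identical positions, so in practice one search per distinct site is sufficient.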
  • Examples include the inhibition of transcription of mutated proto-oncogenes - such as erbB-2 and bcr-abl (cancer); activation of fetal haemoglobin (sickle cell anemia); growth hormone (dwarfism); and erythropoietin and vascular endothelial growth factor (cancer therapy, diabetes); and the regulation of human telomerase reverse transcriptase to control aging and cancer proliferation. Therefore, the ability to predict the locations of HSs by bioinformatic means as described herein has numerous implications for biotechnological and medical applications. Much of the current experimental work in biotechnology relates to the identification of various human gene regulatory sequences.
  • enhancers contain clusters of transcription factor binding sites, and can stimulate the activity of adjacent genes substantially. Enhancers also play an important role in directing tissue-specific gene expression programmes. Other gene regulatory regions, such as silencers, are involved in switching off the expression of nearby genes. Research carried out over the last two decades has established a strong link between the locations of enhancers and silencers with HSs (reviewed in Gross and Garrard, 1988; Bonifer, 2000). The ability to detect HSs using a bioinformatic approach in various eukaryotic genome sequences (especially the human genome) may have the potential to identify systematically numerous constitutive and tissue-specific enhancer and silencer sequences. These identified enhancer and silencer sequences may be suitable for numerous applications in gene therapy.
  • Constitutive enhancers that are active in a wide spectrum of cell types can be used to promote the expression of a target gene in any cell type.
  • Tissue-specific enhancers provide an enhanced level of specificity and can be used to promote the expression of target genes in a particular tissue, or in a restricted range of cell types.
  • silencers can be used to switch off the expression of unwanted genes (e.g. oncogenes) or to silence the expression of parasitic genomes (e.g. during viral infections). Examples to illustrate this approach can be found in Smith et al. (2000) and Phylactides et al. (2002).
  • HSs cystic fibrosis transmembrane conductance regulator
  • the identified enhancers can be used to direct the tissue- and stage-specific expression of a synthetic CFTR gene in future gene therapeutic applications.
  • Harland et al. (2002) specifically set out to identify an enhancer that confers ubiquitous expression to adjacent genes. They successfully analysed the promoter of the universally expressed transcription factor TATA-binding protein (TBP) for the presence of a DNase I hypersensitive site indicative of the location of such enhancers.
  • bioinformatic tools for the predictions of HSs may be applied with ease to large regions of sequenced genomes and to identify and select HS candidate sequences located near a multitude of different genes. These candidate sequences can then be experimentally verified, thus providing substantial savings in cost and time.
  • HSs identified by bioinformatic means near any gene known or suspected to cause genetic and other diseases may therefore be useful for the prediction of genomic regions that are important candidates for the development of diagnostic tests. Sequencing of such bioinformatically identified regions in genomes derived from normal individuals and patients will rapidly and efficiently identify such mutations and lead to possible therapeutic interventions (eg. by gene therapy).
  • the identification of HSs by bioinformatic means also has numerous implications for transcription factor-based therapeutic applications.
  • Ma et al. (2002) identified two transcription factors binding to DNA sequences present in a HS near the platelet-derived growth factor (PDGF)-A gene (implicated in tumorigenesis, metastasis and tumour progression). These transcription factors are repressors and are capable of diminishing the transcription of the PDGF-A gene.
  • the identified transcription factors may play an important role in dampening the expression of this oncogenic growth factor.
  • the bioinformatic identification of HSs (especially those located near disease-causing genes) may be the starting point of large scale screens to identify transcription factors capable of interacting with them.
  • the bioinformatically identified HS core sequences can be chemically synthesised as single- and double-stranded oligonucleotides and used to isolate transcription factors binding to them (e.g. using the cDNA expression library screening approach used by Ma et al. (2002)).
  • the oligonucleotides derived from predicted HSs may be used to prepare DNA affinity columns (Kadonaga and Tjian, 1986).
  • the identified transcription factors may then be used as targets for the isolation and development of drugs capable of modulating their functional characteristics in a therapeutically useful manner.
  • HSs are an important regulatory access point for external agents to act upon the genome. It will be appreciated that the methods of the present invention have many applications in biotechnology and medicine, which may be carried out faster, and more economically using the methods described herein. The methods of the present invention are broadly applicable to all eukaryotic genomes.
  • nucleotide sequence is synonymous with the term “polynucleotide”.
  • aspects of the present invention involve the use of nucleotide sequences, which are available in databases.
  • the nucleotide sequence may be DNA or RNA of genomic or synthetic or recombinant origin.
  • the nucleotide sequence may be double-stranded or single-stranded whether representing the sense or antisense strand or combinations thereof.
  • the nucleotide sequence may be prepared by use of recombinant DNA techniques (e.g. recombinant DNA).
  • the nucleotide sequence may be the same as the naturally occurring form, or may be derived therefrom.
  • amino acid sequence is synonymous with the term “polypeptide” and/or the term “protein”. In some instances, the term “amino acid sequence” is synonymous with the term “peptide”. In some instances, the term “amino acid sequence” is synonymous with the term “protein”.
  • aspects of the present invention concern the use of amino acid sequences, which may be available in databases.
  • amino acid sequence may be isolated from a suitable source, or it may be made synthetically or it may be prepared by use of recombinant DNA techniques.
  • the present invention encompasses the use of variants, homologues and derivatives of nucleotide sequences.
  • the term “homologue” means an entity having a certain homology with nucleotide sequences.
  • the term “homology” can be equated with “identity”.
  • A homologous sequence is taken to include a nucleotide sequence which may be at least 75, 85 or 90% identical, preferably at least 95 or 98% identical to the subject sequence.
  • Homology comparisons can be conducted by eye, or more usually, with the aid of readily available sequence comparison programs. These commercially available computer programs can calculate % homology between two or more sequences.
  • % homology may be calculated over contiguous sequences, i.e. one sequence is aligned with the other sequence and each nucleotide in one sequence is directly compared with the corresponding nucleotide in the other sequence, one residue at a time. This is called an "ungapped" alignment. Typically, such ungapped alignments are performed only over a relatively short number of residues.
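The ungapped comparison described above amounts to counting identical residues position by position. The sketch below shows that calculation; the function name `percent_identity` and the example sequences are assumptions, and the sketch does not attempt the gapped alignments performed by the sequence comparison programs mentioned in this section.

```python
def percent_identity(seq_a, seq_b):
    """Return % identity for two equal-length, ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("an ungapped comparison needs sequences of equal length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Compare the sequences one residue at a time, ignoring case.
    matches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a == b)
    return 100.0 * matches / len(seq_a)


# Hypothetical 10-nucleotide sequences differing at one position.
identity = percent_identity("ACGTACGTAC", "ACGTTCGTAC")
```

Under the thresholds given above, a result of 90% would meet the "at least 90% identical" criterion but not the preferred 95% or 98% levels.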
  • BLAST and FASTA are available for offline and online searching (see Ausubel et al, 1999 ibid, pages 7-58 to 7-60). However, for some applications, it is preferred to use the GCG Bestfit program.
  • a new tool, called BLAST 2 Sequences is also available for comparing nucleotide sequences (see FEMS Microbiol Lett 1999 174(2): 247-50; FEMS Microbiol Lett 1999 177(1): 187-8)
  • a scaled similarity score matrix is generally used that assigns scores to each pairwise comparison based on chemical similarity or evolutionary distance.
  • An example of such a matrix commonly used is the BLOSUM62 matrix - the default matrix for the BLAST suite of programs.
  • GCG Wisconsin programs generally use either the public default values or a custom symbol comparison table if supplied (see user manual for further details). For some applications, it is preferred to use the public default values for the GCG package, or in the case of other software, the default matrix, such as BLOSUM62.
  • % homology preferably % sequence identity.
  • the software typically does this as part of the sequence comparison and generates a numerical result.
  • Nucleotide sequences may include within them synthetic or modified nucleotides.
  • a number of different types of modification to oligonucleotides are known in the art. These include methylphosphonate and phosphorothioate backbones and/or the addition of acridine or polylysine chains at the 3' and/or 5' ends of the molecule. Such modifications may be carried out to enhance the in vivo activity or life span of nucleotide sequences.
  • the present invention relates to a method comprising the additional step of using the identified HS consensus sequences to prepare one or more nucleic acid constructs.
  • the present invention also relates to a nucleic acid construct comprising one or more HS consensus sequences.
  • construct is synonymous with the term “vector” and includes expression vectors, transformation vectors and shuttle vectors.
  • expression vector means a construct capable of in vivo or in vitro expression.
  • transformation vector means a construct capable of being transferred from one entity to another entity - which may be of the same species or may be of a different species. If the construct is capable of being transferred from one species to another - such as from an Escherichia coli plasmid to a bacterium such as of the genus Bacillus - then the transformation vector is sometimes called a "shuttle vector". It may even be a construct capable of being transferred from an E. coli plasmid to an Agrobacterium to a plant.
  • Vectors may be transformed into a suitable host cell as described below to provide for expression of a polypeptide encompassed in the present invention.
  • the vectors may be for example, plasmid, virus or phage vectors provided with an origin of replication, optionally a promoter for the expression of the said polynucleotide and optionally a regulator of the promoter.
  • Vectors may be used in vitro, for example for the production of RNA or used to transfect or transform a host cell.
  • polynucleotides for use in the present invention may be incorporated into a construct - such as a recombinant vector (typically a replicable vector), for example a cloning or expression vector.
  • the vector may be used to replicate the nucleic acid in a compatible host cell.
  • quantities of polynucleotides may be made by introducing a polynucleotide into a replicable vector, introducing the vector into a compatible host cell, and growing the host cell under conditions that bring about replication of the vector.
  • the vector may be recovered from the host cell. Suitable host cells are described below in connection with expression vectors.
  • Genetically engineered host cells may be used to express an amino acid sequence (or variant, homologue, fragment or derivative thereof) in screening methods for the identification of agents and antagonists. Such genetically engineered host cells could be used to screen peptide libraries or organic molecules. Antagonists and agents such as antibodies, peptides or small organic molecules will provide the basis for pharmaceutical compositions.
  • the present invention relates to a method comprising the additional step of using the identified HS consensus sequences in an assay (or assay development program).
  • Any one or more appropriate targets - such as a nucleotide sequence of an HS consensus sequence - may be used for identifying a chromatin modulating (e.g. modifying) agent according to the present invention.
  • the target employed in such a test may be free in solution, affixed to a solid support, borne on a cell surface, or located intracellularly.
  • the abolition of target activity or the formation of binding complexes between the target and the chromatin modulating (e.g. modifying) agent being tested may be measured.
  • the methods of the present invention may be a screen, whereby a number of chromatin modulating (e.g. modifying) agents are tested.
  • Techniques for drug screening may be based on the method described in Geysen, European Patent Application 84/03564, published on September 13, 1984.
  • large numbers of different small peptide test compounds are synthesized on a solid substrate, such as plastic pins or some other surface.
  • the peptide test compounds are reacted with a suitable target or fragment thereof and washed. Bound entities are then detected - such as by appropriately adapting methods well known in the art.
  • a purified target may also be coated directly onto plates for use in drug screening techniques.
  • non-neutralising antibodies may be used to capture the peptide and immobilise it on a solid support.
  • chromatin modulating agent may refer to a single entity or a combination of entities.
  • the chromatin modulating agent may be an organic compound or other chemical.
  • the chromatin modulating agent may be a compound, which is obtainable from or produced by any suitable source, whether natural or artificial.
  • the chromatin modulating agent may be an amino acid molecule, a polypeptide, or a chemical derivative thereof, or a combination thereof.
  • the chromatin modulating agent may even be a polynucleotide molecule - which may be a sense or an anti-sense molecule.
  • the chromatin modulating agent may even be an antibody.
  • the chromatin modulating agent may be designed or obtained from a library of compounds, which may comprise peptides, as well as other compounds, such as small organic molecules.
  • the chromatin modulating (e.g. modifying) agent may be a natural substance, a biological macromolecule, or an extract made from biological materials such as bacteria, fungi, or animal (particularly mammalian) cells or tissues, an organic or an inorganic molecule, a synthetic agent, a semi-synthetic agent, a structural or functional mimetic, a peptide, a peptidomimetic, a derivatised agent, a peptide cleaved from a whole protein, a peptide synthesised synthetically (such as, by way of example, either using a peptide synthesizer or by recombinant techniques) or combinations thereof, a recombinant agent, an antibody, a natural or a non-natural agent, a fusion protein or equivalent thereof and mutants, derivatives or combinations thereof.
  • the chromatin modulating (e.g. modifying) agent may be an organic compound.
  • the organic compounds may comprise two or more hydrocarbyl groups.
  • hydrocarbyl group means a group comprising at least C and H and may optionally comprise one or more other suitable substituents. Examples of such substituents may include halo-, alkoxy-, nitro-, an alkyl group, a cyclic group etc.
  • a combination of substituents may form a cyclic group. If the hydrocarbyl group comprises more than one C then those carbons need not necessarily be linked to each other. For example, at least two of the carbons may be linked via a suitable element or group.
  • the hydrocarbyl group may contain hetero atoms. Suitable hetero atoms will be apparent to those skilled in the art and include, for instance, sulphur, nitrogen and oxygen.
  • the chromatin modulating (e.g. modifying) agent may comprise at least one cyclic group.
  • the cyclic group may be a polycyclic group, such as a non-fused polycyclic group.
  • the chromatin modulating (e.g. modifying) agent may comprise at least one of said cyclic groups linked to another hydrocarbyl group.
  • the chromatin modulating (e.g. modifying) agent may contain halo groups.
  • halo means halogen compounds eg. halides and includes fluoro, chloro, bromo or iodo groups.
  • the chromatin modulating (e.g. modifying) agent may contain one or more of alkyl, alkoxy, alkenyl, alkylene and alkenylene groups - which may be unbranched- or branched-chain.
  • the chromatin modulating (e.g. modifying) agent may be in the form of a pharmaceutically acceptable salt - such as an acid addition salt or a base salt - or a solvate thereof, including a hydrate thereof.
  • the chromatin modulating (e.g. modifying) agent of the present invention may be capable of displaying other therapeutic properties.
  • the chromatin modulating (e.g. modifying) agent may be used in combination with one or more other pharmaceutically active agents.
  • if combinations of active agents are administered, they may be administered simultaneously, separately or sequentially.
  • the present invention relates to a method comprising the additional step of using the identified HS consensus sequence(s) in a pharmaceutical (or in the preparation of or development of a pharmaceutical).
  • compositions useful in the present invention may comprise a therapeutically effective amount of chromatin modulating (e.g. modifying) agent(s) and pharmaceutically acceptable carrier, diluent or excipient (including combinations thereof).
  • compositions may be for human or animal usage in human and veterinary medicine and will typically comprise any one or more of a pharmaceutically acceptable diluent, carrier, or excipient.
  • Acceptable carriers or diluents for therapeutic use are well known in the pharmaceutical art, and are described, for example, in Remington's Pharmaceutical Sciences, Mack Publishing Co. (A. R. Gennaro edit. 1985).
  • the choice of pharmaceutical carrier, excipient or diluent may be selected with regard to the intended route of administration and standard pharmaceutical practice.
  • Pharmaceutical compositions may comprise as - or in addition to - the carrier, excipient or diluent any suitable binder(s), lubricant(s), suspending agent(s), coating agent(s) or solubilising agent(s).
  • Preservatives, stabilizers, dyes and even flavoring agents may be provided in pharmaceutical compositions.
  • preservatives include sodium benzoate, sorbic acid and esters of p-hydroxybenzoic acid.
  • Antioxidants and suspending agents may be also used.
  • compositions useful in the present invention may be formulated to be administered using a mini-pump or by a mucosal route, for example, as a nasal spray or aerosol for inhalation or ingestible solution, or parenterally in which the composition is formulated in an injectable form, for delivery by, for example, an intravenous, intramuscular or subcutaneous route.
  • the formulation may be designed to be administered by a number of routes.
  • Chromatin modulating (e.g. modifying) agents may also be used in combination with a cyclodextrin.
  • Cyclodextrins are known to form inclusion and non-inclusion complexes with drug molecules. Formation of a drug-cyclodextrin complex may modify the solubility, dissolution rate, bioavailability and/or stability property of a drug molecule. Drug-cyclodextrin complexes are generally useful for most dosage forms and administration routes.
  • the cyclodextrin may be used as an auxiliary additive, e.g. as a carrier, diluent or solubiliser.
  • Alpha-, beta- and gamma-cyclodextrins are most commonly used and suitable examples are described in WO-A-91/11172, WO-A-94/02518 and WO-A-98/55148.
  • if the chromatin modulating (e.g. modifying) agent is a protein, said protein may be prepared in situ in the subject being treated.
  • nucleotide sequences encoding said protein may be delivered by use of non-viral techniques (e.g. by use of liposomes) and/or viral techniques (e.g. by use of retroviral vectors) such that the said protein is expressed from said nucleotide sequence.
  • the chromatin modulating (e.g. modifying) agents may exist as stereoisomers and/or geometric isomers - e.g. they may possess one or more asymmetric and/or geometric centres and so may exist in two or more stereoisomeric and/or geometric forms.
  • the present invention contemplates the use of all of the individual stereoisomers and geometric isomers of those chromatin modulating (e.g. modifying) agents, and mixtures thereof.
  • the terms used in the claims encompass these forms, provided said forms retain the appropriate functional activity (though not necessarily to the same degree).
  • the chromatin modulating (e.g. modifying) agent may be administered in the form of a pharmaceutically acceptable salt.
  • Suitable acid addition salts are formed from acids which form non-toxic salts and include the hydrochloride, hydrobromide, hydroiodide, nitrate, sulphate, bisulphate, phosphate, hydrogenphosphate, acetate, trifluoroacetate, gluconate, lactate, salicylate, citrate, tartrate, ascorbate, succinate, maleate, fumarate, formate, benzoate, methanesulphonate, ethanesulphonate, benzenesulphonate and p-toluenesulphonate salts.
  • suitable pharmaceutically acceptable base addition salts can be formed from bases which form non-toxic salts and include the aluminium, calcium, lithium, magnesium, potassium, sodium, zinc, and pharmaceutically-active amines such as diethanolamine, salts.
  • a pharmaceutically acceptable salt of a chromatin modulating (e.g. modifying) agent may be readily prepared by mixing together solutions of a chromatin modulating (e.g. modifying) agent and the desired acid or base, as appropriate. The salt may precipitate from solution and be collected by filtration or may be recovered by evaporation of the solvent.
  • a chromatin modulating (e.g. modifying) agent may exist in polymorphic form.
  • a chromatin modulating (e.g. modifying) agent may contain one or more asymmetric carbon atoms and therefore exist in two or more stereoisomeric forms. Where a chromatin modulating (e.g. modifying) agent contains an alkenyl or alkenylene group, cis (Z) and trans (E) isomerism may also occur.
  • the present invention includes the individual stereoisomers of a chromatin modulating (e.g. modifying) agent and, where appropriate, the individual tautomeric forms thereof, together with mixtures thereof.
  • Separation of diastereoisomers or cis- and trans-isomers may be achieved by conventional techniques, e.g. by fractional crystallisation, chromatography or H.P.L.C. of a stereoisomeric mixture of an agent or a suitable salt or derivative thereof.
  • An individual enantiomer of a chromatin modulating (e.g. modifying) agent may also be prepared from a corresponding optically pure intermediate or by resolution, such as by H.P.L.C. of the corresponding racemate using a suitable chiral support or by fractional crystallisation of the diastereoisomeric salts formed by reaction of the corresponding racemate with a suitable optically active acid or base, as appropriate.
  • the present invention also encompasses all suitable isotopic variations of a chromatin modulating (e.g. modifying) agent or a pharmaceutically acceptable salt thereof.
  • An isotopic variation of a chromatin modulating (e.g. modifying) agent or a pharmaceutically acceptable salt thereof is defined as one in which at least one atom is replaced by an atom having the same atomic number but an atomic mass different from the atomic mass usually found in nature. Examples of isotopes that may be incorporated into a chromatin modulating (e.g. modifying) agent and pharmaceutically acceptable salts thereof include isotopes of hydrogen, carbon, nitrogen, oxygen, phosphorus, sulphur, fluorine and chlorine such as 2H, 3H, 13C, 14C, 15N, 17O, 18O, 31P, 32P, 35S, 18F and 36Cl, respectively.
  • Certain isotopic variations of a chromatin modulating (e.g. modifying) agent and pharmaceutically acceptable salts thereof, for example, those in which a radioactive isotope such as 3H or 14C is incorporated, are useful in drug and/or substrate tissue distribution studies. Tritiated, i.e., 3H, and carbon-14, i.e., 14C, isotopes are particularly preferred for their ease of preparation and detectability.
  • isotopic variations of chromatin modulating (e.g. modifying) agents and pharmaceutically acceptable salts thereof can generally be prepared by conventional procedures using appropriate isotopic variations of suitable reagents.
  • a chromatin modulating (e.g. modifying) agent may be derived from a prodrug.
  • prodrugs include entities that have certain protected group(s) and which may not possess pharmacological activity as such, but may, in certain instances, be administered (such as orally or parenterally) and thereafter metabolised in the body to form an agent of the present invention which is pharmacologically active.
  • pro-moieties, for example as described in "Design of Prodrugs" by H. Bundgaard, Elsevier, 1985 (the disclosure of which is hereby incorporated by reference), may be placed on appropriate functionalities of chromatin modulating (e.g. modifying) agents. Such prodrugs are also included within the scope of the invention.
  • the present invention also includes the use of zwitterionic forms of a chromatin modulating (e.g. modifying) agent of the present invention.
  • the terms used in the claims encompass one or more of the forms just mentioned.
  • a chromatin modulating (e.g. modifying) agent may be administered as a pharmaceutically acceptable salt.
  • a pharmaceutically acceptable salt may be readily prepared by using a desired acid or base, as appropriate. The salt may precipitate from solution and be collected by filtration or may be recovered by evaporation of the solvent.
  • the chromatin modulating (e.g. modifying) agent may be prepared by chemical synthesis techniques. It will be apparent to those skilled in the art that sensitive functional groups may need to be protected and deprotected during synthesis of a chromatin modulating (e.g. modifying) agent. This may be achieved by conventional techniques, for example as described in "Protective Groups in Organic Synthesis” by T W Greene and P G M Wuts, John Wiley and Sons Inc. (1991), and by P.J.Kocienski, in “Protecting Groups", Georg Thieme Verlag (1994).
  • any stereocentres present could, under certain conditions, be racemised, for example if a base is used in a reaction with a substrate having an optical centre comprising a base-sensitive group. This is possible during e.g. a guanylation step. It should be possible to circumvent potential problems such as this by choice of reaction sequence, conditions, reagents, protection/deprotection regimes etc. as is well-known in the art.
  • the compounds and salts may be separated and purified by conventional methods.
  • Separation of diastereomers may be achieved by conventional techniques, e.g. by fractional crystallisation, chromatography or H.P.L.C. of a stereoisomeric mixture of a compound or a suitable salt or derivative thereof.
  • An individual enantiomer of a compound may also be prepared from a corresponding optically pure intermediate or by resolution, such as by H.P.L.C. of the corresponding racemate using a suitable chiral support or by fractional crystallisation of the diastereomeric salts formed by reaction of the corresponding racemate with a suitably optically active acid or base.
  • the chromatin modulating (e.g. modifying) agent or variants, homologues, derivatives, fragments or mimetics thereof may be produced using chemical methods to synthesise the chromatin modulating (e.g. modifying) agent in whole or in part.
  • where the chromatin modulating (e.g. modifying) agent is a peptide, the peptide can be synthesized by solid phase techniques, cleaved from the resin, and purified by preparative high performance liquid chromatography (e.g., Creighton (1983) Proteins Structures And Molecular Principles, WH Freeman and Co, New York NY).
  • the composition of the synthetic peptides may be confirmed by amino acid analysis or sequencing (e.g., the Edman degradation procedure; Creighton, supra).
  • the term "derivative" or "derivatised" as used herein includes chemical modification of a chromatin modulating (e.g. modifying) agent. Illustrative of such chemical modifications would be replacement of hydrogen by a halo group, an alkyl group, an acyl group or an amino group.
  • the chromatin modulating (e.g. modifying) agent may be a chemically modified agent.
  • the chemical modification of a chromatin modulating (e.g. modifying) agent may either enhance or reduce hydrogen bonding interaction, charge interaction, hydrophobic interaction, Van Der Waals interaction or dipole interaction.
  • the chromatin modulating (e.g. modifying) agent may act as a model (for example, a template) for the development of other compounds.
  • the present invention provides a method of modulating (e.g. modifying) chromatin structure in a subject comprising administering to the subject an effective amount of one or more chromatin modulating (e.g. modifying) agents identified according to the methods of the present invention.
  • the chromatin modulating (e.g. modifying) agents of the present invention may be administered alone but will generally be administered as a pharmaceutical composition comprising one or more components - e.g. when the components are in admixture with a suitable pharmaceutical excipient, diluent or carrier selected with regard to the intended route of administration and standard pharmaceutical practice.
  • the components may be administered (e.g. orally) in the form of tablets, capsules, ovules, elixirs, solutions or suspensions, which may contain flavouring or colouring agents, for immediate-, delayed-, modified-, sustained-, pulsed- or controlled-release applications.
  • the tablet may contain excipients such as microcrystalline cellulose, lactose, sodium citrate, calcium carbonate, dibasic calcium phosphate and glycine, disintegrants such as starch (preferably corn, potato or tapioca starch), sodium starch glycollate, croscarmellose sodium and certain complex silicates, and granulation binders such as polyvinylpyrrolidone, hydroxypropylmethylcellulose (HPMC), hydroxypropylcellulose (HPC), sucrose, gelatin and acacia. Additionally, lubricating agents such as magnesium stearate, stearic acid, glyceryl behenate and talc may be included.
  • Solid compositions of a similar type may also be employed as fillers in gelatin capsules.
  • Preferred excipients in this regard include lactose, starch, a cellulose, milk sugar or high molecular weight polyethylene glycols.
  • the chromatin modulating (e.g. modifying) agent may be combined with various sweetening or flavouring agents, colouring matter or dyes, with emulsifying and/or suspending agents and with diluents such as water, ethanol, propylene glycol and glycerin, and combinations thereof.
  • the routes for administration include, but are not limited to, one or more of: oral (e.g. as a tablet, capsule, or as an ingestible solution), topical, mucosal (e.g. as a nasal spray or aerosol for inhalation), nasal, parenteral (e.g. by an injectable form), gastrointestinal, intraspinal, intraperitoneal, intramuscular, intravenous, intrauterine, intraocular, intradermal, intracranial, intratracheal, intravaginal, intracerebroventricular, intracerebral, subcutaneous, ophthalmic (including intravitreal or intracameral), transdermal, rectal, buccal, vaginal, epidural, sublingual.
  • if the composition comprises more than one active component, then those components may be administered by different routes.
  • if a component is administered parenterally, then examples of such administration include one or more of: intravenously, intra-arterially, intraperitoneally, intrathecally, intraventricularly, intraurethrally, intrasternally, intracranially, intramuscularly or subcutaneously administering the component; and/or by using infusion techniques.
  • the component is best used in the form of a sterile aqueous solution which may contain other substances, for example, enough salts or glucose to make the solution isotonic with blood.
  • aqueous solutions should be suitably buffered (preferably to a pH of from 3 to 9), if necessary.
  • the preparation of suitable parenteral formulations under sterile conditions is readily accomplished by standard pharmaceutical techniques well-known to those skilled in the art.
  • the component(s) useful in the present invention may be administered intranasally or by inhalation and is conveniently delivered in the form of a dry powder inhaler or an aerosol spray presentation from a pressurised container, pump, spray or nebuliser with the use of a suitable propellant, e.g. dichlorodifluoromethane, trichlorofluoromethane, dichlorotetrafluoroethane, a hydrofluoroalkane such as 1,1,1,2-tetrafluoroethane (HFA 134A™) or 1,1,1,2,3,3,3-heptafluoropropane (HFA 227EA™), carbon dioxide or other suitable gas.
  • the dosage unit may be determined by providing a valve to deliver a metered amount.
  • the pressurised container, pump, spray or nebuliser may contain a solution or suspension of the active compound, e.g. using a mixture of ethanol and the propellant as the solvent, which may additionally contain a lubricant, e.g. sorbitan trioleate.
  • Capsules and cartridges (made, for example, from gelatin) for use in an inhaler or insufflator may be formulated to contain a powder mix of the agent and a suitable powder base such as lactose or starch.
  • the component(s) may be administered in the form of a suppository or pessary, or it may be applied topically in the form of a gel, hydrogel, lotion, solution, cream, ointment or dusting powder.
  • the component(s) may also be dermally or transdermally administered, for example, by the use of a skin patch. They may also be administered by the pulmonary or rectal routes. They may also be administered by the ocular route.
  • the compounds may be formulated as micronised suspensions in isotonic, pH adjusted, sterile saline, or, preferably, as solutions in isotonic, pH adjusted, sterile saline, optionally in combination with a preservative such as a benzalkonium chloride.
  • they may be formulated in an ointment such as petrolatum.
  • the component(s) may be formulated as a suitable ointment containing the active compound suspended or dissolved in, for example, a mixture with one or more of the following: mineral oil, liquid petrolatum, white petrolatum, propylene glycol, polyoxyethylene polyoxypropylene compound, emulsifying wax and water.
  • it may be formulated as a suitable lotion or cream, suspended or dissolved in, for example, a mixture of one or more of the following: mineral oil, sorbitan monostearate, a polyethylene glycol, liquid paraffin, polysorbate 60, cetyl esters wax, cetearyl alcohol, 2-octyldodecanol, benzyl alcohol and water.
  • the term "administered” also includes delivery by viral or non-viral techniques.
  • Viral delivery mechanisms include but are not limited to adenoviral vectors, adeno-associated viral (AAV) vectors, herpes viral vectors, retroviral vectors, lentiviral vectors, and baculoviral vectors.
  • Non-viral delivery mechanisms include lipid mediated transfection, liposomes, immunoliposomes, lipofectin, cationic facial amphiphiles (CFAs) and combinations thereof.
  • a physician will determine the actual dosage which will be most suitable for an individual subject.
  • the specific dose level and frequency of dosage for any particular patient may be varied and will depend upon a variety of factors including the activity of the specific compound employed, the metabolic stability and length of action of that compound, the age, body weight, general health, sex, diet, mode and time of administration, rate of excretion, drug combination, the severity of the particular condition, and the individual undergoing therapy.
  • the component(s) may be formulated into a pharmaceutical composition, such as by mixing with one or more of a suitable carrier, diluent or excipient, by using techniques that are known in the art.
  • the present invention employs, unless otherwise indicated, conventional techniques of chemistry, molecular biology, microbiology, recombinant DNA and immunology, which are within the capabilities of a person of ordinary skill in the art. Such techniques are explained in the literature. See, for example, J. Sambrook, E. F. Fritsch, and T. Maniatis, 1989, Molecular Cloning: A Laboratory Manual, Second Edition, Books 1-3, Cold Spring Harbor Laboratory Press; Ausubel, F. M. et al. (1995 and periodic supplements; Current Protocols in Molecular Biology, ch. 9, 13, and 16, John Wiley & Sons, New York, N.Y.); B. Roe, J. Crabtree, and A.
  • HeLa S3 cells (obtained from the European Collection of Cell Cultures; ECACC Ref. No. 87110901) are grown to 80% confluency in 150 cm2 flasks at 37°C in Dulbecco's Minimal Essential Medium/10% newborn calf serum (Sigma) in a 5% CO2 humidified atmosphere. Before carrying out the procedure the appearance of cells is visually checked and their overall viability (>97%) assessed by trypan blue staining. After removing the medium the adherent cells are rinsed in Dulbecco's PBS (-Ca2+/Mg2+) and around 75% of the cells are detached by trypsin treatment. Isolation of nuclei is carried out using established protocols (Protocol
  • NP40 lysis buffer: 10 mM Tris-HCl [pH 7.5]; 10 mM NaCl; 3 mM MgCl2; 0.5% NP-40; 0.15 mM spermine-tetrachloride; 0.15 mM spermidine-trichloride
  • the nuclei are purified from the lysate by low speed centrifugation and washed once in
  • Restriction enzyme buffer* 50 mM Tris-HCl [pH 8.0]; 100 mM NaCl; 3 mM MgCl2; 0.15 mM spermine-tetrachloride; 0.15 mM spermidine-trichloride.
  • the purified nuclei are resuspended in 500 µl NEB Buffer 3 (100 mM NaCl, 50 mM Tris-HCl [pH 7.9], 10 mM MgCl2, 1 mM dithiothreitol) to yield a final volume of around 800 µl.
  • Six 100 µl aliquots of the nuclei suspension are distributed into six separate microcentrifuge tubes which are subjected to the following treatments:
  • Reaction 2: 100 units Mbo I (recombinant; 5 units/µl; New England Biolabs)
  • Reaction 3 50 units Mbo I
  • There are at least two requirements for oligonucleotide adapters: i) they must have a single-stranded cohesive end that will base pair specifically with the DNA fragment ends produced by the genomic restriction nuclease fragmentation reaction, and ii) they must have a unique sequence that is different from any other sequence found in eukaryotic genomes and can therefore act as a specific primer binding site during the PCR reaction.
  • the residue underlined corresponds to the 5'-OH group of the top strand oligonucleotide that can be labeled with, for example, radioactive and/or fluorescent labels to facilitate the subsequent detection of the PCR products.
  • the residue underlined may be phosphorylated to ensure the formation of a covalent link between the adapter oligonucleotide and the cleaved genomic DNA during the ligation reaction.
  • sequences shown in italics are derived from bacteriophage M13 and are used extensively as a 'universal' sequencing primer in standard plasmids and bacteriophage cloning vectors.
  • the motif does not display any obvious homologies to any sequenced eukaryotic genome and will therefore not cross-hybridize to endogenous loci during PCR reactions. Any other unique oligonucleotide sequence not occurring in the tested genome is suitable for this task.
  • oligonucleotide is purified away from unincorporated rATP using a G25 microspin column (Amersham-Pharmacia). 8 pmoles of the 'top strand' oligonucleotide are added to the column eluate and, after heating to 75°C for 5 minutes, the mixture is allowed to cool to room temperature to anneal the two strands. 2 µl of the annealed BamHI adapter is ligated to 1 µg of hypersensitive site Mbo I-cleaved genomic DNA at 16°C for 1 hour in a final volume of 10 µl (note that due to the small number of cleaved Mbo I sites in the genomic DNA the adapter is likely to be in substantial excess and therefore no alkaline phosphatase treatment of the genomic DNA is required to prevent ligation of genomic fragments to each other). 1 µl of the ligation reaction containing the adapter-tagged DNA is used for each PCR reaction.
  • PCR reactions are carried out in a total reaction volume of 50 µl with a Stratagene Robocycler using the M ⁇ P 'Easy Start' system (obtained from Merck). Amplification is carried out for 40 cycles (45 seconds at 95°C; 45 seconds at 55°C; 1 minute at 72°C). 10 µl of each PCR reaction are analyzed on a 0.7% agarose/TBE gel.
  • the HS data set identified in Example 1 and shown in Figure 9 is analysed for the presence of HS consensus sequences using the computer program YEBIS (www-scc.jst.go.jp/YEBIS/MotifExtraction/; Yada et al., 1998) with default parameters. 17 different sequences, ranging in length from 7 to 13 nucleotides, are identified (Figure 10).
  • Some of the sequences (e.g. 'Motif 1') are relatively short, but are present as identical copies in several different HS sequences. Other sequences are substantially longer and display some degree of variability, but consensus sequences with highly conserved residues in particular positions emerge clearly in all cases.
  • the HS data set was also processed using the MOTIFSAMPLER algorithm incorporating a higher-order background model (www.esat.kuleuven.ac.be/~thijs/Work/MotifSampler.html; Thijs et al., 2001).
  • This program differs from YEBIS because the length of the motifs and number of detected motifs can be entered as part of the search criteria.
  • Analysis of the HS data set was carried out by specifying the expected lengths of motifs as 8, 12 and 15, respectively, in three independent runs. Again, motifs shared by different members of the HS data set were successfully identified.
  • the MOTIFSAMPLER motifs of length 12 appeared to be most effective.
  • the MOTIFSAMPLER output is shown in Figure 11.
  • The variability in the various consensus sequences identified in Example 3 is encoded as 'position-specific scoring matrices' ('PSSMs'; Frech et al., 1997).
  • the PSSMs obtained using the data obtained from YEBIS are shown in Figure 12.
  • the PSSMs are rearranged in a different format from the aligned sequences using a custom-written PERL script, YEBIS-MATRIX.
  • the resulting PSSMs are listed in Figure 13.
  • DNA sequences are analysed for high-density clusters of consensus sequences identified in the HS data set using PSSMs based on the YEBIS and MOTIFSAMPLER results (described above).
  • the relatively small HS data set available at this time does not yet allow the definition of high-resolution stochastic models to optimise the recognition rate of HSs.
  • the program CISTER, which incorporates a hidden Markov model to enhance the rate of detection of biologically significant cis-element clusters (Frith et al., 2001), is used.
  • the HS-PSSMs are fed into the program CISTER (sullivan.bu.edu/~mfrith/cister.shtml) to locate the occurrence of similar motif clusters in new DNA sequences that were not part of the initial training set.
  • the resulting program identifies with a high degree of accuracy a number of constitutive HSs present in viral and cellular gene promoters. This result indicates that the prediction of novel HSs is feasible using information extracted from our HS data set.
  • Human β-globin constitutive HS5 (Genbank Accession No. AF064190).
  • Mouse mammary tumour virus 3 ' long terminal repeat (MMTV-3' LTR; Genbank Accession No. MMTPRO).
  • the DNA sequence tested consists of a continuous string of the 59 HS core sequences (HS Core Sequences Group A & B), preceded by the same sequence randomised (10 bp Randomized HS Core Sequences) (Figure 7).
  • This procedure creates a test sequence which contains two halves: the first half is random (and thus should lack HS-specific motifs), whereas the other half is 'packed' with all the HS sequences.
  • the randomisation is carried out in blocks of 10 bp, so any local variation in base composition is preserved, whereas regulatory motifs get effectively scrambled.
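By way of illustration only, the construction of such a test sequence might be sketched as follows. The function names, the optional seed and the use of Python's standard random module are assumptions made for this sketch; the text above specifies only that the randomisation is carried out in 10 bp blocks and that the randomised copy precedes the HS core sequences.

```python
import random


def shuffle_in_blocks(sequence, block_size=10, seed=None):
    """Shuffle bases within consecutive blocks, leaving block order unchanged.

    Base composition inside every block is preserved, while any motif
    inside the block is scrambled.
    """
    rng = random.Random(seed)  # seed is a hypothetical convenience for reproducibility
    blocks = []
    for start in range(0, len(sequence), block_size):
        block = list(sequence[start:start + block_size])
        rng.shuffle(block)
        blocks.append("".join(block))
    return "".join(blocks)


def build_test_sequence(core_sequences, block_size=10, seed=None):
    """Return the randomised half followed by the concatenated core sequences."""
    packed = "".join(core_sequences)
    return shuffle_in_blocks(packed, block_size, seed) + packed
```

The first half of the returned string then serves as the noise control and the second half as the signal, each having the same length and the same local base composition.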
  • PSSMs are compiled from two non-overlapping subsets of HS core sequences: motifs were separately derived from 'Group A' and 'Group B'. This has two purposes:
  • the signal to noise ratio is assessed by comparing the specific signal obtained in the right half against the randomised sequence.
  • Computer-based rules derived from a set of newly identified HSs are defined that enable the prediction of the occurrence and positions of HSs in DNA sequences with a high degree of accuracy. Further increases in the size of the proprietary HS database will help to refine the search for HS consensus sequences and improve the reliability of this approach even further. Accordingly: the positions of HSs can now be predicted using bioinformatic tools, rather than using exclusively experimental tools; it may be possible to define the HS consensus sequences with even higher accuracy and identify further HS consensus sequences once there is access to a larger HS data set, i.e. a larger number of HS sequences; and it is possible to apply bioinformatic tools to predict the presence and positions of HSs near key genes for biotechnological and medical interventions.
  • a computer automated method for identifying one or more HS consensus sequences comprising the steps of: (a) providing a plurality of HS core sequences; (b) using a search algorithm to search for a plurality of motifs that are shared by the HS core sequences; and (c) returning one or more HS consensus sequences comprising a plurality of motifs identified in step (b).
  • a computer automated method for identifying one or more HS sequences comprising the steps of: (a) identifying a plurality of HS core sequences; (b) using a search algorithm to search for a plurality of motifs that are shared by the plurality of HS core sequences; (c) returning one or more HS consensus sequences comprising a plurality of the motifs identified in step (b); and (d) searching for one or more HS sequences comprising one or more HS consensus sequences.
  • a computer automated method for identifying an HS core sequence comprising the steps of: (a) providing a DNA sequence in the sense or antisense orientation that is not part of the plurality of HS core sequences; (b) providing an HS sequence; and (c) searching the DNA sequence for the presence of a hypersensitive restriction site.
  • Chromatin fine structure profiles for a developmentally regulated gene reorganization of the lysozyme locus before trans-activator binding and gene expression.
  • NM23-H1 and NM23-H2 repress transcriptional activities of nuclease-hypersensitive elements in the platelet-derived growth factor-A promoter. J. Biol. Chem. 277, 1560-1567. Mautner, J., Joos, S., Werner, T., Eick, D., Bornkamm, G.W., and Polack, A. (1995). Identification of two enhancer elements downstream of the human c-myc gene. Nucl. Acids Res. 23, 72-80.


Abstract

The present invention relates to a method for identifying one or more HS consensus sequences comprising the steps of: (a) providing a plurality of HS core sequences; (b) using a search algorithm to search for a plurality of motifs that are shared by HS core sequences; and (c) returning one or more HS consensus sequences comprising a plurality of motifs identified in step (b).

Description

METHOD
FIELD OF THE INVENTION
The present invention relates to methods, their uses and products obtained therefrom.
In particular, the present invention relates to Hypersensitive Sites (HSs) and methods for identifying HS consensus sequences, HS sequences and HS core sequences.
BACKGROUND TO THE INVENTION
The large amount of DNA present in eukaryotic cells needs to be efficiently stored into a small space, the cell nucleus. This is achieved by packaging DNA molecules into chromatin, which involves looping DNA molecules around histones to create nucleosomal DNA-protein complexes. Subsequent coiling of these nucleosomal complexes into solenoid and higher order structures increases the packaging density further.
The arrangement of DNA into chromatin reduces the space taken up by DNA molecules in the nucleus, but creates an additional problem: the tight packaging of DNA prevents enzymes involved in gene expression, DNA repair and replication from accessing the genome. This situation is overcome in eukaryotic cells by exposing small selected regions of the genome to the various machineries in the form of 'Hypersensitive Sites' (HSs). For example, Nuclease Hypersensitive sites as the name implies, are genomic sites that are highly susceptible to nuclease attack under experimental conditions.
Thus, nuclease hypersensitive sites (HSs) are genomic regions that are up to two orders of magnitude more accessible to nuclease digestion in purified nuclei preparations in comparison to bulk nuclear DNA (Nedospasov and Georgiev, 1980; Wu, 1980). Subsequent biochemical studies have revealed that HSs are highly specific localized DNA access points for a variety of factors involved in transcription, replication, repair, recombination and attachment to the nuclear matrix. Some HSs are permanently present ('constitutive' HSs), whereas other HSs are only formed in response to specific endogenous or exogenous stimuli ('regulated' HSs). With the advent of extensive sequence data from a variety of eukaryotic genome projects there is renewed interest in bioinformatic tools suitable for identifying functional regulatory elements on the DNA sequence level.
The present invention relates to novel and useful aspects concerning HSs.
SUMMARY OF THE PRESENT INVENTION
The present invention is based upon the surprising finding that HS consensus sequences can be identified from a plurality of HSs. Thus, in one broad aspect, the present invention relates to HS consensus sequences derived from a plurality of HS sequences - such as HS core sequences. These HS consensus sequences may be used to allow the prediction of other HS sequences using bioinformatic tools rather than using exclusively experimental tools. The availability of large portions of the human genome sequence presents an opportunity for identifying, mapping and analysing HSs on a large scale using computational approaches.
Although HSs have been mapped for human genes, no attempts have previously been made to systematically extract the relevant sequence information; much of the HS information in the scientific literature is inconsistently annotated and is frequently not sufficiently precise to annotate on the DNA sequence level. This is the first time that the existence of sequence-based rules has emerged for HSs; as far as the inventor is aware, this has never been considered to be a feasible approach.
In a first aspect, the present invention relates to a method for identifying one or more HS consensus sequences comprising the steps of: (a) providing a plurality of HS core sequences; (b) using a search algorithm to search for a plurality of motifs that are shared by the HS core sequences; and (c) returning one or more HS consensus sequences comprising a plurality of motifs identified in step (b).
The method according to the first aspect may be implemented in a variety of ways. The principle is that the availability of a plurality of HS core sequences, which may have been identified conventionally using known experimental tools (or even previous bioinformatic tools), can be used to generate large numbers of HS core sequences (the numbers of which will increase in the future) allowing the definition and extraction of sequence-based rules that can be used to identify other sites in genomes that also fulfil these rules and are therefore candidate HSs.
Preferably, the search algorithm includes a statistical model such as a Gibbs-statistical model, a Markov-statistical model, a Gaussian-statistical model, a Poisson-statistical model and a Monte Carlo-statistical model.
Preferably, the search algorithm comprises a word counting method or a probabilistic method.
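By way of illustration only, a word counting search of the kind referred to above might be sketched as follows. The function name, the fixed word length k and the threshold min_sequences are assumptions made for this sketch and are not taken from any of the programs named elsewhere in this specification; only exact words are counted, so degenerate motifs would require a probabilistic method instead.

```python
from collections import Counter


def shared_kmers(sequences, k, min_sequences=2):
    """Return each word of length k with the number of sequences containing it.

    Only words found in at least `min_sequences` different sequences are kept.
    """
    support = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A set ensures that each sequence contributes at most once per word.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return {kmer: n for kmer, n in support.items() if n >= min_sequences}
```

For example, calling shared_kmers(["ACGTAC", "TTACGT", "GGGACG"], 3) reports ACG as shared by all three sequences and CGT and TAC as shared by two.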
Preferably, the HS consensus sequences are returned as a regular expression or a sequence logo. More preferably, the HS consensus sequences are returned as a weight matrix. Most preferably, the weight matrix is a position specific scoring matrix (PSSM).
Preferably, returning the HS consensus sequences as a PSSM comprises the step of computing a score for finding a matching sequence in the plurality of HS consensus sequences.
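By way of illustration only, the compilation of a PSSM from aligned motif instances and the computation of a score for a candidate sequence might be sketched as follows. The log-odds form, the pseudocount of 1 and the uniform background frequency of 0.25 are assumptions chosen for this sketch; the specification does not prescribe a particular scoring formula, and the function names are hypothetical.

```python
import math

BASES = "ACGT"


def build_pssm(instances, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM (one dict per position) from equal-length instances."""
    pssm = []
    for pos in range(len(instances[0])):
        column = [inst[pos] for inst in instances]
        total = len(column) + 4 * pseudocount
        pssm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pssm


def score(pssm, window):
    """Sum the position-specific scores of the bases in `window`."""
    return sum(col[base] for col, base in zip(pssm, window))


def best_hit(pssm, sequence):
    """Return (score, offset) of the highest-scoring window in `sequence`."""
    width = len(pssm)
    return max(
        (score(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )
```

In this sketch only the unambiguous bases A, C, G and T are handled; a sequence containing other symbols would have to be filtered or masked beforehand.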
Preferably, the plurality of HS core sequences are identified using Global Analysis of Chromatin Topology or Hypergenomic Display.
In a second aspect, the present invention relates to a method for identifying one or more HS sequences comprising the steps of: (a) identifying a plurality of HS core sequences; (b) using a search algorithm to search for a plurality of motifs that are shared by the plurality of HS core sequences; (c) returning one or more HS consensus sequences comprising a plurality of the motifs identified in step (b); and (d) searching for one or more HS sequences comprising one or more HS consensus sequences.
Preferably, step (c) returns the HS consensus sequences as a PSSM comprising the steps of: (i) providing a plurality of HS consensus sequences; (ii) computing the score for finding a matching sequence in the plurality of HS consensus sequences; and (iii) identifying HS sequences in one or more DNA sequences that were not part of the plurality of HS core sequences using the PSSMs identified.
Preferably, the search algorithm is used in a word counting method or a probabilistic method. Preferably, the search algorithm includes a statistical model such as a Gibbs-statistical model, a Markov-statistical model, a Gaussian-statistical model, a Poisson-statistical model and a Monte Carlo-statistical model.
Preferably, the HS consensus sequences are returned as a regular expression or a sequence logo. More preferably, the HS consensus sequences are returned as a weight matrix. Most preferably, the weight matrix is a position specific scoring matrix (PSSM).
Preferably, returning the HS consensus sequences as a PSSM comprises the step of computing a score for finding a matching sequence in the plurality of HS consensus sequences.
Preferably, the plurality of HS core sequences are identified using Global Analysis of Chromatin Topology or Hypergenomic Display.
Preferably, the DNA sequences are from a database of DNA sequences.
Preferably, one or more HS sequences comprising the HS consensus sequences are searched by searching for clusters of cis-elements.
Preferably, the most probable arrangement of cis-elements in the cluster is integrated using the Viterbi algorithm.
Preferably, a forward-backward algorithm to consider the sum of all paths through a hidden Markov model is used.
In a third aspect, the present invention relates to a method for identifying an HS core sequence comprising the steps of: (a) providing a DNA sequence in the sense or antisense orientation that is not part of the plurality of HS core sequences; (b) providing an HS sequence; and (c) searching the DNA sequence for the presence of a hypersensitive restriction site. Preferably, the HS sequence is between about 50 nucleotides and about 200 nucleotides in length.
Preferably, the DNA sequence is 1 kb in length.
Preferably, the method for identifying an HS core sequence comprises the additional step of using the identified HS consensus sequences or HS sequences to prepare a nucleic acid construct.
Preferably, the methods according to the present invention comprise the additional step of using the identified HS consensus sequences or HS sequences in an assay (or assay development program) and/or a pharmaceutical (or in the preparation of or development of a pharmaceutical).
In a fourth aspect, the present invention relates to a method of treating a disease associated with chromatin structure in a subject, the method comprising administering to the subject an effective amount of a chromatin modulating (e.g. modifying) agent capable of modulating (e.g. modifying) the chromatin structure to a non-diseased form.
In a fifth aspect, the present invention relates to a pharmaceutical composition comprising a chromatin modulating agent and a pharmaceutically acceptable carrier, diluent, excipient or adjuvant or any combination thereof.
In a sixth aspect, the present invention relates to a method of preventing and/or treating a disorder comprising administering a chromatin modulating agent wherein said chromatin modulating agent is capable of modulating an HS to cause a beneficial preventative and/or therapeutic effect.
In a seventh aspect, the present invention relates to the use of a chromatin modulating agent in the preparation of a pharmaceutical composition for the treatment of an HS related disorder.
In an eighth aspect, the present invention relates to one or more HS consensus sequences identifiable, preferably identified using the methods of the present invention or a variant, derivative, or homologue thereof.
In a ninth aspect, the present invention relates to an HS sequence identifiable, preferably identified using the methods of the present invention or a variant, derivative, or homologue thereof.
In a tenth aspect, the present invention relates to weight matrices identifiable, preferably identified using the methods of the present invention. More preferably, the weight matrices are PSSMs.
In an eleventh aspect, the present invention relates to a recording medium bearing machine readable instructions for implementing the first to the third aspects of the invention.
In a twelfth aspect, the present invention relates to a computer system loaded with machine readable instructions for implementing the first to the third aspects of the invention.
DESCRIPTION OF THE DRAWINGS
Figure 1 is a diagrammatic representation of a HS core sequence comprising 100 nucleotides of genomic sequence immediately adjacent to a hypersensitive Mbo I target site (204 bp in total).
Figure 2 is a diagrammatic representation of the identification of a HS sequence in Human β-globin constitutive HS5 (Genbank Accession No. AF064190). A single strong signal, as indicated by a distinct peak of predicted HS potential centered around position 6,200 in the nucleotide sequence, is detected which coincides precisely with the experimentally mapped constitutive HS5 (Dhar et al., 1990).
Figure 3 is a diagrammatic representation of the identification of HS sequences in Mouse mammary tumour virus 3' long terminal repeat (MMTV-3' LTR; Genbank Accession No. MMTPRO). Both experimentally mapped HSs, including a constitutive and a glucocorticoid-inducible site, are reliably detected as indicated by distinct peaks of predicted HS potential centered around positions 200 and 1300 in the nucleotide sequence (Zaret and Yamamoto, 1984).
Figure 4 is a diagrammatic representation of the identification of HS sequences in Human vascular endothelial growth factor A promoter (Genbank Accession No. AF005785). A strong signal is detected at position 2600. Experimental data shows the presence of two HSs in this area (Liu et al, 2001). The presence of a single broad peak suggests that in some cases the clustering algorithm of CISTER causes artifactual merging of motif clusters from adjacent HSs. Also, two other experimentally mapped sites are only weakly detected.
Figure 5 is a diagrammatic representation of the identification of HS sequences in Human erythropoietin (embedded in 13 kb of human genome sequence). Strong signals are detected from an experimentally mapped regulated 5' located HS and from two HSs located at the 3' end of the gene (Zhang et al, 2000). Some merging of the predicted HS signals from the two separate 3' HSs is observed in the computer prediction due to the CISTER algorithm. The program also shows a HS signal within the transcribed region, which is compatible with the experimentally observed emergence of hypersensitivity of the gene at the onset of active expression.
Figure 6 is a diagrammatic representation of the identification of HS sequences in human c-Myc (embedded in 55 kb of human genome sequence). This is the largest region yet analysed. HSs surrounding the 5' and 3' end of the c-Myc gene (Mautner et al, 1995) are reliably detected. There are some additional strong signals in the surrounding regions for which no experimental data is currently available. This indicates the high signal/noise ratio achievable with the current set-up.
Figure 7 is a diagrammatic representation of the result of a computer-based experiment. The DNA sequence tested consists of a continuous string of the 59 HS core sequences (shown in blue or light grey and dark grey), preceded by the same sequence randomised (shown in red or medium grey). This procedure therefore creates a test sequence (TestSeq) which contains two halves: the first half is random (and thus should lack HS-specific motifs), whereas the other half is 'packed' with all the HS sequences. PSSMs were compiled from two non-overlapping subsets of the HS core sequences: motifs were separately derived from 'collection A' (shown in light blue or light grey) and 'collection B' (shown in dark blue or dark grey).
Figure 8 schematically illustrates a general purpose computer (132) of the type that may be used to perform the methods in accordance with the present invention. The computer (132) includes a central processing unit (134), a read only memory (136), a random access memory (138), a hard disk drive (140), a display driver (142) and display (144) and a user input/output circuit (146) with a keyboard (148) and mouse (150) all connected via a common bus (152).
The central processing unit (134) may execute program instructions stored within the ROM (136), the RAM (138) or the hard disk drive (140) to carry out processing of signal values that may be stored within the RAM (138) or the hard disk drive (140). The program may be written in a wide variety of different programming languages. The computer program itself may be stored and distributed on a recording medium, such as a compact disc, or may be downloaded over a network link (not illustrated). The general purpose computer (132) when operating under control of an appropriate computer program effectively forms an apparatus for performing aspects of the present invention - such as identifying one or more HS consensus sequences, HS sequences and HS core sequences.
Figure 9 is a Table listing SEQ ID No. 3 to 55.
Figure 10 is a Table listing HS consensus sequences.
Figure 11 is a Table listing HS consensus sequences, which are shown in bold. Instead of a more precise PSSM they are written in code to indicate redundancies in certain positions. Guanine, adenine, thymine, cytosine: G, A, T, C; Purine (adenine or guanine): R; Pyrimidine (thymine or cytosine): Y; Adenine or thymine: W; Guanine or cytosine: S; Adenine or cytosine: M; Guanine or thymine: K; Adenine or thymine or cytosine: H; Guanine or cytosine or thymine: B; Guanine or adenine or cytosine: V; Guanine or adenine or thymine: D; Guanine or adenine or thymine or cytosine: N. This is the international IUPAC nomenclature.
Figure 12 is a Table listing YEBIS PSSMs.
Figure 13 is a Table listing YEBIS-MATRIX PSSMs.
HYPERSENSITIVE SITES
Nuclease Hypersensitive Sites (HSs) are genomic sites that are highly susceptible to nuclease attack under experimental conditions - typically by approximately two orders of magnitude as compared to bulk chromatin (see Stalder et al, 1980; Wu, 1980). All available data suggests that HSs are mostly free of nucleosomes, but contain a number of transcription factor complexes that are bound to specific sequence motifs present in the genomic DNA.
HSs can be viewed as the gateways to the genome for the vast majority of molecules involved in regulating gene expression and many other important genomic functions, such as DNA replication, repair, recombination and insertion of retroviral genomes (reviewed in Gross and Garrard, 1988). They expose or hide gene regulatory signals and therefore constitute one of the most important epigenetic regulatory layers that are superimposed on the genome to control and direct its expression (Bonifer 2000).
HSs can be present in a number of forms - such as constitutive HSs, developmentally regulated HSs, tissue-specific HSs and cell type-specific HSs.
Many 'housekeeping' genes, especially those that are expressed in a large variety of tissues and continuously during development, contain constitutive HSs that are present in most cell types. The promoter directing the expression of the human β-globin gene cluster contains several HSs that are also present in a number of non-erythroid cells (Dhar et al, 1990). It is likely that the DNA present in many constitutive HSs contains primary sequence motifs that ensure HS formation in a manner that is independent of tissue-specific factors. Simmonsson et al. (1998) reported an unusual tetraplex DNA structure in the HS of the c-myc promoter, and other HSs are known to contain nuclease S1-sensitive DNA, indicating the presence of stem-loop or other unusual secondary structures (Mielke et al., 1996). This hypothesis is further supported by observations showing that a DNA segment containing the 'HS-2' region of the human β-globin promoter establishes a bona fide HS conformation, even when maintained as part of an artificial chromosome in a yeast host strain (Svetlova et al., 1998). Such constitutive HSs could, in addition to regulating the expression of adjacent genes, serve as border elements to define functional chromatin domains, or could facilitate the precise folding patterns of individual chromatin fibres (Filipsky et al, 1990).
The continuous reconfiguration of chromatin architecture is an essential prerequisite for directing the changing gene expression patterns during embryonic development and cell type-specific differentiation. Many HSs - such as developmentally regulated HSs - are created near defined subsets of genes in a tissue- and stage-specific manner (see e.g. Gross and Garrard, 1988) due to the local activity of transcription factors and chromatin remodeling machineries (reviewed in Wolffe and Hayes, 1999). The creation of HSs near genes is one of several steps in the pathway that prepares a regulatory sequence to become functionally active in chromatin.
One of the best-understood model systems is the chicken lysozyme gene, where the HS configuration on its promoter has been shown to be highly dynamic. Several distinct HSs appear and disappear over different promoter elements as cells progress through haemopoietic development (Huber et al., 1995; see Kontaraki et al., 2000). In many cases, a direct correlation between the appearance and disappearance of HSs and known biological functions has been shown.
HS CONSENSUS SEQUENCE
As used herein, the term "HS consensus sequence" refers to a plurality of motifs, i.e. nucleotides, that are common, although not necessarily identical, to other nucleotides in an HS sequence. Thus, an HS consensus sequence is an idealised sequence that represents the most likely motif to occur at each position within an HS sequence.
As used herein, the term "plurality of motifs" refers to more than one motif. Preferably, a plurality of motifs refers to at least about 2 to 200 or more motifs; more preferably, at least about 2 to 100 or more motifs; more preferably, at least about 2 to 50 or more motifs; more preferably, at least about 2 to 20 or more motifs; more preferably, at least about 6, 7, 8, 9, 10, 11, 12, 13, 14 or 15 motifs; most preferably, 7, 8, 12, 13 or 15 motifs; or any suitable combination of start or end points, for example, at least about 6 to 50 or more motifs.
The present invention demonstrates that it is possible to identify and extract HS consensus sequences comprising motifs that are shared by different HS sequences. There may be different functional types of HSs that may contain different sets of shared HS consensus sequences. It is possible to identify HS consensus sequences in other DNA sequences as a bioinformatic tool for the in silico prediction of HSs in these sequences in the absence of experimentally derived information.
HS CORE SEQUENCE
As used herein, the term "HS core sequence" refers to motifs, i.e. nucleotides, that are typically within about 100 to 200 base pairs of a hypersensitive target site. Motifs may be fragmented at hypersensitive target sites by various entities including chemical or physical agents such as bleomycin, bromoacetaldehyde, chloroacetaldehyde, cobalt chiral complex, copper phenanthroline, diethyl pyrocarbonate, dimethyl sulfate, iron(II)-EDTA, methidiumpropyl-EDTA, neocarzinostatin, psoralen and ultraviolet light. The entity may be an enzyme such as a sequence specific nuclease, a non-sequence specific nuclease, Bal-31, DNase I, DNase II, an endogenous nuclease, exonuclease III, lambda exonuclease, micrococcal nuclease, mung bean nuclease, Neurospora crassa nuclease, a restriction enzyme including type I, II and III restriction enzymes, S1 nuclease or a topoisomerase such as topoisomerase I or II. Preferably, the entity is an enzyme. More preferably, the entity is a restriction enzyme. Even more preferably, the restriction enzyme recognises at least a 4 base pair (bp) target sequence. Most preferably, the restriction enzyme is selected from the group consisting of DpnII, MboI (Figure 1), NlaIII, SauIIIA and Tsp509I. The methods of the present invention may involve the use of one or more such entities. For some embodiments, two different entities may be used - such as a restriction enzyme that recognises a 4 bp target sequence and a restriction enzyme that recognises a 6 bp target sequence.
A "plurality of HS core sequences" refers to at least about 3 HS core sequences and preferably is selected from the group comprising: about at least 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 HS core sequences; about at least 21-100 HS core sequences; about at least 101-1000 HS core sequences; about at least 1001 to 5000 HS core sequences; about at least 5001 to 10000 HS core sequences; about at least 10001 to 50000 HS core sequences; and about at least 50001 to 100,000 HS core sequences; or any suitable combination of start or end points, for example, at least about 15 to about 100,000 HS core sequences.
Experimentally defined HSs are around 200 nucleotides in length and so for some aspects of the present invention, the HS core sequence may comprise 200 or more nucleotides. As the data set of the HS sequences increases in size, it may be advantageous to reduce the size definition of the HS core sequence to increase the likelihood that all sequence data used in the subsequent analysis is derived from the DNA in hypersensitive configuration. Thus, for some aspects of the present invention the HS core sequence may comprise 199 or less nucleotides. Whilst the reduction in the size definition of the HS core sequence may not have any effect on the validity of the approach described here it may reduce the amount of 'background noise' in the motif-extraction step. Accordingly, a person skilled in the art will understand that the size of the HS core sequence may be optimised depending on the size of the HS data set and the amount of background noise that is detected. Thus, the HS core sequence may even comprise 150 or less nucleotides, 100 or less nucleotides, 50 or less nucleotides or even 25 or less nucleotides.
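The size definition discussed above can be illustrated with a minimal sketch that cuts an HS core sequence out of a longer genomic sequence, given the position of an experimentally mapped hypersensitive target site. The function name, its parameters and the toy sequence are assumptions made for illustration only; the default of 100 flanking nucleotides on each side of a 4 bp site follows the 204 bp layout described for Figure 1.

```python
def extract_core_sequence(genome, site_start, site_length=4, flank=100):
    """Return the target site plus up to `flank` nucleotides on each side.

    `site_start` is the 0-based position of a mapped hypersensitive target
    site (hypothetical input; in practice it would come from experiment).
    """
    left = max(0, site_start - flank)
    right = min(len(genome), site_start + site_length + flank)
    return genome[left:right]


# Toy example: a GATC site at position 150 of a made-up sequence.
genome = "A" * 150 + "GATC" + "C" * 150
core = extract_core_sequence(genome, site_start=150)
print(len(core))  # 204 = 100 + 4 + 100

# A narrower size definition, as contemplated for larger data sets.
narrow_core = extract_core_sequence(genome, site_start=150, flank=25)
```

Reducing `flank` is the only change needed to explore smaller core definitions when the background noise in the motif-extraction step is judged too high.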
IDENTIFYING HS CONSENSUS SEQUENCES
According to one aspect, the present invention relates to a method for identifying one or more HS consensus sequences comprising the steps of: (a) providing a plurality of HS core sequences; (b) using a search algorithm to search for a plurality of motifs that are shared by the HS core sequences or subsets thereof; and (c) returning one or more HS consensus sequences comprising a plurality of motifs identified in step (b).
(a) Providing a plurality of HS core sequences.
High-throughput procedures for identifying and mapping a plurality of HSs - for example, a plurality of HSs in eukaryotic cells - have been developed and are described in UK Patent Application 0116453.2 filed 5th July 2001 (the contents of which are incorporated herein by reference). The techniques are capable of yielding a large number of HSs using various methods as described therein - such as 'Hypertag Display' and 'Global Mapping'.
The principle of Hypertag Display is as follows: DNA present in HSs is selectively cut with the restriction enzyme Mbo I, which recognises the 4 bp target sequence 5'-GATC-3'. Ligation of a compatible BamH I adapter molecule to the cleaved ends results in the selective tagging of each cleaved Mbo I site with a fragment of predetermined and known sequence. The tagged fragment is subsequently amplified by PCR using an oligonucleotide complementary to the adapter molecule and a second oligonucleotide (the 'Hypertag' primer) that is complementary to a sequence located next to a previously mapped HS. The required local sequence information can be derived from data obtained through the HS library approach. This procedure results in the production of a PCR amplification product of defined size (a 'hypertag'). Note, however, that the synthesis of the PCR product will depend entirely on a successful Mbo I cleavage event, which in turn is dependent on the local chromatin environment. If cleavage occurs because the target sequence is in HS-configuration, a specific PCR fragment will result. If the Mbo I target site is inaccessible due to dense chromatin packaging and absence of suitable HSs, no PCR product will be detected. The production of an amplified PCR fragment thus serves as a specific and sensitive beacon for the presence/absence of a predetermined HS within a eukaryotic genome.
Another method for identifying hypersensitive sites, also described in UK Patent Application 0116453.2 filed 5 July 2001 (the contents of which are incorporated herein by reference), relates to the Global Mapping of Chromatin Configurations. To create a genomic library highly enriched in DNA fragments derived from HSs, DNA present in nuclei is digested with a restriction enzyme under controlled conditions. To maximize cleavage in the majority of sites, the restriction enzyme chosen will usually cut a 4 bp target sequence and produce a 'sticky end' suitable for cloning. One example of an enzyme fulfilling these criteria is MboI. The MboI-cleaved genomic DNA is ligated to a BamHI-cleaved and phosphatased plasmid vector. This step covalently joins the DNA sequences adjacent to the MboI cleavage site to the plasmid DNA, but does not yet result in the formation of a functional recombinant DNA molecule; the other end of the genomic fragment will usually be tens of kilobases away, may also be randomly sheared during the genomic DNA extraction step and will thus not be suitable for specifically joining the other BamHI site in the linearised plasmid. To create genomic fragments suitable for cloning, the ligation mixture is cut to completion with EcoRI. This enzyme cuts at a defined site within the polylinker of the plasmid vector and also cuts every target site present in the genomic DNA. Since the DNA fragments adjacent to the MboI sequence are already ligated to the plasmid vector (see above), this step creates the condition for specifically joining the other end of the construct through intramolecular ligation. Due to the random orientation of the plasmid vector relative to the MboI fragment during the first ligation, approximately 50% of the clones are lost at this stage, but the other 50% of ligation products will contain specifically cloned MboI-EcoRI genomic fragments.
Transfection of the ligation products results in the creation of a library of genomic DNA fragments derived from a large variety of HSs. Determination of the insert sequence adjacent to the BamHI-MboI junction of each clone establishes, after a search against human genome databases, the precise genomic location of the MboI site and thus allows the positioning of a specific HS surrounding it.
Further increases in the number of HS sequences identified will increase the size of the HS sequence database. This will help to refine the search for HS-consensus sequences.
(b) Using a search algorithm to search for a plurality of motifs that are shared by HS core sequences.
Various search algorithms may be used to search for a plurality of motifs that are shared by HS core sequences. Several methods to search for over-represented motifs in the upstream region of a set of coregulated genes have been developed and tested as described by Ohler & Niemann (2001) Trends Genet. 17, 56-60. These methods can be divided into two different types: (1) methods based on word counting and (2) methods based on probabilistic sequence models.
(1) Word counting methods are based on the frequency analysis of oligonucleotides in sequences and overrepresentation is measured by comparing the counted number of occurrences of a word to the expected number of occurrences. A common motif is then compiled by grouping similar words. Word counting methods have been described by Jensen & Knudsen (2001) Bioinformatics 16, 326-333 and Van Helden et al. (1998) J. Mol. Biol. 281, 827-842.
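The word counting idea described above can be sketched as follows. This is a minimal illustration rather than the procedure of any of the cited programs: the expected count of each word is estimated here, as a simplifying assumption, from the single-nucleotide frequencies of the input sequences, and the function name and toy input are hypothetical.

```python
from collections import Counter


def word_overrepresentation(sequences, k):
    """Map each observed k-mer to (observed, expected, observed/expected).

    The expected count assumes independent nucleotides drawn with the
    frequencies found in the input sequences (an illustrative assumption).
    """
    base_counts = Counter(base for seq in sequences for base in seq)
    total_bases = sum(base_counts.values())
    freq = {base: n / total_bases for base, n in base_counts.items()}

    observed = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    n_windows = sum(observed.values())

    result = {}
    for word, count in observed.items():
        probability = 1.0
        for base in word:
            probability *= freq[base]
        expected = probability * n_windows
        result[word] = (count, expected, count / expected)
    return result


# Toy input: "AC" occurs twice among three 2-mer windows.
stats = word_overrepresentation(["ACAC"], k=2)
print(stats["AC"])
```

Words with a high observed/expected ratio would then be grouped by similarity to compile a common motif, a step not shown here.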
(2) In probabilistic methods, the motif model is represented as a position probability matrix and the motif is assumed to be hidden in a noisy background sequence. To find the parameters of such a model, maximum likelihood estimation is used. The most frequent methods used for this are Expectation Maximisation (a maximum likelihood algorithm for estimating the parameters of a probabilistic model) and Gibbs Sampling (a stochastic equivalent of Expectation Maximisation). The drawback of these algorithms is that they are sensitive to noise, which may have an experimental origin, may arise as an artifact of the clustering process, or may result from the large size of the sequences of the selected genes as compared to the small sizes of the motifs. It is therefore preferred that a detection algorithm is used that can cope with such noise and discriminate between motifs that are over-represented by chance and motifs that are biologically functional.
Various probabilistic models have been reported which use a simple background model based on the frequency of nucleotides in the data set to represent an intergenic sequence (Bailey & Elkan (1995) Mach. Learn. 21, 51-80; Lawrence et al. (1993) Science 262, 208-214).
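To make the probabilistic approach more concrete, the following is a minimal sketch of a Gibbs site sampler that looks for one motif occurrence per sequence, using a position probability matrix for the motif and a background model based on the nucleotide frequencies of the data set. It is an illustrative simplification and not the implementation used by any of the cited programs; the function names, pseudocounts, iteration count and toy sequences are all assumptions, and the input is assumed to contain only A, C, G and T.

```python
import random
from collections import Counter

BASES = "ACGT"


def position_probability_matrix(sites, pseudocount=1.0):
    """Build a position probability matrix from equal-length sites."""
    total = len(sites) + 4 * pseudocount
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)
        matrix.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return matrix


def gibbs_motif_search(sequences, width, iterations=200, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    # Background model: nucleotide frequencies of the whole data set.
    counts = Counter("".join(sequences))
    total = sum(counts[b] for b in BASES)
    background = {b: (counts[b] + 1) / (total + 4) for b in BASES}
    # Start from random positions.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        for k, seq in enumerate(sequences):
            # Motif model estimated from every sequence except the k-th.
            sites = [
                s[p:p + width]
                for j, (s, p) in enumerate(zip(sequences, starts))
                if j != k
            ]
            matrix = position_probability_matrix(sites)
            # Motif-versus-background likelihood ratio of every window.
            weights = []
            for p in range(len(seq) - width + 1):
                ratio = 1.0
                for i, base in enumerate(seq[p:p + width]):
                    ratio *= matrix[i][base] / background[base]
                weights.append(ratio)
            # Sample the new position in proportion to those ratios.
            starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


toy_sequences = ["ACGTACGTTTGACAGG", "TTGACAGGACGTACGT", "CCCCTTGACAGGCCCC"]
starts = gibbs_motif_search(toy_sequences, width=8, iterations=50, seed=1)
```

Because the procedure is stochastic, different seeds may converge on different positions; in practice several runs would be compared.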
Preferably, the search algorithm may include a statistical model such as a Gibbs-statistical model, a Markov-statistical model, a Gaussian-statistical model, a Poisson-statistical model or a Monte Carlo-statistical model.
By way of example, HS consensus sequences may be identified using YEBIS or MOTIFSAMPLER.
YEBIS (Yada et al, 1998) is available at www-scc.jst.go.jp/YEBIS/MotifExtraction. This program is capable of extracting a set of sequence motifs without any a priori knowledge from a number of related DNA sequences. YEBIS uses an algorithm based upon a Markov statistical model and may be applied to a large number of unaligned sequences.
By way of example, 17 different HS consensus sequences, ranging in length from 7 to 13 nucleotides, have been identified using the default parameters in YEBIS (Appendix 3). Some of the motifs (e.g. 'Motif 1') are relatively short, but are present as identical copies in several different HS sequences. Other motifs are substantially longer and display some degree of variability, but consensus motifs with highly conserved residues in particular positions emerge in all cases.
MOTIFSAMPLER (Thijs et al, 2001) is available at www.esat.kuleuven.ac.be/~thijs/Work/MotifSampler.html. This software package tries to find over-represented motifs in the upstream region of a set of co-regulated genes. This motif finding algorithm uses Gibbs sampling to find the position probability matrix that represents the motif. Higher-order background models are used to improve the robustness of the motif finding. MotifSampler comes with background models for several organisms but is also suitable for other organisms since the background model can also be calculated from the input sequences. This programme differs from YEBIS because the length of the motifs and number of detected motifs can be entered as part of the search criteria. MotifSampler requires four search parameters, including motif length and copy number.
By way of example, the HS core sequences were analysed by specifying the expected lengths of motifs as 8, 12 and 15 in three independent runs. Consensus sequences shared by different members of the HS core sequences were successfully identified. For some aspects of the present invention, MOTIFSAMPLER motifs of length 12 are preferred.
Thus, both YEBIS and MotifSampler are applications for searching for motifs, such as those that may be characteristic of HS sequences. The motifs are extracted and can be sorted into groups. An algorithm including a statistical model is applied and a matrix is returned that is derived by scoring the motifs at each position; this matrix can be used to define the variability between motifs.
The plurality of HS core sequences may be aligned prior to searching such that correspondences are assigned to preserve the order of the residues within the HS core sequences by identifying a start point, and if necessary introducing gaps.
The plurality of HS core sequences may be realigned to accommodate a statistical model.
(c) Returning HS consensus sequences comprising a plurality of motifs identified in step (b).
In a preferred embodiment of the present invention, the HS consensus sequences are returned as a regular expression, or a sequence logo, i.e. a graphic method of illustrating consensus information comprising coloured letters of different sizes, where the letters indicate different proportions of motifs.
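As an illustration of returning a consensus as a regular expression, the sketch below converts a consensus written in IUPAC ambiguity codes into a regular expression and uses it to find matches, including overlapping ones, in a DNA sequence. The function names and the example consensus "GATY" are hypothetical and are not taken from the Figures.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "M": "[AC]", "K": "[GT]", "H": "[ACT]", "B": "[CGT]",
    "V": "[ACG]", "D": "[AGT]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC[code] for code in consensus.upper())


def find_matches(consensus, sequence):
    """Return (position, matched text) pairs, allowing overlaps."""
    # A lookahead group lets the scan report overlapping occurrences.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]


print(consensus_to_regex("GATY"))             # GAT[CT]
print(find_matches("GATY", "AGATCGATTGATA"))  # [(1, 'GATC'), (5, 'GATT')]
```

A regular expression gives a yes/no answer for each window, which is the limitation that weight matrices are intended to overcome.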
In another preferred embodiment the HS consensus sequences are returned as a weight matrix. Background teachings on weight matrices have been presented by Frech et al. (1997). The following information concerning weight matrices has been extracted from that source:
A weight matrix uses the complete composition of nucleotides for each position of an alignment to achieve a more differentiated rating of a matching sequence. For example, a single position of an alignment of 12 sequences containing TTTTTTTAAACC (each letter representing one sequence at this position) would be assigned T in the IUPAC consensi. A new sequence with a T at this position would be considered a match while an A at the same place would cause the whole sequence to be dismissed as no match. Even a simple nucleotide distribution matrix would assign a weight score (in this case proportional to the percentage of the nucleotide) of 0.58 to the T and still 0.25 to an A. Thus, weight matrices represent the similarity of the tested sequence to all of the sequences in the alignment much better than IUPAC consensi. Most weight matrix-based methods add some more weighting by comparison of the actual nucleotide distribution with random values or by other statistical measures, e.g. information content.
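The worked example above (the column TTTTTTTAAACC) can be reproduced with a few lines of code. This is a minimal sketch of a simple nucleotide distribution matrix only, without the additional statistical weighting mentioned above; the function name is an assumption.

```python
from collections import Counter


def column_weights(column):
    """Fraction of each nucleotide in one alignment column."""
    counts = Counter(column)
    return {base: counts[base] / len(column) for base in "ACGT"}


# One alignment position across 12 sequences, as in the example above.
weights = column_weights("TTTTTTTAAACC")
print(round(weights["T"], 2))  # 0.58 (7 of 12)
print(weights["A"])            # 0.25 (3 of 12)
```

Under an IUPAC consensus the A would be scored as a mismatch, whereas here it still contributes a weight of 0.25.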
A number of weight matrix based search programs are available and some have been reviewed by Frech et al. For example, SIGNAL SCAN (Prestridge et al. (1996) Comput. Appl. Biosci. 12, 157-160) searches a sequence for matches to the library of IUPAC consensuses or offers a matrix-based approach. MATRIX-SCAN (Chen et al. (1995) Comput. Appl. Biosci. 11, 563-566) uses a matrix library containing more than 200 matrices. MatInspector (Ghosh (1993) Nucleic Acid Res. 21, 3117-3118) allows testing of individually selected matrices. ConsInspector (Kondrakhin et al. (1995) Comput. Appl. Biosci. 11, 477-488) includes analysis of the region around the binding site.
Weight matrices are advantageously used in the present invention because they are much less sensitive to sequence selection and provide a quantitative score. Even a single mismatch at a critical position will reduce the score of the match.
Preferably, the weight matrix is a position specific scoring matrix (PSSM). PSSMs as described by Frech et al. (1997) use the complete composition of nucleotides for each position of the alignment to achieve a more differentiated rating of a matching sequence. More preferably, the HS consensus sequences are returned as a PSSM comprising the steps of: (a) computing a score for finding a matching sequence in the plurality of HS consensus sequences; and (b) returning the variability in the plurality of HS consensus sequences as a PSSM.
HS consensus sequences may be returned as a PSSM using various methods known in the art such as E-matrix maker (Thomas et al. (1999) Journal of Computational Biology 6: 219-235; Thomas et al. (2000) Bioinformatics 16: 233-244), which is available at http://motif.stanford.edu/ematrix-maker.
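A PSSM and its use for scoring can be sketched as follows. This is a minimal illustration assuming a log-odds scoring scheme with a uniform background and a pseudocount of 1; it is not the method of E-matrix maker or of any other named program, and the function names and toy motif instances are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pssm(sites, background=None, pseudocount=1.0):
    """Log-odds PSSM (one dict per position) from aligned motif instances."""
    background = background or {b: 0.25 for b in BASES}
    total = len(sites) + 4 * pseudocount
    pssm = []
    for column in zip(*sites):
        counts = Counter(column)
        pssm.append({
            b: math.log2(((counts[b] + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pssm


def scan(pssm, sequence):
    """Score every window of the sequence; return (position, score) pairs."""
    width = len(pssm)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        hits.append((i, sum(col[b] for col, b in zip(pssm, window))))
    return hits


# Toy aligned motif instances (made up for illustration).
pssm = build_pssm(["GATC", "GATC", "GATT", "GACC"])
best_position, best_score = max(scan(pssm, "AAAGATCAAA"), key=lambda h: h[1])
print(best_position)  # 3, the window reading GATC
```

Unlike a consensus match, every window receives a quantitative score, so a single mismatch lowers the score rather than excluding the window outright.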
IDENTIFYING AN HS SEQUENCE
In a further aspect, the present invention relates to a method for identifying an HS sequence comprising the steps of: (a) identifying a plurality of HS core sequences; (b) using a search algorithm to search for a plurality of motifs that are shared by the plurality of HS core sequences; (c) returning one or more HS consensus sequences comprising a plurality of the motifs identified in step (b); and (d) searching for an HS sequence comprising one or more HS consensus sequences.
Thus, it is possible to identify HS consensus sequences present in other DNA sequence(s) as a bioinformatic tool for the in silico prediction of HS sequences in the absence of experimentally derived information. The availability of large portions of the human genome sequence presents an opportunity for identifying, mapping and analysing HSs on a large scale using computational approaches.
Steps (a), (b) and (c) have been described previously.
(d) Searching for an HS sequence comprising one or more HS consensus sequences.
HS sequences comprising HS consensus sequences may be searched for using various methods known in the art. Preferably, DNA sequences that are not part of the plurality of HS core sequences are searched for the presence of HS consensus sequences by searching for clusters of cis-elements.
More preferably, the most probable arrangement of cis-elements in the cluster is determined using the Viterbi algorithm. By way of example, Cluster Of Motifs E-value Tool (COMET) (http://zlab.bu.edu/~mfrith/comet/form.html) finds statistically significant clusters of motifs in a DNA sequence. The motifs are represented using 4 x L matrices, which record the frequencies of the nucleotides A, C, G, and T at each position in the motif. COMET assigns a positive score to each motif using the standard method of log likelihood ratios, and subtracts a 'gap penalty' linearly proportional to the distances between motifs. Thus each motif cluster receives a score, which is higher if the individual motifs are stronger, but lower if they are further apart. The scoring scheme corresponds to a log likelihood ratio of explaining the data given a cluster model versus a background model. The cluster model is for cis-elements to occur in a uniform distribution, with some intensity, whereas the background model consists of random nucleotides. The gap penalty corresponds in a one-to-one fashion with the intensity parameter of the cluster model.
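The scoring scheme described above may be sketched as follows. This is a simplified illustration and not the COMET implementation: motif matches are supplied as hypothetical (start, end, score) tuples, the gap penalty of 0.05 per nucleotide is an arbitrary assumption, and a simple dynamic programme selects the chain of non-overlapping matches with the highest total score after the linear gap penalties have been subtracted.

```python
def best_cluster(hits, gap_penalty=0.05):
    """Return (cluster_score, chain) for the best-scoring chain of motif hits.

    hits: list of (start, end, score), with end exclusive.
    gap_penalty: assumed cost per nucleotide between consecutive motifs.
    """
    hits = sorted(hits)
    best = []  # best[i] = (score of best chain ending at hit i, previous index)
    for i, (start, end, score) in enumerate(hits):
        best_score, prev = score, None
        for j in range(i):
            gap = start - hits[j][1]
            if gap < 0:
                continue  # overlapping hits cannot both be in the chain
            candidate = best[j][0] + score - gap_penalty * gap
            if candidate > best_score:
                best_score, prev = candidate, j
        best.append((best_score, prev))
    if not best:
        return 0.0, []
    k = max(range(len(best)), key=lambda idx: best[idx][0])
    top_score = best[k][0]
    chain = []
    while k is not None:  # trace back through the stored predecessors
        chain.append(hits[k])
        k = best[k][1]
    return top_score, chain[::-1]

hits = [(10, 16, 5.0), (20, 26, 4.0), (500, 506, 6.0)]
cluster_score, cluster = best_cluster(hits)
```

In this example the two neighbouring matches form the cluster, whereas the distant third match is excluded because the gap penalty for reaching it outweighs its score.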
Most preferably, a forward-backward algorithm, which considers the sum of all paths through a hidden Markov model, is used. By way of example, CISTER (Frith et al, 2001) detects cis-element clusters by using a statistical model (a hidden Markov model) of what it expects these clusters to look like. The parameters allow the user to vary some aspects of the model, and it is quite possible that different model parameters are suitable for different types of motif cluster. The parameters include: (i) a, the mean distance between neighbouring cis-elements within a cluster, which is assumed to be geometrically distributed; (ii) b, the mean number of cis-elements in a cluster, which is assumed to be geometrically distributed; and (iii) g, the mean distance between regulatory cis-element clusters, which is assumed to be geometrically distributed. The background states are programmed to represent the local abundances of the 4 bases in the query sequence. Examining local abundances accounts for the biological reality of heterogeneous base composition, and prevents, for example, many spurious GC-rich motifs being detected in a part of the sequence that happens to be generally GC-rich. CISTER uses the technique of posterior decoding with this hidden Markov model.
Advantageously, a reformatting program may be used to convert the PSSMs into a format that a program subsequently used to process the PSSMs - such as CISTER - can understand. As described above, various search algorithms may be used to search for a plurality of motifs that are shared by HS core sequences. Some programs - such as YEBIS - provide a fractional value for each of the four nucleotides for each position of the PSSM. The reformatting program typically extracts data that show the aligned motifs by providing the actual numbers of occurrences of each nucleotide in each position of the PSSM. The results that are returned are actual numbers rather than fractions.
Consequently, the results that are returned are slightly more accurate, and the program that is subsequently used to process the PSSMs may expect the PSSMs in the latter format. As the number of HS consensus sequences increases, the methods of the present invention may be used to define high-resolution stochastic models to optimise the recognition rate of HSs. Accordingly, it may be possible to determine whether certain consensus sequences occur in a particular combination, whether the spacing between certain motifs is important, the relative frequency of each motif and whether some of the consensus sequences are more diagnostic than others.
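A reformatting step of this kind may be sketched as shown below. The sketch assumes that the number of aligned motif instances is known, so that the fractional values can be converted back into whole-number counts. The function names and the plain-text output layout (a ">" name line followed by one row of A, C, G and T counts per position) are hypothetical and do not reproduce the actual file formats of YEBIS or CISTER.

```python
def fractions_to_counts(fraction_rows, n_sites):
    """Convert per-position base fractions into integer occurrence counts.

    fraction_rows: list of {base: fraction} dicts, one per PSSM position.
    n_sites: number of aligned motif instances (assumed to be known).
    """
    counts = []
    for row in fraction_rows:
        col = {base: round(row[base] * n_sites) for base in "ACGT"}
        if sum(col.values()) != n_sites:
            raise ValueError("fractions are inconsistent with n_sites")
        counts.append(col)
    return counts

def format_counts(name, counts):
    """Write counts in a hypothetical layout: name line, then A C G T rows."""
    lines = [">" + name]
    for col in counts:
        lines.append(" ".join(str(col[base]) for base in "ACGT"))
    return "\n".join(lines)

rows = [
    {"A": 0.75, "C": 0.0, "G": 0.25, "T": 0.0},
    {"A": 0.0, "C": 0.5, "G": 0.0, "T": 0.5},
]
counts = fractions_to_counts(rows, n_sites=4)
text = format_counts("motif1", counts)
```

Checking that each column sums to the number of sites guards against rounded fractions that no longer correspond to whole counts.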
Preferably, step (c) returns the HS consensus sequences as a PSSM comprising the steps of: (i) providing a plurality of HS consensus sequences; (ii) computing the log-odds score for finding a matching sequence in the plurality of HS consensus sequences; (iii) returning the variability in the plurality of HS consensus sequences as a PSSM; and (iv) identifying HS sequences in one or more DNA sequences that were not part of the plurality of HS core sequences using the PSSMs identified.
HS consensus sequences may be returned as a PSSM using various methods known in the art such as E-matrix maker (Thomas et al. (1999) Journal of Computational Biology 6: 219-235, 1999; Thomas et al. Bioinformatics 16: 233-244, 2000) which is available at http://motif.stanford.edu/ematrix-maker.
IDENTIFYING AN HS CORE SEQUENCE
In a further aspect, the present invention provides a method for identifying an HS core sequence comprising the steps of: (a) providing a DNA sequence in the sense or antisense orientation that is not part of the plurality of HS core sequences; (b) providing an HS sequence; and (c) searching the DNA sequence for the presence of a hypersensitive restriction site.
The DNA sequence may be from a database of DNA sequences. Preferably, the DNA sequence is about 10 kb in length. The sequence may be converted to capitals and the ">" characters and whitespace may be removed. If the alignment in the sense orientation fails then the DNA sequence may be converted to an antisense orientation using routine methods known in the art. An HS sequence is provided which may be converted to capitals and the ">" characters and whitespace may be removed. Preferably, the HS sequence is between about 50 and about 200 nucleotides in length, for example the HS sequence may be about 50 nucleotides in length. The DNA sequence is then searched for the presence of a hypersensitive site - such as a hypersensitive enzyme site - in the sense or antisense orientation. The enzyme may be a sequence specific nuclease or a restriction enzyme including type I, II and III restriction enzymes. Preferably, the enzyme is a restriction enzyme - such as a restriction enzyme that recognises at least a 4 base pair (bp) target sequence. More preferably, the restriction enzyme is selected from the group consisting of DpnII, MboI, NlaIII, Sau3A and Tsp509I.
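The preparation and search steps described above may be sketched as follows. The function names are hypothetical. The sketch assumes that any line beginning with ">" is a FASTA-style header and discards it entirely, which is one interpretation of removing the ">" characters. The recognition sequences listed are the standard 4 bp sites of the named enzymes.

```python
import re

# Recognition sequences of the enzymes named in the text.
SITES = {
    "DpnII": "GATC",
    "MboI": "GATC",
    "Sau3A": "GATC",
    "NlaIII": "CATG",
    "Tsp509I": "AATT",
}

def clean_sequence(text):
    """Drop '>' header lines, remove whitespace and convert to capitals."""
    lines = [ln for ln in text.splitlines() if not ln.startswith(">")]
    return re.sub(r"\s+", "", "".join(lines)).upper()

def reverse_complement(seq):
    """Return the antisense orientation of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_sites(seq, site):
    """Return the start position of every (possibly overlapping) match."""
    return [m.start() for m in re.finditer("(?=" + site + ")", seq)]

def locate(dna, hs):
    """Find hs in dna, trying the sense and then the antisense orientation."""
    pos = dna.find(hs)
    if pos >= 0:
        return "sense", pos
    pos = reverse_complement(dna).find(hs)
    if pos >= 0:
        return "antisense", pos
    return None

dna = clean_sequence(">example header\nacgt gatc\nAATT\n")
gatc_positions = find_sites(dna, SITES["DpnII"])
aatt_positions = find_sites(dna, SITES["Tsp509I"])
```

An exact substring search is used here for simplicity; an alignment program could be substituted where mismatches between the HS sequence and the DNA sequence must be tolerated.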
APPLICATIONS
The in silico identification of constitutively active, and to a certain extent differentially regulated HSs, plays an important role in biomedical research by identifying the access points for artificial transcription factors (reviewed in Beerli and Barbas, 2002) to key regulatory genomic elements (Zhang et al, 2000; Liu et al, 2001) for numerous applications. Examples include the inhibition of transcription of mutated proto-oncogenes - such as erbB-2 and bcr-abl (cancer); activation of fetal haemoglobin (sickle cell anemia); growth hormone (dwarfism); and erythropoietin and vascular endothelial growth factor (cancer therapy, diabetes); and the regulation of human telomerase reverse transcriptase to control aging and cancer proliferation. Therefore, the ability to predict the locations of HSs by bioinformatic means as described herein has numerous implications for biotechnological and medical applications. Much of the current experimental work in biotechnology relates to the identification of various human gene regulatory sequences. By way of example, enhancers contain clusters of transcription factor binding sites, and can stimulate the activity of adjacent genes substantially. Enhancers also play an important role in directing tissue-specific gene expression programmes. Other gene regulatory regions, such as silencers, are involved in switching off the expression of nearby genes. Research carried out over the last two decades has established a strong link between the locations of enhancers and silencers and HSs (reviewed in Gross and Garrard, 1988; Bonifer, 2000). The ability to detect HSs using a bioinformatic approach in various eukaryotic genome sequences (especially the human genome) may have the potential to identify systematically numerous constitutive and tissue-specific enhancer and silencer sequences. These identified enhancer and silencer sequences may be suitable for numerous applications in gene therapy.
Constitutive enhancers that are active in a wide spectrum of cell types can be used to promote the expression of a target gene in any cell type. Tissue-specific enhancers provide an enhanced level of specificity and can be used to promote the expression of target genes in a particular tissue, or in a restricted range of cell types. On the other hand, silencers can be used to switch off the expression of unwanted genes (e.g. oncogenes) or to silence the expression of parasitic genomes (e.g. during viral infections). Examples to illustrate this approach can be found in Smith et al. (2000) and Phylactides et al. (2002). These researchers mapped a small number of HSs (using conventional experimental means) with the goal of identifying regulatory elements and enhancer regions that confer cell type-specific and correct temporal control of the expression of the cystic fibrosis transmembrane conductance regulator (CFTR) gene. The identified enhancers can be used to direct the tissue- and stage-specific expression of a synthetic CFTR gene in future gene therapeutic applications. In contrast, Harland et al. (2002) specifically set out to identify an enhancer that confers ubiquitous expression to adjacent genes. They successfully analysed the promoter of the universally expressed transcription factor TATA-binding protein (TBP) for the presence of a DNAase I hypersensitive site indicative of the location of such enhancers. The experimental approaches using prior art methods, as illustrated by Smith et al. (2000), Phylactides et al. (2002) and Harland et al. (2002), are capable of identifying a small number of HSs in a reliable manner, but are laborious and can only yield information concerning small regions of a genome. In contrast, bioinformatic tools for the prediction of HSs, as described herein, may be applied with ease to large regions of sequenced genomes to identify and select HS candidate sequences located near a multitude of different genes.
These candidate sequences can then be experimentally verified, thus providing substantial savings in cost and time.
Smith et al. (2000) suggest that the small number of HSs identified in their study can be used to screen patients with cystic fibrosis for mutations in these sequences. HSs identified by bioinformatic means near any gene known or suspected to cause genetic and other diseases may therefore be useful for the prediction of genomic regions that are important candidates for the development of diagnostic tests. Sequencing of such bioinformatically identified regions in genomes derived from normal individuals and patients will rapidly and efficiently identify such mutations and lead to possible therapeutic interventions (e.g. by gene therapy).
The identification of HSs by bioinformatic means also has numerous implications for transcription factor-based therapeutic applications. Ma et al. (2002) identified two transcription factors binding to DNA sequences present in a HS near the platelet-derived growth factor (PDGF)-A gene (implicated in tumorigenesis, metastasis and tumour progression). These transcription factors are repressors and are capable of diminishing the transcription of the PDGF-A gene. The identified transcription factors may play an important role in dampening the expression of this oncogenic growth factor. The bioinformatic identification of HSs (especially those located near disease-causing genes) may be the starting point of large scale screens to identify transcription factors capable of interacting with them. The bioinformatically identified HS core sequences can be chemically synthesised as single- and double-stranded oligonucleotides and used to isolate transcription factors binding to them (e.g. using the cDNA expression library screening approach used by Ma et al. (2002)). Alternatively, the oligonucleotides derived from predicted HSs may be used to prepare DNA affinity columns (Kadonaga and Tjian, 1986). The identified transcription factors may then be used as targets for the isolation and development of drugs capable of modulating their functional characteristics in a therapeutically useful manner.
Further applications for the bioinformatic prediction of HS locations can be seen in work leading towards the development of artificial transcription factors for therapeutic applications. Liu et al. (2001) and Zhang et al. (2000) used conventional experimental methods to map the locations of a small number of HSs near the vascular endothelial growth factor A (VEGF-A) and erythropoietin genes. The aim of their work was to identify genomic regions present in nuclease hypersensitive configuration that would be suitable for binding of transcription factors containing artificial DNA-binding domains capable of binding to the exposed DNA sequences. Bioinformatic prediction of such HS target sites will improve this process significantly.
Thus, HSs are an important regulatory access point for external agents to act upon the genome. It will be appreciated that the methods of the present invention have many applications in biotechnology and medicine, which may be carried out faster, and more economically using the methods described herein. The methods of the present invention are broadly applicable to all eukaryotic genomes.
NUCLEOTIDE SEQUENCE
As used herein, the term "nucleotide sequence" is synonymous with the term "polynucleotide".
Aspects of the present invention involve the use of nucleotide sequences, which are available in databases.
The nucleotide sequence may be DNA or RNA of genomic or synthetic or recombinant origin. The nucleotide sequence may be double-stranded or single-stranded whether representing the sense or antisense strand or combinations thereof.
The nucleotide sequence may be prepared by use of recombinant DNA techniques (e.g. recombinant DNA).
The nucleotide sequence may be the same as the naturally occurring form, or may be derived therefrom.
AMINO ACID SEQUENCE
As used herein, the term "amino acid sequence" is synonymous with the term "polypeptide" and/or the term "protein". In some instances, the term "amino acid sequence" is synonymous with the term "peptide". In some instances, the term "amino acid sequence" is synonymous with the term "protein".
Aspects of the present invention concern the use of amino acid sequences, which may be available in databases.
The amino acid sequence may be isolated from a suitable source, or it may be made synthetically or it may be prepared by use of recombinant DNA techniques.
VARIANTS/HOMOLOGUES/DERIVATIVES
The present invention encompasses the use of variants, homologues and derivatives of nucleotide sequences. Here, the term "homologue" means an entity having a certain homology with nucleotide sequences. Here, the term "homology" can be equated with "identity".
An homologous sequence is taken to include a nucleotide sequence which may be at least 75, 85 or 90% identical, preferably at least 95 or 98% identical to the subject sequence.
Homology comparisons can be conducted by eye, or more usually, with the aid of readily available sequence comparison programs. These commercially available computer programs can calculate % homology between two or more sequences.
% homology may be calculated over contiguous sequences, i.e. one sequence is aligned with the other sequence and each nucleotide in one sequence is directly compared with the corresponding nucleotide in the other sequence, one residue at a time. This is called an "ungapped" alignment. Typically, such ungapped alignments are performed only over a relatively short number of residues.
Although this is a very simple and consistent method, it fails to take into consideration that, for example, in an otherwise identical pair of sequences, one insertion or deletion will cause the following nucleotides to be put out of alignment, thus potentially resulting in a large reduction in % homology when a global alignment is performed. Consequently, most sequence comparison methods are designed to produce optimal alignments that take into consideration possible insertions and deletions without penalising unduly the overall homology score. This is achieved by inserting "gaps" in the sequence alignment to try to maximise local homology.
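The ungapped comparison, and its sensitivity to a single insertion, may be illustrated with the following minimal sketch. The function name and the example sequences are chosen for illustration only.

```python
def percent_identity_ungapped(seq_a, seq_b):
    """Compare two equal-length sequences position by position."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)

original = "ACGTACGTAC"
# One extra A inserted at the start (and the last residue dropped to keep
# the length equal) shifts every following nucleotide out of register.
shifted = "AACGTACGTA"

substituted = percent_identity_ungapped("ACGT", "ACGA")
after_insertion = percent_identity_ungapped(original, shifted)
```

A single substitution in four residues leaves 75% identity, whereas the single insertion leaves only the first position matching, which is the effect that gapped alignment methods are designed to avoid.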
However, these more complex methods assign "gap penalties" to each gap that occurs in the alignment so that, for the same number of identical nucleotides, a sequence alignment with as few gaps as possible - reflecting higher relatedness between the two compared sequences - will achieve a higher score than one with many gaps. "Affine gap costs" are typically used that charge a relatively high cost for the existence of a gap and a smaller penalty for each subsequent residue in the gap. This is the most commonly used gap scoring system. High gap penalties will of course produce optimised alignments with fewer gaps. Most alignment programs allow the gap penalties to be modified. However, it is preferred to use the default values when using such software for sequence comparisons. For example when using the GCG Wisconsin Bestfit package the default gap penalty for nucleotide sequences is -12 for a gap and -4 for each extension.
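An affine gap cost may be illustrated as follows. The sketch uses the values quoted above (-12 for a gap and -4 for each extension) and assumes, following the description above, that the opening cost covers the first residue of the gap and that the extension cost is charged for each subsequent residue. Individual programs may apply a different convention. The function names are hypothetical.

```python
def affine_gap_cost(length, gap_open=-12, gap_extend=-4):
    """Cost of one gap: opening cost plus extension for each later residue."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)

def total_gap_penalty(gap_lengths, gap_open=-12, gap_extend=-4):
    """Sum the affine cost over all gaps in an alignment."""
    return sum(affine_gap_cost(n, gap_open, gap_extend) for n in gap_lengths)

one_long_gap = total_gap_penalty([3])            # a single gap of 3 residues
three_short_gaps = total_gap_penalty([1, 1, 1])  # three separate 1-residue gaps
```

Under these assumptions, one gap of three residues is penalised less heavily than three separate single-residue gaps, so alignments with fewer gaps are favoured.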
Calculation of maximum % homology therefore firstly requires the production of an optimal alignment, taking into consideration gap penalties. A suitable computer program for carrying out such an alignment is the GCG Wisconsin Bestfit package (University of Wisconsin, U.S.A.; Devereux et al., 1984, Nucleic Acids Research 12:387). Examples of other software that can perform sequence comparisons include, but are not limited to, the BLAST package (see Ausubel et al, 1999 ibid - Chapter 18), FASTA (Altschul et al, 1990, J. Mol. Biol., 403-410) and the GENEWORKS suite of comparison tools. Both BLAST and FASTA are available for offline and online searching (see Ausubel et al, 1999 ibid, pages 7-58 to 7-60). However, for some applications, it is preferred to use the GCG Bestfit program. A new tool, called BLAST 2 Sequences, is also available for comparing nucleotide sequences (see FEMS Microbiol Lett 1999 174(2): 247-50; FEMS Microbiol Lett 1999 177(1): 187-8).
Although the final % homology can be measured in terms of identity, the alignment process itself is typically not based on an all-or-nothing pair comparison. Instead, a scaled similarity score matrix is generally used that assigns scores to each pairwise comparison based on chemical similarity or evolutionary distance. An example of such a matrix commonly used is the BLOSUM62 matrix - the default matrix for the BLAST suite of programs. GCG Wisconsin programs generally use either the public default values or a custom symbol comparison table if supplied (see user manual for further details). For some applications, it is preferred to use the public default values for the GCG package, or in the case of other software, the default matrix, such as BLOSUM62.
Once the software has produced an optimal alignment, it is possible to calculate % homology, preferably % sequence identity. The software typically does this as part of the sequence comparison and generates a numerical result.
Nucleotide sequences may include within them synthetic or modified nucleotides. A number of different types of modification to oligonucleotides are known in the art. These include methylphosphonate and phosphorothioate backbones and/or the addition of acridine or polylysine chains at the 3' and/or 5' ends of the molecule. Such modifications may be carried out to enhance the in vivo activity or life span of nucleotide sequences.
CONSTRUCTS
In a further aspect, the present invention relates to a method comprising the additional step of using the identified HS consensus sequences to prepare one or more nucleic acid constructs. The present invention also relates to a nucleic acid construct comprising one or more HS consensus sequences.
The term "construct" is synonymous with the term "vector" and includes expression vectors, transformation vectors and shuttle vectors.
The term "expression vector" means a construct capable of in vivo or in vitro expression.
The term "transformation vector" means a construct capable of being transferred from one entity to another entity - which may be of the same species or may be of a different species. If the construct is capable of being transferred from one species to another - such as from an Escherichia coli plasmid to a bacterium, such as of the genus Bacillus, then the transformation vector is sometimes called a "shuttle vector". It may even be a construct capable of being transferred from an E. coli plasmid to an Agrobacterium to a plant.
Vectors may be transformed into a suitable host cell as described below to provide for expression of a polypeptide encompassed in the present invention.
The vectors may be for example, plasmid, virus or phage vectors provided with an origin of replication, optionally a promoter for the expression of the said polynucleotide and optionally a regulator of the promoter.
Vectors may be used in vitro, for example for the production of RNA or used to transfect or transform a host cell.
Thus, polynucleotides for use in the present invention may be incorporated into a construct - such as a recombinant vector (typically a replicable vector), for example a cloning or expression vector. The vector may be used to replicate the nucleic acid in a compatible host cell. Thus, quantities of polynucleotides may be made by introducing a polynucleotide into a replicable vector, introducing the vector into a compatible host cell, and growing the host cell under conditions which bring about replication of the vector. The vector may be recovered from the host cell. Suitable host cells are described below in connection with expression vectors.
Genetically engineered host cells may be used to express an amino acid sequence (or variant, homologue, fragment or derivative thereof) in screening methods for the identification of agents and antagonists. Such genetically engineered host cells could be used to screen peptide libraries or organic molecules. Antagonists and agents such as antibodies, peptides or small organic molecules will provide the basis for pharmaceutical compositions.
ASSAY
In still a further aspect, the present invention relates to a method comprising the additional step of using the identified HS consensus sequences in an assay (or assay development program). Any one or more appropriate targets - such as a nucleotide sequence of an HS consensus sequence - may be used for identifying a chromatin modulating (e.g. modifying) agent according to the present invention.
The target employed in such a test may be free in solution, affixed to a solid support, borne on a cell surface, or located intracellularly. The abolition of target activity or the formation of binding complexes between the target and the chromatin modulating (e.g. modifying) agent being tested may be measured.
The methods of the present invention may be a screen, whereby a number of chromatin modulating (e.g. modifying) agents are tested.
Techniques for drug screening may be based on the method described in Geysen, European Patent Application 84/03564, published on September 13, 1984. In summary, large numbers of different small peptide test compounds are synthesized on a solid substrate, such as plastic pins or some other surface. The peptide test compounds are reacted with a suitable target or fragment thereof and washed. Bound entities are then detected - such as by appropriately adapting methods well known in the art. A purified target may also be coated directly onto plates for use in drug screening techniques. Alternatively, non-neutralising antibodies may be used to capture the peptide and immobilise it on a solid support.
It is expected that the methods of the present invention will be suitable for both small and large-scale screening of test compounds as well as in quantitative assays.
CHROMATIN MODULATING AGENT
As used herein, the term "chromatin modulating agent" may refer to a single entity or a combination of entities.
The chromatin modulating agent may be an organic compound or other chemical. The chromatin modulating agent may be a compound, which is obtainable from or produced by any suitable source, whether natural or artificial. The chromatin modulating agent may be an amino acid molecule, a polypeptide, or a chemical derivative thereof, or a combination thereof. The chromatin modulating agent may even be a polynucleotide molecule - which may be a sense or an anti-sense molecule. The chromatin modulating agent may even be an antibody.
The chromatin modulating agent may be designed or obtained from a library of compounds, which may comprise peptides, as well as other compounds, such as small organic molecules.
By way of example, the chromatin modulating (e.g. modifying) agent may be a natural substance, a biological macromolecule, or an extract made from biological materials such as bacteria, fungi, or animal (particularly mammalian) cells or tissues, an organic or an inorganic molecule, a synthetic agent, a semi-synthetic agent, a structural or functional mimetic, a peptide, a peptidomimetic, a derivatised agent, a peptide cleaved from a whole protein, a peptide prepared synthetically (such as, by way of example, either using a peptide synthesizer or by recombinant techniques) or combinations thereof, a recombinant agent, an antibody, a natural or a non-natural agent, a fusion protein or equivalent thereof and mutants, derivatives or combinations thereof.
The chromatin modulating (e.g. modifying) agent may be an organic compound. Typically the organic compounds may comprise two or more hydrocarbyl groups. Here, the term "hydrocarbyl group" means a group comprising at least C and H and may optionally comprise one or more other suitable substituents. Examples of such substituents may include halo-, alkoxy-, nitro-, an alkyl group, a cyclic group etc. In addition to the possibility of the substituents being a cyclic group, a combination of substituents may form a cyclic group. If the hydrocarbyl group comprises more than one C then those carbons need not necessarily be linked to each other. For example, at least two of the carbons may be linked via a suitable element or group. Thus, the hydrocarbyl group may contain hetero atoms. Suitable hetero atoms will be apparent to those skilled in the art and include, for instance, sulphur, nitrogen and oxygen. The chromatin modulating (e.g. modifying) agent may comprise at least one cyclic group. The cyclic group may be a polycyclic group, such as a non-fused polycyclic group. The chromatin modulating (e.g. modifying) agent may comprise at least one of said cyclic groups linked to another hydrocarbyl group. The chromatin modulating (e.g. modifying) agent may contain halo groups. Here, "halo" means halogen compounds e.g. halides and includes fluoro, chloro, bromo or iodo groups.
The chromatin modulating (e.g. modifying) agent may contain one or more of alkyl, alkoxy, alkenyl, alkylene and alkenylene groups - which may be unbranched- or branched-chain.
The chromatin modulating (e.g. modifying) agent may be in the form of a pharmaceutically acceptable salt - such as an acid addition salt or a base salt - or a solvate thereof, including a hydrate thereof. For a review on suitable salts see Berge et al, J. Pharm. Sci., 1977, 66, 1-19.
The chromatin modulating (e.g. modifying) agent of the present invention may be capable of displaying other therapeutic properties.
The chromatin modulating (e.g. modifying) agent may be used in combination with one or more other pharmaceutically active agents.
If combinations of active agents are administered, then they may be administered simultaneously, separately or sequentially.
PHARMACEUTICAL COMPOSITIONS
In another aspect, the present invention relates to a method comprising the additional step of using the identified HS consensus sequence(s) in a pharmaceutical (or in the preparation of or development of a pharmaceutical).
Pharmaceutical compositions useful in the present invention may comprise a therapeutically effective amount of chromatin modulating (e.g. modifying) agent(s) and pharmaceutically acceptable carrier, diluent or excipient (including combinations thereof).
Pharmaceutical compositions may be for human or animal usage in human and veterinary medicine and will typically comprise any one or more of a pharmaceutically acceptable diluent, carrier, or excipient. Acceptable carriers or diluents for therapeutic use are well known in the pharmaceutical art, and are described, for example, in Remington's Pharmaceutical Sciences, Mack Publishing Co. (A. R. Gennaro edit. 1985). The choice of pharmaceutical carrier, excipient or diluent may be selected with regard to the intended route of administration and standard pharmaceutical practice. Pharmaceutical compositions may comprise as - or in addition to - the carrier, excipient or diluent any suitable binder(s), lubricant(s), suspending agent(s), coating agent(s) or solubilising agent(s).
Preservatives, stabilizers, dyes and even flavoring agents may be provided in pharmaceutical compositions. Examples of preservatives include sodium benzoate, sorbic acid and esters of p-hydroxybenzoic acid. Antioxidants and suspending agents may be also used.
There may be different composition formulation requirements dependent on the different delivery systems. By way of example, pharmaceutical compositions useful in the present invention may be formulated to be administered using a mini-pump or by a mucosal route, for example, as a nasal spray or aerosol for inhalation or ingestible solution, or parenterally in which the composition is formulated in an injectable form, for delivery by, for example, an intravenous, intramuscular or subcutaneous route. Alternatively, the formulation may be designed to be administered by a number of routes.
Chromatin modulating (e.g. modifying) agents may also be used in combination with a cyclodextrin. Cyclodextrins are known to form inclusion and non-inclusion complexes with drug molecules. Formation of a drug-cyclodextrin complex may modify the solubility, dissolution rate, bioavailability and/or stability property of a drug molecule. Drug-cyclodextrin complexes are generally useful for most dosage forms and administration routes. As an alternative to direct complexation with the drug the cyclodextrin may be used as an auxiliary additive, e.g. as a carrier, diluent or solubiliser. Alpha-, beta- and gamma-cyclodextrins are most commonly used and suitable examples are described in WO-A-91/11172, WO-A-94/02518 and WO-A-98/55148.
If the chromatin modulating (e.g. modifying) agent is a protein, then said protein may be prepared in situ in the subject being treated. In this respect, nucleotide sequences encoding said protein may be delivered by use of non-viral techniques (e.g. by use of liposomes) and/or viral techniques (e.g. by use of retroviral vectors) such that the said protein is expressed from said nucleotide sequence.
STEREO AND GEOMETRIC ISOMERS
The chromatin modulating (e.g. modifying) agents may exist as stereoisomers and/or geometric isomers - e.g. they may possess one or more asymmetric and/or geometric centres and so may exist in two or more stereoisomeric and/or geometric forms. The present invention contemplates the use of the entire individual stereoisomers and geometric isomers of those chromatin modulating (e.g. modifying) agents, and mixtures thereof. The terms used in the claims encompass these forms, provided said forms retain the appropriate functional activity (though not necessarily to the same degree).
PHARMACEUTICAL SALT
The chromatin modulating (e.g. modifying) agent may be administered in the form of a pharmaceutically acceptable salt.
Pharmaceutically-acceptable salts are well known to those skilled in the art, and for example include those mentioned by Berge et al, in J. Pharm. Sci., 66, 1-19 (1977). Suitable acid addition salts are formed from acids which form non-toxic salts and include the hydrochloride, hydrobromide, hydroiodide, nitrate, sulphate, bisulphate, phosphate, hydrogenphosphate, acetate, trifluoroacetate, gluconate, lactate, salicylate, citrate, tartrate, ascorbate, succinate, maleate, fumarate, gluconate, formate, benzoate, methanesulphonate, ethanesulphonate, benzenesulphonate and p-toluenesulphonate salts.
When one or more acidic moieties are present, suitable pharmaceutically acceptable base addition salts can be formed from bases which form non-toxic salts and include the aluminium, calcium, lithium, magnesium, potassium, sodium, zinc, and pharmaceutically-active amine (such as diethanolamine) salts.
A pharmaceutically acceptable salt of a chromatin modulating (e.g. modifying) agent may be readily prepared by mixing together solutions of a chromatin modulating (e.g. modifying) agent and the desired acid or base, as appropriate. The salt may precipitate from solution and be collected by filtration or may be recovered by evaporation of the solvent. A chromatin modulating (e.g. modifying) agent may exist in polymorphic form.
A chromatin modulating (e.g. modifying) agent may contain one or more asymmetric carbon atoms and therefore exist in two or more stereoisomeric forms. Where a chromatin modulating (e.g. modifying) agent contains an alkenyl or alkenylene group, cis (Z) and trans (E) isomerism may also occur. The present invention includes the individual stereoisomers of a chromatin modulating (e.g. modifying) agent and, where appropriate, the individual tautomeric forms thereof, together with mixtures thereof.
Separation of diastereoisomers or cis- and trans-isomers may be achieved by conventional techniques, e.g. by fractional crystallisation, chromatography or H.P.L.C. of a stereoisomeric mixture of an agent or a suitable salt or derivative thereof. An individual enantiomer of a chromatin modulating (e.g. modifying) agent may also be prepared from a corresponding optically pure intermediate or by resolution, such as by H.P.L.C. of the corresponding racemate using a suitable chiral support or by fractional crystallisation of the diastereoisomeric salts formed by reaction of the corresponding racemate with a suitable optically active acid or base, as appropriate.
The present invention also encompasses all suitable isotopic variations of a chromatin modulating (e.g. modifying) agent or a pharmaceutically acceptable salt thereof. An isotopic variation of a chromatin modulating (e.g. modifying) agent or a pharmaceutically acceptable salt thereof is defined as one in which at least one atom is replaced by an atom having the same atomic number but an atomic mass different from the atomic mass usually found in nature. Examples of isotopes that may be incorporated into a chromatin modulating (e.g. modifying) agent and pharmaceutically acceptable salts thereof include isotopes of hydrogen, carbon, nitrogen, oxygen, phosphorus, sulphur, fluorine and chlorine such as 2H, 3H, 13C, 14C, 15N, 17O, 18O, 31P, 32P, 35S, 18F and 36Cl, respectively. Certain isotopic variations of a chromatin modulating (e.g. modifying) agent and pharmaceutically acceptable salts thereof, for example, those in which a radioactive isotope such as 3H or 14C is incorporated, are useful in drug and/or substrate tissue distribution studies. Tritiated, i.e., 3H, and carbon-14, i.e., 14C, isotopes are particularly preferred for their ease of preparation and detectability. Further, substitution with isotopes such as deuterium, i.e., 2H, may afford certain therapeutic advantages resulting from greater metabolic stability, for example, increased in vivo half-life or reduced dosage requirements and hence may be preferred in some circumstances. Isotopic variations of chromatin modulating (e.g. modifying) agents and pharmaceutically acceptable salts thereof can generally be prepared by conventional procedures using appropriate isotopic variations of suitable reagents.
It will be appreciated by those skilled in the art that a chromatin modulating (e.g. modifying) agent may be derived from a prodrug. Examples of prodrugs include entities that have certain protected group(s) and which may not possess pharmacological activity as such, but may, in certain instances, be administered (such as orally or parenterally) and thereafter metabolised in the body to form an agent of the present invention which is pharmacologically active.
It will be further appreciated that certain moieties known as "pro-moieties", for example as described in "Design of Prodrugs" by H. Bundgaard, Elsevier, 1985 (the disclosure of which is hereby incorporated by reference), may be placed on appropriate functionalities of chromatin modulating (e.g. modifying) agents. Such prodrugs are also included within the scope of the invention.
The present invention also includes the use of zwitterionic forms of a chromatin modulating (e.g. modifying) agent of the present invention. The terms used in the claims encompass one or more of the forms just mentioned.
PHARMACEUTICALLY ACTIVE SALT
A chromatin modulating (e.g. modifying) agent may be administered as a pharmaceutically acceptable salt. Typically, a pharmaceutically acceptable salt may be readily prepared by using a desired acid or base, as appropriate. The salt may precipitate from solution and be collected by filtration or may be recovered by evaporation of the solvent.
CHEMICAL SYNTHESIS METHODS
The chromatin modulating (e.g. modifying) agent may be prepared by chemical synthesis techniques. It will be apparent to those skilled in the art that sensitive functional groups may need to be protected and deprotected during synthesis of a chromatin modulating (e.g. modifying) agent. This may be achieved by conventional techniques, for example as described in "Protective Groups in Organic Synthesis" by T W Greene and P G M Wuts, John Wiley and Sons Inc. (1991), and by P.J.Kocienski, in "Protecting Groups", Georg Thieme Verlag (1994).
It is possible during some of the reactions that any stereocentres present could, under certain conditions, be racemised, for example if a base is used in a reaction with a substrate having an optical centre comprising a base-sensitive group. This is possible during e.g. a guanylation step. It should be possible to circumvent potential problems such as this by choice of reaction sequence, conditions, reagents, protection/deprotection regimes etc. as is well-known in the art.
The compounds and salts may be separated and purified by conventional methods.
Separation of diastereomers may be achieved by conventional techniques, e.g. by fractional crystallisation, chromatography or H.P.L.C. of a stereoisomeric mixture of a compound or a suitable salt or derivative thereof. An individual enantiomer of a compound may also be prepared from a corresponding optically pure intermediate or by resolution, such as by H.P.L.C. of the corresponding racemate using a suitable chiral support or by fractional crystallisation of the diastereomeric salts formed by reaction of the corresponding racemate with a suitable optically active acid or base.
The chromatin modulating (e.g. modifying) agent or variants, homologues, derivatives, fragments or mimetics thereof may be produced using chemical methods to synthesise the chromatin modulating (e.g. modifying) agent in whole or in part. For example, if the chromatin modulating (e.g. modifying) agent is a peptide, then the peptide can be synthesized by solid phase techniques, cleaved from the resin, and purified by preparative high performance liquid chromatography (e.g., Creighton (1983) Proteins Structures And Molecular Principles, WH Freeman and Co, New York NY). The composition of the synthetic peptides may be confirmed by amino acid analysis or sequencing (e.g., the Edman degradation procedure; Creighton, supra).
CHEMICAL DERIVATIVE
The term "derivative" or "derivatised" as used herein includes chemical modification of a chromatin modulating (e.g. modifying) agent. Illustrative of such chemical modifications would be replacement of hydrogen by a halo group, an alkyl group, an acyl group or an amino group.
CHEMICAL MODIFICATION
The chromatin modulating (e.g. modifying) agent may be a chemically modified agent.
The chemical modification of a chromatin modulating (e.g. modifying) agent may either enhance or reduce hydrogen bonding interaction, charge interaction, hydrophobic interaction, Van Der Waals interaction or dipole interaction.
In one aspect, the chromatin modulating (e.g. modifying) agent may act as a model (for example, a template) for the development of other compounds.
ADMINISTRATION
The present invention provides a method of modulating (e.g. modifying) chromatin structure in a subject comprising administering to the subject an effective amount of one or more chromatin modulating (e.g. modifying) agents identified according to the methods of the present invention.
The chromatin modulating (e.g. modifying) agents of the present invention may be administered alone but will generally be administered as a pharmaceutical composition comprising one or more components - e.g. when the components are in admixture with a suitable pharmaceutical excipient, diluent or carrier selected with regard to the intended route of administration and standard pharmaceutical practice. For example, the components may be administered (e.g. orally) in the form of tablets, capsules, ovules, elixirs, solutions or suspensions, which may contain flavouring or colouring agents, for immediate-, delayed-, modified-, sustained-, pulsed- or controlled-release applications.
If the pharmaceutical is a tablet, then the tablet may contain excipients such as microcrystalline cellulose, lactose, sodium citrate, calcium carbonate, dibasic calcium phosphate and glycine, disintegrants such as starch (preferably corn, potato or tapioca starch), sodium starch glycollate, croscarmellose sodium and certain complex silicates, and granulation binders such as polyvinylpyrrolidone, hydroxypropylmethylcellulose (HPMC), hydroxypropylcellulose (HPC), sucrose, gelatin and acacia. Additionally, lubricating agents such as magnesium stearate, stearic acid, glyceryl behenate and talc may be included.
Solid compositions of a similar type may also be employed as fillers in gelatin capsules. Preferred excipients in this regard include lactose, starch, a cellulose, milk sugar or high molecular weight polyethylene glycols. For aqueous suspensions and/or elixirs, the chromatin modulating (e.g. modifying) agent may be combined with various sweetening or flavouring agents, colouring matter or dyes, with emulsifying and/or suspending agents and with diluents such as water, ethanol, propylene glycol and glycerin, and combinations thereof.
The routes for administration (delivery) include, but are not limited to, one or more of: oral (e.g. as a tablet, capsule, or as an ingestable solution), topical, mucosal (e.g. as a nasal spray or aerosol for inhalation), nasal, parenteral (e.g. by an injectable form), gastrointestinal, intraspinal, intraperitoneal, intramuscular, intravenous, intrauterine, intraocular, intradermal, intracranial, intratracheal, intravaginal, intracerebroventricular, intracerebral, subcutaneous, ophthalmic (including intravitreal or intracameral), transdermal, rectal, buccal, vaginal, epidural, sublingual.
It is to be understood that not all of the components of the pharmaceutical need be administered by the same route. Likewise, if the composition comprises more than one active component, then those components may be administered by different routes. If a component is administered parenterally, then examples of such administration include one or more of: intravenously, intra-arterially, intraperitoneally, intrathecally, intraventricularly, intraurethrally, intrasternally, intracranially, intramuscularly or subcutaneously administering the component; and/or by using infusion techniques.
For parenteral administration, the component is best used in the form of a sterile aqueous solution which may contain other substances, for example, enough salts or glucose to make the solution isotonic with blood. The aqueous solutions should be suitably buffered (preferably to a pH of from 3 to 9), if necessary. The preparation of suitable parenteral formulations under sterile conditions is readily accomplished by standard pharmaceutical techniques well-known to those skilled in the art.
As indicated, the component(s) useful in the present invention may be administered intranasally or by inhalation and is conveniently delivered in the form of a dry powder inhaler or an aerosol spray presentation from a pressurised container, pump, spray or nebuliser with the use of a suitable propellant, e.g. dichlorodifluoromethane, trichlorofluoromethane, dichlorotetrafluoroethane, a hydrofluoroalkane such as 1,1,1,2-tetrafluoroethane (HFA 134A™) or 1,1,1,2,3,3,3-heptafluoropropane (HFA 227EA™), carbon dioxide or other suitable gas. In the case of a pressurised aerosol, the dosage unit may be determined by providing a valve to deliver a metered amount. The pressurised container, pump, spray or nebuliser may contain a solution or suspension of the active compound, e.g. using a mixture of ethanol and the propellant as the solvent, which may additionally contain a lubricant, e.g. sorbitan trioleate. Capsules and cartridges (made, for example, from gelatin) for use in an inhaler or insufflator may be formulated to contain a powder mix of the agent and a suitable powder base such as lactose or starch.
Alternatively, the component(s) may be administered in the form of a suppository or pessary, or it may be applied topically in the form of a gel, hydrogel, lotion, solution, cream, ointment or dusting powder. The component(s) may also be dermally or transdermally administered, for example, by the use of a skin patch. They may also be administered by the pulmonary or rectal routes. They may also be administered by the ocular route. For ophthalmic use, the compounds may be formulated as micronised suspensions in isotonic, pH adjusted, sterile saline, or, preferably, as solutions in isotonic, pH adjusted, sterile saline, optionally in combination with a preservative such as a benzylalkonium chloride. Alternatively, they may be formulated in an ointment such as petrolatum.
For application topically to the skin, the component(s) may be formulated as a suitable ointment containing the active compound suspended or dissolved in, for example, a mixture with one or more of the following: mineral oil, liquid petrolatum, white petrolatum, propylene glycol, polyoxyethylene polyoxypropylene compound, emulsifying wax and water. Alternatively, it may be formulated as a suitable lotion or cream, suspended or dissolved in, for example, a mixture of one or more of the following: mineral oil, sorbitan monostearate, a polyethylene glycol, liquid paraffin, polysorbate 60, cetyl esters wax, cetearyl alcohol, 2- octyldodecanol, benzyl alcohol and water.
The term "administered" also includes delivery by viral or non-viral techniques. Viral delivery mechanisms include but are not limited to adenoviral vectors, adeno-associated viral (AAV) vectors, herpes viral vectors, retroviral vectors, lentiviral vectors, and baculoviral vectors. Non-viral delivery mechanisms include lipid mediated transfection, liposomes, immunoliposomes, lipofectin, cationic facial amphiphiles (CFAs) and combinations thereof.
DOSE LEVELS
Typically, a physician will determine the actual dosage which will be most suitable for an individual subject. The specific dose level and frequency of dosage for any particular patient may be varied and will depend upon a variety of factors including the activity of the specific compound employed, the metabolic stability and length of action of that compound, the age, body weight, general health, sex, diet, mode and time of administration, rate of excretion, drug combination, the severity of the particular condition, and the individual undergoing therapy.
FORMULATION
The component(s) may be formulated into a pharmaceutical composition, such as by mixing with one or more of a suitable carrier, diluent or excipient, by using techniques that are known in the art.
GENERAL RECOMBINANT DNA METHODOLOGY TECHNIQUES
The present invention employs, unless otherwise indicated, conventional techniques of chemistry, molecular biology, microbiology, recombinant DNA and immunology, which are within the capabilities of a person of ordinary skill in the art. Such techniques are explained in the literature. See, for example, J. Sambrook, E. F. Fritsch, and T. Maniatis, 1989, Molecular Cloning: A Laboratory Manual, Second Edition, Books 1-3, Cold Spring Harbor Laboratory Press; Ausubel, F. M. et al. (1995 and periodic supplements; Current Protocols in Molecular Biology, ch. 9, 13, and 16, John Wiley & Sons, New York, N.Y.); B. Roe, J. Crabtree, and A. Kahn, 1996, DNA Isolation and Sequencing: Essential Techniques, John Wiley & Sons; M. J. Gait (Editor), 1984, Oligonucleotide Synthesis: A Practical Approach, Irl Press; and, D. M. J. Lilley and J. E. Dahlberg, 1992, Methods of Enzymology: DNA Structure Part A: Synthesis and Physical Analysis of DNA Methods in Enzymology, Academic Press. Each of these general texts is herein incorporated by reference.
The invention will now be further described by way of Examples, which are meant to serve to assist one of ordinary skill in the art in carrying out the invention and are not intended in any way to limit the scope of the invention.
EXAMPLES
Example 1
Providing a plurality of HS core sequences using 'Hypertag Display'.
HeLa S3 cells (obtained from the European Collection of Cell Cultures; ECACC Ref. No. 87110901) are grown to 80% confluency in 150 cm2 flasks at 37°C in Dulbecco's Minimal Essential Medium/10% newborn calf serum (Sigma) in a 5% CO2 humidified atmosphere. Before carrying out the procedure the appearance of cells is visually checked and their overall viability (>97%) assessed by trypan blue staining. After removing the medium the adherent cells are rinsed in Dulbecco's PBS (-Ca2+/Mg2+) and around 75% of the cells are detached by trypsin treatment. Isolation of nuclei is carried out using established protocols (Protocol 10.2; Carey and Smale, 1999). Briefly, 1.5 x 10^ cells are gently resuspended in 'NP40 lysis buffer' (10 mM Tris-HCl [pH 7.5]; 10 mM NaCl; 3 mM MgCl2; 0.5% NP-40; 0.15 mM spermine-tetrachloride; 0.15 mM spermidine-trichloride) and incubated on ice for 10 minutes. The nuclei are purified from the lysate by low speed centrifugation and washed once in 'Restriction enzyme buffer' (50 mM Tris-HCl [pH 8.0]; 100 mM NaCl; 3 mM MgCl2; 0.15 mM spermine-tetrachloride; 0.15 mM spermidine-trichloride). The purified nuclei are resuspended in 500 μl NEB Buffer 3 (100 mM NaCl, 50 mM Tris-HCl [pH 7.9], 10 mM MgCl2, 1 mM dithiothreitol) to yield a final volume of around 800 μl. Six 100 μl aliquots of the nuclei suspension are distributed into six separate microcentrifuge tubes which are subjected to the following treatments:
Reaction 1: no enzyme (negative control)
Reaction 2: 100 units Mbo I (recombinant; 5 units/μl; New England Biolabs)
Reaction 3: 50 units Mbo I
Reaction 4: 25 units Mbo I
Reaction 5: 10 units Mbo I
Reaction 6: 5 units Mbo I
Fragmentation of nuclease hypersensitive sites is initiated by incubating the reactions at 37°C for 10 minutes. Mbo I is suitable for such studies because it specifically recognizes a 4 bp target site (5' GATC 3') and produces a single-stranded protruding end that can be ligated directly to BamHI-cleaved cloning vectors (see below). Previous work has established that proteins up to 500 kDa can diffuse freely through the nucleus (Seksek et al., 1997), suggesting that Mbo I (which is around 30 kDa in size) can in principle access hypersensitive sites throughout the genome within the duration of this incubation step. The cleavage reaction is terminated by the addition of EDTA (10 mM final concentration), proteinase K (750 μg), and SDS (2% final concentration). After an overnight incubation at 37°C genomic DNA is extracted with phenol/chloroform, treated with RNase A, precipitated with ethanol and finally resuspended in TE (10 mM Tris-HCl [pH 8.0]; 1 mM EDTA).
There are at least two requirements for oligonucleotide adapters: i) they need a single-stranded cohesive end that will base pair specifically with the DNA fragment ends produced by the genomic restriction nuclease fragmentation reaction, and ii) they need a unique sequence that is different from any other sequence found in eukaryotic genomes and can therefore act as a specific primer binding site during the PCR reaction. With these criteria in mind, a BamHI adapter is designed as follows:
SEQ ID No.1
5'OH CGCCAGGGTTTTCCCAGTCACGAC 3'OH
The residue underlined corresponds to the 5'OH group of the top strand oligonucleotide that can be labeled with, for example, radioactive and/or fluorescent labels to facilitate the subsequent detection of the PCR products.
SEQ ID No.2
3'OH GCGGTCCCAAAAGGGTCAGTGCTGCTAG 5'OH
The residue underlined may be phosphorylated to ensure the formation of a covalent link between the adapter oligonucleotide and the cleaved genomic DNA during the ligation reaction.
The sequences shown in italics are derived from bacteriophage M13 and used extensively as a 'universal' sequencing primer in standard plasmids and bacteriophage cloning vectors. The motif does not display any obvious homologies to any sequenced eukaryotic genome and will therefore not cross-hybridize to endogenous loci during PCR reactions. Any other unique oligonucleotide sequence not occurring in the tested genome is suitable for this task.
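The two design requirements for the adapter can be checked mechanically. The following is a minimal sketch, not part of the described method: it takes the two strands as printed in SEQ ID Nos. 1 and 2 (with the first residue of the top strand written in upper case) and confirms that the bottom strand pairs with the top strand while leaving a 5' GATC 3' single-stranded extension compatible with Mbo I ends. The function names are illustrative assumptions.

```python
# Illustrative check of the BamHI adapter; function names are hypothetical.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def adapter_overhang(top, bottom):
    """Return the 5' single-stranded extension of the bottom strand.

    Both strands are given 5'->3'. The result is None when the remainder of
    the bottom strand is not the exact reverse complement of the top strand.
    """
    extra = len(bottom) - len(top)
    if extra < 0:
        return None
    overhang, paired = bottom[:extra], bottom[extra:]
    return overhang if paired == reverse_complement(top) else None


# SEQ ID No.1 (top strand), written 5'->3'.
TOP = "CGCCAGGGTTTTCCCAGTCACGAC"
# SEQ ID No.2 (bottom strand) is printed 3'->5', so reverse it to read 5'->3'.
BOTTOM = "GCGGTCCCAAAAGGGTCAGTGCTGCTAG"[::-1]

if __name__ == "__main__":
    print(adapter_overhang(TOP, BOTTOM))
```

Because GATC is its own reverse complement, the same extension anneals to either end of an Mbo I-cleaved genomic fragment.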
8 pmoles of the 'bottom strand' oligonucleotide of the BamHI adapter (containing the 5'GATC3' single-stranded extension) are phosphorylated in a final volume of 10 μl in a microcentrifuge tube at 37°C for 10 minutes in the presence of 2 μl of [γ-32P] rATP (New England Nuclear; 3000 Ci/mmol at 10 mCi/ml) and 10 units of T4 polynucleotide kinase (Promega). The reaction is stopped by the addition of 1 μl of 0.5 M EDTA.Na2 and 89 μl of TE buffer. The phosphorylated oligonucleotide is purified away from unincorporated rATP using a G25 microspin column (Amersham-Pharmacia). 8 pmoles of the 'top strand' oligonucleotide are added to the column eluate and, after heating to 75°C for 5 minutes, the mixture is allowed to cool to room temperature to anneal the two strands. 2 μl of the annealed BamHI adapter is ligated to 1 μg of hypersensitive site Mbo I-cleaved genomic DNA at 16°C for 1 hour in a final volume of 10 μl (note that due to the small number of cleaved Mbo I sites in the genomic DNA the adapter is likely to be in substantial excess and therefore no alkaline phosphatase treatment of the genomic DNA is required to prevent ligation of genomic fragments to each other). 1 μl of the ligation reaction containing the adapter-tagged DNA is used for each PCR reaction.
PCR reactions are carried out in a total reaction volume of 50 μl with a Stratagene Robocycler using the MβP 'Easy Start' system (obtained from Merck). Amplification is carried out for 40 cycles (45 seconds at 95°C; 45 seconds at 55°C; 1 minute at 72°C). 10 μl of each PCR reaction are analyzed on a 0.7% agarose/TBE gel.
The application of this methodology to nuclei purified from the human HeLa cell line results in a data set of 53 HS core sequences comprising the sequences set forth in Figure 9.
Example 2
Identification of HS consensus sequences.
The HS data set identified in Example 1 and shown in Figure 9 is analysed for the presence of HS consensus sequences using the computer program YEBIS (www-scc.jst.go.jp/YEBIS/MotifExtraction/; Yada et al., 1998) with default parameters. 17 different sequences, ranging in length from 7 to 13 nucleotides, are identified (Figure 10).
Some of the sequences (e.g. 'Motif 1') are relatively short, but are present as identical copies in several different HS sequences. Other sequences are substantially longer, display some degree of variability, but consensus sequences with highly conserved residues in particular positions emerge clearly in all cases.
The HS data set was also processed using the MOTIFSAMPLER algorithm incorporating a higher-order background model (www.esat.kuleuven.ac.be/~thijs/Work/MotifSampler.html; Thijs et al, 2001). This program differs from YEBIS because the length of the motifs and number of detected motifs can be entered as part of the search criteria. Analysis of the HS data set was carried out by specifying the expected lengths of motifs as 8, 12 and 15, respectively, in three independent runs. Again, motifs shared by different members of the HS data set were successfully identified. The MOTIFSAMPLER motifs of length 12 appeared to be most effective. The MOTIFSAMPLER output is shown in Figure 11.
This analysis illustrates that it is possible to extract motif sets that are shared by different members of the HS data set.
Example 3
Determining the variability in the consensus sequences.
The variability in the various consensus sequences identified in Example 2 is encoded as 'position-specific scoring matrices' ('PSSMs'; Frech et al, 1997). The PSSMs derived from the YEBIS data are shown in Figure 12. The PSSMs are rearranged in a different format from the aligned sequences using a custom-written PERL script, YEBIS-MATRIX. The resulting PSSMs are listed in Figure 13.
The PSSMs obtained using the data obtained from MOTIFSAMPLER are shown in Figure 11.
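As an illustration of how aligned motif instances can be encoded as a PSSM, the following minimal sketch builds a matrix of per-position base frequencies with a pseudocount. It is not the YEBIS-MATRIX script, whose details are not given here; the function name, the frequency representation and the example instances are all assumptions made for illustration.

```python
from collections import Counter

BASES = "ACGT"


def frequency_matrix(instances, pseudocount=1.0):
    """Build a position-specific matrix from equal-length aligned motif instances.

    Each column is a dict mapping base -> smoothed frequency; the pseudocount
    prevents zero probabilities for bases not observed at a position.
    """
    if not instances:
        raise ValueError("at least one motif instance is required")
    length = len(instances[0])
    if any(len(site) != length for site in instances):
        raise ValueError("all motif instances must have the same length")
    total = len(instances) + pseudocount * len(BASES)
    matrix = []
    for pos in range(length):
        counts = Counter(site[pos] for site in instances)
        matrix.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return matrix


if __name__ == "__main__":
    # Hypothetical aligned instances, not taken from Figure 10 or Figure 11.
    for column in frequency_matrix(["ACGT", "ACGA", "ACGT"]):
        print({base: round(freq, 3) for base, freq in column.items()})
```

A highly conserved position gives one dominant frequency in its column, while a variable position spreads the weight over several bases.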
Example 4
Identification of HS consensus sequences in other DNA sequences.
DNA sequences are analysed for high-density clusters of consensus sequences identified in the HS data set using PSSMs based on the YEBIS and MOTIFSAMPLER results (described above). The relatively small HS data set available at this time does not yet allow the definition of high-resolution stochastic models to optimise the recognition rate of HSs. As a first approximation, however, the program CISTER that incorporates a hidden Markov model to enhance the rate of detection of biologically significant cis-element clusters (Frith et al, 2001) is used. The HS-PSSMs are fed into the program CISTER (sullivan.bu.edu/~mfrith/cister.shtml) to locate the occurrence of similar motif clusters in new DNA sequences that were not part of the initial training set.
The resulting program identifies with a high degree of accuracy a number of constitutive HSs present in viral and cellular gene promoters. This result indicates that the prediction of novel HSs is feasible using information extracted from our HS data set.
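The idea of locating high-density motif clusters can be illustrated with a much simpler stand-in for CISTER. The sketch below does not implement CISTER's hidden Markov model; it assumes a PSSM given as a list of per-position base-frequency dicts, scores every position of a sequence by log-odds against a uniform background, and counts hits per fixed window. The function names, threshold and window size are illustrative assumptions.

```python
import math


def log_odds_score(pssm, site, background=0.25):
    """Sum of log2(frequency / background) over the positions of a candidate site."""
    return sum(math.log2(col[base] / background) for col, base in zip(pssm, site))


def motif_hits(pssm, sequence, threshold):
    """Return start positions whose log-odds score reaches the threshold."""
    width = len(pssm)
    return [
        start
        for start in range(len(sequence) - width + 1)
        if log_odds_score(pssm, sequence[start:start + width]) >= threshold
    ]


def hits_per_window(hits, seq_len, window):
    """Count hits in consecutive, non-overlapping windows along the sequence."""
    counts = [0] * ((seq_len + window - 1) // window)
    for start in hits:
        counts[start // window] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical 4-column matrix strongly favouring GATC.
    toy = [{b: (0.85 if b == m else 0.05) for b in "ACGT"} for m in "GATC"]
    seq = "AAGATCAAAAAAGATCGATC"
    hits = motif_hits(toy, seq, threshold=6.0)
    print(hits, hits_per_window(hits, len(seq), window=10))
```

Windows with many hits from several different PSSMs would be the candidate clusters; CISTER replaces this crude counting with a probabilistic model of motif, intra-cluster and background states.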
Example 5
Identification of HS sequence motifs in model systems.
The following experimentally studied model systems are analysed to investigate the performance of the algorithm of the present invention in more detail. In each case a set of 12 MOTIFSAMPLER PSSMs of length 12 is used with CISTER with the following settings (a=10; b=6; g=3000; w=1000; motif probability threshold=0.1; pseudocount=1).
Human β-globin constitutive HS5 (Genbank Accession No. AF064190).
A single strong signal is detected which coincides precisely with the experimentally mapped constitutive HS5 (Figure 2; Dhar et al, 1990).
Mouse mammary tumour virus 3 ' long terminal repeat (MMTV-3' LTR; Genbank Accession No. MMTPRO).
Both experimentally mapped HSs, including a constitutive and a glucocorticoid-inducible site, are reliably detected (Figure 3; Zaret and Yamamoto, 1984).
Human vascular endothelial growth factor A promoter (Genbank Accession No. AF005785).
A strong signal is detected at position 2600 (Figure 4). The experimental data shows the presence of two HSs in this area (Liu et al., 2001). The presence of a single broad peak suggests that in some cases the clustering algorithm of CISTER causes artifactual merging of motif clusters from adjacent HSs. Also, two other experimentally mapped sites are only weakly detected. This indicates that the currently used PSSMs are not yet capable of detecting all sites present. Use of a larger HS data training set for motif extraction will probably overcome this limitation in future.
The performance of the system with larger sequences is investigated by analysing the human c-myc and erythropoietin loci. In both cases HSs in the immediate neighborhood of each gene have been mapped experimentally, but no mapping has been carried out in the surrounding area. Because they combine studied and unstudied DNA, these sequences give an impression of the overall signal-to-noise ratio of the system. The CISTER parameters used are the same as quoted above, with the sole exception of g=30000 for c-myc to take the extra length of the sequence into account.
Human erythropoietin (embedded in 13 kb of human genome sequence downloaded from the NCBI server at www.ncbi.nlm.nih.gov/genome/guide/human).
Strong signals are detected from an experimentally mapped regulated 5' located HS and from two HSs located at the 3' end of the gene (Figure 5; Zhang et al., 2000). As with the third model system above (the vascular endothelial growth factor A promoter), some merging of the predicted HS signals from the two separate 3' HSs is observed in the computer prediction due to the CISTER algorithm. The program also shows a HS signal within the transcribed region, which is compatible with the experimentally observed emergence of hypersensitivity of the gene at the onset of active expression.
Human c-myc (embedded in 55 kb of human genome sequence downloaded from the NCBI server at www.ncbi.nlm.nih.gov/genome/guide/human).
This is the largest region analysed (Figure 6). HSs surrounding the 5' and 3' end of the c-myc gene are reliably detected (Mautner et al, 1995). There are some additional strong signals in the surrounding regions for which no experimental data is available. The results indicate the high signal/noise ratio achievable with the current system.
Example 6
Uniformity of the HS sequence data set
The DNA sequence tested consists of a continuous string of the 59 HS core sequences (HS Core Sequences Group A & B), preceded by the same sequence randomised (10 bp Randomized HS Core Sequences) (Figure 7). This procedure creates a test sequence which contains two halves: the first half is random (and thus should lack HS-specific motifs), whereas the other half is 'packed' with all the HS sequences. The randomisation is carried out in blocks of 10 bp, so any local variation in base composition is preserved, whereas regulatory motifs get effectively scrambled.
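The construction of such a test sequence can be sketched as follows. This is a minimal illustration under one stated assumption: that "randomisation in blocks of 10 bp" means shuffling the bases within each consecutive 10 bp block, which keeps the base composition of every block unchanged. The function names and the short example sequences are hypothetical.

```python
import random


def shuffle_in_blocks(sequence, block=10, seed=None):
    """Shuffle bases within each consecutive block, preserving local composition."""
    rng = random.Random(seed)
    shuffled = []
    for start in range(0, len(sequence), block):
        chunk = list(sequence[start:start + block])
        rng.shuffle(chunk)
        shuffled.append("".join(chunk))
    return "".join(shuffled)


def build_test_sequence(core_sequences, block=10, seed=None):
    """Concatenate the core sequences and prepend their block-shuffled copy."""
    packed = "".join(core_sequences)
    return shuffle_in_blocks(packed, block, seed) + packed


if __name__ == "__main__":
    # Hypothetical short stand-ins for the HS core sequences of Figure 9.
    cores = ["GATCACGTTGCA", "GATCTTTTGGGCCCAA"]
    print(build_test_sequence(cores, block=10, seed=0))
```

The first half of the returned string is the scrambled control and the second half is the intact string of core sequences, so the two halves always have equal length.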
PSSMs are compiled from two non-overlapping subsets of HS core sequences: motifs are separately derived from 'Group A' and 'Group B'. This has two purposes:
1. By compiling the PSSMs from a subset of data it is determined whether the resulting PSSMs can also recognise sequences from the other subset of the collection, which is not used. If this is the case, it shows that the two collections share common motifs and thus provides strong support for the idea that the collection contains a structurally (rather than randomly) similar set of sequences.
2. The signal to noise ratio is assessed by comparing the specific signal obtained in the right half against the randomised sequence.
Studies like this provide a preliminary insight into the amount of sequence data required to compile reliable PSSMs. This is shown by the observation that the PSSMs derived from collection A (containing only 22 HS sequences) give only a relatively low degree of specificity (compare the signal in the dark blue or dark grey area with the background signal in the red or medium grey area). If the PSSMs are derived from the larger subset (collection B with 37 sequences) then there is a reduction of noise and a more consistent recognition of the HS sequences in both collections is obtained. If the PSSMs are derived from all 59 sequences there is little background noise in the randomised area and a high signal in most areas of the right half of the sequence containing the combined HSs. Tests such as this can be used to assess the uniformity of the HS sequence data set. They also show that there is a distinct increase in the signal to noise ratio when larger data sets are used to compute the PSSMs.
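The comparison between the two halves of the test sequence can be put in numerical form. The sketch below is one possible measure, not one defined in this specification: it assumes a list of per-position prediction scores along the test sequence and takes the ratio of the mean score in the HS half to the mean score in the randomised half. The function name and the example scores are hypothetical.

```python
def signal_to_noise(scores, boundary):
    """Ratio of the mean score after `boundary` (HS half) to the mean before it.

    `scores` holds one prediction score per position of the test sequence and
    `boundary` is the index at which the randomised half ends.
    """
    noise, signal = scores[:boundary], scores[boundary:]
    if not noise or not signal:
        raise ValueError("both halves must contain at least one score")
    mean_noise = sum(noise) / len(noise)
    mean_signal = sum(signal) / len(signal)
    return float("inf") if mean_noise == 0 else mean_signal / mean_noise


if __name__ == "__main__":
    # Hypothetical scores: low in the randomised half, higher in the HS half.
    example = [0.1, 0.1, 0.2, 0.0, 0.6, 0.8, 0.7, 0.5]
    print(signal_to_noise(example, boundary=4))
```

Computing this ratio for PSSMs derived from collection A, collection B and all sequences in turn would give a single figure for each training set that can be compared directly.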
Conclusions
Using the methods of the present invention, it is possible to predict the position of HSs using in silico tools. Using a sufficient number of DNA sequences known to be in hypersensitive configuration, it is possible to identify common motifs that occur in HSs, even in the absence of precise knowledge regarding the functional role of these motifs in HS architecture. Similar approaches have been used to predict (with various degrees of success) the presence of other sequences involved in regulating gene expression, such as promoters (e.g. Werner et al., 1999; Hannenhalli & Levy, 2001), nuclear scaffold/matrix attachment regions ('S/MARs'; Frisch et al., 2001), or stress-induced duplex destabilization ('SIDD') regions (Benham et al., 1997; Mielke et al., 2002).
Thus, computer-based rules are defined from a set of newly identified HSs that enable the occurrence and positions of HSs in DNA sequences to be predicted with a high degree of accuracy. Further increases in the size of the proprietary HS database will help to refine the search for HS consensus sequences and improve the reliability of this approach even further. Accordingly: the positions of HSs can now be predicted using bioinformatic tools, rather than exclusively experimental tools; it may be possible to define the HS consensus sequences with even higher accuracy, and to identify further HS consensus sequences, once a larger HS data set (i.e. a larger number of HS sequences) is available; and bioinformatic tools can be applied to predict the presence and positions of HSs near key genes for biotechnological and medical interventions.
The current data set and software demonstrate the feasibility of the present invention. At this stage, only a limited amount of 'raw' information is available. Using larger data sets (i.e. hundreds/thousands of HS core sequences) will almost certainly allow the HS consensus sequences to be identified more clearly in the future. At present, unknown parameters are used in the identification of HS consensus sequences and in the use of the identified motifs in the prediction of HSs in unknown genome sequences.
In accordance with the present invention, it is possible to link the HS-prediction tools of the present invention with other bioinformatic programs designed to detect other gene-regulatory elements. A combination of these may result in an even more comprehensive view of gene regulation, with biomedical and biotechnological implications.
Summary
Thus, certain aspects of the present invention relate to:
A computer automated method for identifying one or more HS consensus sequences comprising the steps of: (a) providing a plurality of HS core sequences; (b) using a search algorithm to search for a plurality of motifs that are shared by the HS core sequences; and (c) returning one or more HS consensus sequences comprising a plurality of motifs identified in step (b).
A computer automated method for identifying one or more HS sequences comprising the steps of: (a) identifying a plurality of HS core sequences; (b) using a search algorithm to search for a plurality of motifs that are shared by the plurality of HS core sequences; (c) returning one or more HS consensus sequences comprising a plurality of the motifs identified in step (b); and (d) searching for one or more HS sequences comprising one or more HS consensus sequences.
A computer automated method for identifying an HS core sequence comprising the steps of: (a) providing a DNA sequence in the sense or antisense orientation that is not part of the plurality of HS core sequences; (b) providing an HS sequence; and (c) searching the DNA sequence for the presence of a hypersensitive restriction site.
All publications mentioned in the above specification are herein incorporated by reference. Various modifications and variations of the described methods and system of the invention will be apparent to those skilled in the art without departing from the scope and spirit of the invention. Although the invention has been described in connection with specific preferred embodiments, it should be understood that the invention as claimed should not be unduly limited to such specific embodiments. Indeed, various modifications of the described modes for carrying out the invention which are obvious to those skilled in molecular biology or related fields are intended to be within the scope of the following claims.
REFERENCES
Beerli, R.R., and Barbas, C.F. (2002). Engineering polydactyl zinc-finger transcription factors. Nat. Biotech. 20, 135-141.
Benham, C.J., Kohwi-Shigematsu, T., and Bode, J. (1997). Stress-induced duplex DNA destabilization in scaffold/matrix attachment regions. J. Mol. Biol. 274, 181-196.
Bonifer, C. (2000). Developmental regulation of eukaryotic gene loci. Trends Genet. 16, 310-315.
Carey, M., and Smale, S.T. (1999). Transcriptional Regulation in Eukaryotes: Concepts, Strategies and Techniques. Cold Spring Harbor Laboratory Press. Cold Spring Harbor, New York.
Dhar, V., Nandi, A., Schildkraut, C.L., and Skoultchi, A.I. (1990). Erythroid-specific nuclease-hypersensitive sites flanking the human β-globin domain. Mol. Cell. Biol. 10, 4324-4333.
Frech, K., Quandt, K., and Werner, T. (1997). Finding protein-binding sites in DNA sequences: the next generation. Trends Biochem. Sci. 22, 103-104.
Frisch, M., Frech, K., Klingenhoff, A., Cartharius, K., Liebich, I., and Werner, T. (2001). In silico prediction of scaffold/matrix attachment regions in large genomic sequences. Genome Res. 12, 349-354.
Frith, M.C., Hansen, U., and Weng, Z. (2001). Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics 17, 878-889.
Gasser, S.M., and Laemmli, U.K. (1986). Cohabitation of scaffold binding regions with upstream/enhancer elements of three developmentally regulated genes of D. melanogaster. Cell 46, 521-530.
Gross, D.S., and Garrard, W.T. (1988). Nuclease hypersensitive sites in chromatin. Ann. Rev. Biochem. 57, 159-197.
Hannenhalli, S., and Levy, S. (2001). Promoter prediction in the human genome. Bioinformatics 17, S90-S96 (Supplement).
Huber, M.C., Graf, T., Sippel, A.E., and Bonifer, C. (1995). Dynamic changes in the chromatin of the chicken lysozyme gene during differentiation of multipotent progenitors to macrophages. DNA Cell Biol. 14, 397-402.
Kadonaga, J.T., and Tjian, R. (1986). Affinity purification of sequence-specific DNA binding proteins. Proc. Natl. Acad. Sci. USA 83, 5889-5893.
Kontaraki, J., Chen, H.-H., Riggs, A., and Bonifer, C. (2000). Chromatin fine structure profiles for a developmentally regulated gene: reorganization of the lysozyme locus before trans-activator binding and gene expression. Genes Dev. 14, 2106-2122.
Liu, Y., and Beveridge, D.L. (2001). A refined prediction model for gel retardation of DNA oligonucleotides from dinucleotide step parameters: reconciliation of DNA bending models with crystal structure data. J. Biomol. Struct. Dyn. 18, 505-526.
Liu, P.-Q., Rebar, E.J., Zhang, L., Liu, Q., Jamieson, A.C., Liang, Y., Qi, H., Li, P.-X., Chen, B., Mendel, M.C., Zhong, X., Lee, Y.-L., Eisenberg, S.P., Spratt, S.K., Case, C.C., and Wolffe, A.P. (2001). Regulation of an endogenous locus using a panel of designed zinc finger proteins targeted to accessible chromatin regions. J. Biol. Chem. 276, 11323-11334.
Ma, D., Xing, Z., Liu, B., Pedigo, N.G., Zimmer, S.G., Bai, Z., Postel, E.H., and Kaetzel, D.M. (2002). NM23-H1 and NM23-H2 repress transcriptional activities of nuclease-hypersensitive elements in the platelet-derived growth factor-A promoter. J. Biol. Chem. 277, 1560-1567.
Mautner, J., Joos, S., Werner, T., Eick, D., Bornkamm, G.W., and Polack, A. (1995). Identification of two enhancer elements downstream of the human c-myc gene. Nucl. Acids Res. 23, 72-80.
Mielke, C., Maass, K., Tümmler, M., and Bode, J. (1996). Anatomy of highly expressing chromosomal sites targeted by retroviral vectors. Biochemistry 35, 2239-2252.
Mielke, C., Christensen, M.O., Westergaard, O., Bode, J., Benham, C.J., and Breindl, M. (2002). Multiple collagen I gene regulatory elements have sites of stress-induced DNA duplex destabilization and nuclear scaffold/matrix association potential. J. Cell. Biochem. 84, 484-496.
Nedospasov, S.A., and Georgiev, G.P. (1980). Non-random cleavage of SV40 DNA in the compact minichromosome and free in solution by micrococcal nuclease. Biochem. Biophys. Res. Commun. 92, 532-539.
Phylactides, M., Rowntree, R., Nuthall, H., Ussery, D., Wheeler, A., and Harris, A. (2002). Evaluation of potential regulatory elements identified as DNase I hypersensitive sites in the CFTR gene. Eur. J. Biochem. 269, 553-559.
Seksek, O., Biwersi, J., and Verkman, A.S. (1997). Translational diffusion of macromolecule-sized solutes in cytoplasm and nucleus. J. Cell Biol. 138, 131-142.
Smith, D.J., Nuthall, H.N., Majetti, M.E., and Harris, A. (2000). Multiple potential intragenic regulatory elements in the CFTR gene. Genomics 64, 90-96.
Stalder, J., Larsen, A., Engel, J.D., Dolan, M., Groudine, M., and Weintraub, H. (1980). Tissue-specific cleavage in the globin chromatin domain introduced by DNAase I. Cell 20, 451-460.
Svetlova, E., Avril-Fournout, N., Ira, G., Deschavanne, P., and Filipski, J. (1998). DNase-hypersensitive sites in yeast artificial chromosomes containing human DNA. Mol. Gen. Genet. 257, 292-298.
Thijs, G., Lescot, M., Marchal, K., Rombauts, S., De Moor, B., Rouze, P., and Moreau, Y. (2001). A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics 17, 1113-1122.
Werner, T. (1999). Models for prediction and recognition of eukaryotic promoters. Mamm. Genome 10, 168-175.
Wolffe, A.P., and Hayes, J.J. (1999). Chromatin disruption and modification. Nucl. Acids Res. 27, 711-720.
Wu, C. (1980). The 5' ends of Drosophila heat shock genes in chromatin are hypersensitive to DNAase I. Nature 286, 854-860.
Yada, T., Totoki, Y., Ishikawa, M., Asai, K., and Nakai, K. (1998). Automatic extraction of motifs represented in the hidden Markov model from a number of DNA sequences. Bioinformatics 14, 317-325.
Zhang, L., Spratt, S.K., Liu, Q., Johnstone, B., Qi, H., Raschke, E.E., Jamieson, A.C., Rebar, E.J., Wolffe, A.P., and Case, C.C. (2000). Synthetic zinc finger transcription factor action at an endogenous chromosomal site. J. Biol. Chem. 275, 33850-33860.
Zaret, K.S., and Yamamoto, K.R. (1984). Reversible and persistent changes in chromatin structure accompany activation of a glucocorticoid-dependent enhancer element. Cell 38, 29-38.

Claims

1. A method for identifying one or more HS consensus sequences comprising the steps of: (a) providing a plurality of HS core sequences; (b) using a search algorithm to search for a plurality of motifs that are shared by the HS core sequences; and (c) returning one or more HS consensus sequences comprising a plurality of motifs identified in step (b).
2. A method according to claim 1 wherein the search algorithm includes a statistical model such as a Gibbs-statistical model, a Markov-statistical model, a Gaussian-statistical model, a Poisson-statistical model and a Monte Carlo-statistical model.
3. A method according to claim 2 wherein the search algorithm comprises a word counting method and/or a probabilistic method.
4. A method according to any one of the preceding claims wherein the HS consensus sequences are returned as a regular expression or a sequence logo.
5. A method according to any one of the preceding claims wherein the HS consensus sequences are returned as a weight matrix.
6. A method according to claim 5 wherein the weight matrix is a position specific scoring matrix (PSSM).
7. A method according to claim 6 wherein returning the HS consensus sequences as a PSSM comprises the step of computing a score for finding a matching sequence in the plurality of HS consensus sequences.
8. A method according to any one of the preceding claims wherein the plurality of HS core sequences are identified using Global Analysis of Chromatin Topology or Hypergenomic Display.
9. A method for identifying one or more HS sequences comprising the steps of:
(a) identifying a plurality of HS core sequences;
(b) using a search algorithm to search for a plurality of motifs that are shared by the plurality of HS core sequences;
(c) returning one or more HS consensus sequences comprising a plurality of the motifs identified in step (b); and
(d) searching for one or more HS sequences comprising the HS consensus sequence.
10. A method according to claim 9 wherein step (c) returns the HS consensus sequences as a PSSM comprising the steps of:
(i) providing a plurality of HS consensus sequences; and
(ii) computing the score for finding a matching sequence in the plurality of HS consensus sequences; and
(iii) identifying HS sequences in one or more DNA sequences that were not part of the plurality of HS core sequences using the PSSMs identified.
11. A method according to claims 9 and 10 wherein the search algorithm is a word counting method or a probabilistic method.
12. A method according to any one of claims 9 to 11 wherein the search algorithm includes a statistical model such as a Gibbs-statistical model, a Markov-statistical model, a Gaussian-statistical model, a Poisson-statistical model and a Monte Carlo-statistical model.
13. A method according to any one of claims 9 to 12 wherein the HS consensus sequences are returned as a regular expression or a sequence logo.
14. A method according to any one of claims 9 to 13 wherein the HS consensus sequences are returned as a weight matrix.
15. A method according to claim 14 wherein the weight matrix is a position specific scoring matrix (PSSM).
16. A method according to claim 15 wherein returning the HS consensus sequences as a PSSM comprises the step of computing a score for finding a matching sequence in the plurality of HS consensus sequences.
17. A method according to any one of claims 9 to 16 wherein the plurality of HS core sequences are identified using Global Analysis of Chromatin Topology or Hypergenomic Display.
18. A method according to any one of claims 9 to 17, wherein the DNA sequences are from a database of DNA sequences.
19. A method according to any one of claims 9 to 18 wherein one or more HS sequences comprising the HS consensus sequences are searched by searching for clusters of cis-elements.
20. A method according to claim 19 wherein the most probable arrangement of cis-elements in the cluster is integrated using the Viterbi algorithm.
21. A method according to claim 20 wherein a forward-backward algorithm to consider the sum of all paths through a hidden Markov model is used.
22. A method for identifying an HS core sequence comprising the steps of:
(a) providing a DNA sequence in the sense or antisense orientation that is not part of the plurality of HS core sequences;
(b) providing an HS sequence; and
(c) searching the DNA sequence for the presence of a hypersensitive restriction site.
23. A method according to claim 22 wherein the HS sequence is between about 50 nucleotides to about 200 nucleotides in length.
24. A method according to claim 23 wherein the DNA sequence is 10 kb in length.
25. A method according to any one of the preceding claims comprising the additional step of using the identified HS consensus sequences or identified HS sequences to prepare a nucleic acid construct.
26. A method according to any one of the preceding claims comprising the additional step of using the identified HS consensus sequences or HS sequences in an assay (or assay development program) and/or a pharmaceutical (or in the preparation of or development of a pharmaceutical).
27. A method of treating a disease associated with chromatin structure in a subject, the method comprising administering to the subject an effective amount of a chromatin modulating (e.g. modifying) agent capable of modulating (e.g. modifying) the chromatin structure to a non-diseased form.
28. A pharmaceutical composition comprising a chromatin modulating agent according to claim 27 and a pharmaceutically acceptable carrier, diluent, excipient or adjuvant or any combination thereof.
29. A method of preventing and/or treating a disorder comprising administering a chromatin modulating agent according to claim 27 wherein said chromatin modulating agent is capable of modulating an HS to cause a beneficial preventative and/or therapeutic effect.
30. Use of a chromatin modulating agent according to claim 27 in the preparation of a pharmaceutical composition for the treatment of an HS related disorder.
31. An HS consensus sequence identifiable, preferably identified using the methods of any one of claims 1 to 21 or a variant, derivative, or homologue thereof.
32. An HS sequence identifiable, preferably identified using the methods of any one of claims 1 to 24 or a variant, derivative, or homologue thereof.
33. An HS sequence according to claim 32, comprising the sequence set forth in SEQ ID NO.3 to SEQ ID NO.55 or a variant, derivative, or homologue thereof.
34. A weight matrix identifiable, preferably identified by the methods of any one of claims 1 to 21.
35. A weight matrix according to claim 34, wherein the weight matrix is a PSSM.
36. A recording medium bearing machine-readable instructions for implementing the methods according to any one of claims 1-27 and 29.
37. A computer system loaded with machine-readable instructions for implementing the methods according to any one of claims 1-27 and 29.
PCT/GB2003/002895 2002-07-04 2003-07-04 Method for identifying hypersensitive site consensus sequences Ceased WO2004005547A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003281288A AU2003281288A1 (en) 2002-07-04 2003-07-04 Method for identifying hypersensitive site consensus sequences

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GB0215547A GB0215547D0 (en) 2002-07-04 2002-07-04 Method
GB0215547.1 2002-07-04
PCT/GB2002/003080 WO2003004702A2 (en) 2001-07-05 2002-07-04 Method for determining chromatin structure
GBPCT/GB02/003080 2002-07-04

Publications (2)

Publication Number Publication Date
WO2004005547A2 true WO2004005547A2 (en) 2004-01-15
WO2004005547A3 WO2004005547A3 (en) 2004-03-25

Family

ID=30117081

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2003/002895 Ceased WO2004005547A2 (en) 2002-07-04 2003-07-04 Method for identifying hypersensitive site consensus sequences

Country Status (2)

Country Link
AU (1) AU2003281288A1 (en)
WO (1) WO2004005547A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1692262B1 (en) * 2003-10-27 2018-08-15 Merck Sharp & Dohme Corp. Method of designing sirnas for gene silencing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0116453D0 (en) * 2001-07-05 2001-08-29 Imp College Innovations Ltd Method

Also Published As

Publication number Publication date
WO2004005547A3 (en) 2004-03-25
AU2003281288A8 (en) 2004-01-23
AU2003281288A1 (en) 2004-01-23

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP