
WO2001043051A9 - Computer method and apparatus for revealing promoter motifs - Google Patents

Computer method and apparatus for revealing promoter motifs

Info

Publication number
WO2001043051A9
WO2001043051A9 PCT/US2000/042469 US0042469W WO0143051A9 WO 2001043051 A9 WO2001043051 A9 WO 2001043051A9 US 0042469 W US0042469 W US 0042469W WO 0143051 A9 WO0143051 A9 WO 0143051A9
Authority
WO
WIPO (PCT)
Prior art keywords
repeats
genome
motifs
series
computer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2000/042469
Other languages
English (en)
Other versions
WO2001043051A3 (fr)
WO2001043051A2 (fr)
Inventor
Betsey D Dyer
Mark D Leblanc
Glen Aspeslagh
Nathan P Buggia
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOARD OF TRUSTEES OF WHEATON COLLEGE
TRUSTEES OF WHEATON COLLEGE BO
Original Assignee
BOARD OF TRUSTEES OF WHEATON COLLEGE
TRUSTEES OF WHEATON COLLEGE BO
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOARD OF TRUSTEES OF WHEATON COLLEGE
Publication of WO2001043051A2
Publication of WO2001043051A9
Publication of WO2001043051A3
Legal status: Ceased

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B45/00 ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Definitions

  • Genomics is the field concerning the analysis of the structure and function of the complete DNA sequence (genome) of any organism or, in the case of viruses, the complete DNA or RNA sequence. This includes the parts of the sequence designated as genes (or putative genes) as well as all of the intergenic sequences, some of which regulate gene use and chromosomal structure, but much of which is of as yet unknown function.
  • a cell has an operational center called the nucleus which contains structures called chromosomes.
  • chromosomes are formed of deoxyribonucleic acid (DNA) and associated protein molecules. Structurally, each chromosome may have tens of thousands of genes.
  • DNA molecules encode all the information necessary for creating and maintaining life of the organism. See Human Genome Program, U.S. Department of Energy, "Primer on Molecular Genetics", Washington, D.C., 1992.
  • the shape of a DNA molecule can be thought of as a twisted ladder. That is, the DNA molecule is formed of two parallel side strands of sugar and phosphate molecules connected by orthogonal/cross pieces (rungs) of nitrogen-containing chemicals called bases. Each long side strand is formed of a particular series of units called nucleotides.
  • Each nucleotide comprises one sugar, one phosphate and a nitrogenous base.
  • the order of the bases in this series is called the DNA sequence.
  • Each rung forms a relatively weak bond between respective bases, one on each side strand.
  • base pairs refers to the bases at opposite ends of a rung, with one base being on one side strand of the DNA molecule and the other base being on the second side strand of the DNA molecule. Genome size or sequence length is typically stated in terms of number of base pairs.
  • The four bases are adenine (A), thymine (T), cytosine (C) and guanine (G).
  • Adenine pairs only with thymine, and cytosine pairs only with guanine.
  • a DNA sequence is represented in writing using A's, C's, T's and G's (respective abbreviations for the bases) in corresponding series or character strings. That is, the ACTG's are written in the order of the nucleotides of the subject DNA molecule.
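The pairing rules and the character-string representation described above can be illustrated with a short Python sketch. It is only an illustration and not part of the patent text; the names `complement` and `reverse_complement` are hypothetical.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
PAIRS = str.maketrans("ACGT", "TGCA")

def complement(strand: str) -> str:
    """Return the base-by-base partner strand, in the same left-to-right order."""
    if set(strand) - set("ACGT"):
        raise ValueError("sequence may contain only A, C, G and T")
    return strand.translate(PAIRS)

def reverse_complement(strand: str) -> str:
    """Return the partner strand read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]

print(complement("GATTACA"))          # CTAATGT
print(reverse_complement("GATTACA"))  # TGTAATC
```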
  • each DNA molecule contains many genes.
  • a gene is a specific sequence of nucleotide bases. These sequences carry the information required for constructing proteins.
  • a protein is a large molecule formed of one or more chains of amino acids in a specific order. Order is determined by base sequence of nucleotides in the gene coding for the protein. Each protein has a unique function.
  • The protein-coding sequences of genes are called exons. Non-coding sequences called introns are interspersed within many genes. The balance of DNA sequences in the genome are other non-coding regions or intergenic regions.
  • the DNA sequence specifies the genetic instructions required to create a particular organism with its own unique traits and at the same time provides a text (character string) environment in which to study the same.
  • the completion of the genome sequence of Caenorhabditis elegans marks the beginning of what is likely to be years of database mining of this genome, for the purpose of cataloguing, organizing and interpreting actual or putative regulatory motifs (i.e., interesting gene subsequences) by which this multicellular eukaryote coordinates the development and maintenance of differentiation (e.g., Brown, S.M., Biotechniques 26, 266-268 (1999); Clarke, N.
  • the present invention provides a timely and potentially useful in silico (computer-based) discovery tool for promoter elements.
  • the principle behind the invention is that repeat sequences of all kinds, including inverted repeats, direct repeats, mirror repeats and everted repeats, are known to be functional motifs in the promoter regions of many genes in a diversity of organisms. Functions of such repeats include, but are not limited to: (1) binding sites for regulatory factors; (2) opportunities for internal base-pairing and subsequent regulation in mRNAs; and (3) transposon-like mechanisms for the rearrangement and regulation of genes.
  • functional repeats of all kinds are not limited to perfect ones but may include a certain number of mismatches and other irregularities.
  • Applicants have produced a computer-based, searchable, annotated catalogue for Caenorhabditis chromosome III and chromosome X of all possible repeats of sizes ranging from 20-200 base pairs with 0-10% mismatches and loops of up to one third the length of the repeat.
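How a mismatch tolerance and a loop might enter a repeat test can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicants' implementation: it handles inverted repeats only, it treats the loop as a run of unpaired bases in the middle of the window, it applies the mismatch fraction to one arm, and it measures the one-third loop limit against the whole window. The function name and parameters are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def is_inverted_repeat(window: str, loop: int = 0, max_mismatch: float = 0.10) -> bool:
    """True if the two arms of `window`, separated by `loop` central bases,
    are reverse complements of each other within the mismatch tolerance."""
    arm = (len(window) - loop) // 2
    if arm < 1 or 2 * arm + loop != len(window):
        return False                  # the two arms must have equal length
    if loop > len(window) / 3:
        return False                  # assumed reading of the one-third loop limit
    left = window[:arm]
    right_rc = window[arm + loop:].translate(COMPLEMENT)[::-1]
    mismatches = sum(a != b for a, b in zip(left, right_rc))
    return mismatches <= max_mismatch * arm

print(is_inverted_repeat("GAATTC"))            # True: perfect, no loop
print(is_inverted_repeat("GAACCTTC", loop=2))  # True: 3 bp arms around a 2 bp loop
print(is_inverted_repeat("GAATTA"))            # False: 1 mismatch in a 3 bp arm
```

With 20 bp arms, the default tolerance of 10% admits up to two mismatched positions.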
  • the database may be accessed through queries based on size, location or sequence.
  • Each repeat is identified in respect to location, nearest downstream gene (with links to the Caenorhabditis genome project), and frequency, and includes a list of similar sequences.
  • Applicants have produced a more general, computer-based, searchable lexicon of any possible motif.
  • the lexicon database serves as a "dictionary" for queries of specific motifs of any length.
  • Annotated results are returned from a query and include the locations of each occurrence, the nearest downstream gene, and statistics that include but are not limited to comparing the likelihood of finding "this" particular motif of a certain length in an organism versus the likelihood of finding the motif in a sequence of random pairs.
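One simple way to compare the observed frequency of a motif with a random expectation is sketched below. The source does not specify the statistic, so this is an assumption: a sequence of N independent, equiprobable bases contains a given motif of length k an expected (N - k + 1) / 4^k times. All function names are hypothetical.

```python
def count_occurrences(genome: str, motif: str) -> int:
    """Count occurrences of `motif` in `genome`, including overlapping ones."""
    last_start = len(genome) - len(motif)
    return sum(genome.startswith(motif, i) for i in range(last_start + 1))

def expected_random_count(genome_length: int, motif_length: int) -> float:
    """Expected occurrences in a random sequence with equiprobable bases."""
    positions = max(genome_length - motif_length + 1, 0)
    return positions / 4 ** motif_length

def enrichment(genome: str, motif: str) -> float:
    """Ratio of the observed count to the random expectation."""
    if len(motif) > len(genome):
        raise ValueError("motif is longer than the sequence")
    return count_occurrences(genome, motif) / expected_random_count(len(genome), len(motif))

print(count_occurrences("AAAA", "AA"))   # 3
print(expected_random_count(1000, 2))    # 62.4375
```

A ratio well above 1 suggests the motif is more frequent than chance alone would predict. A real genome has unequal base frequencies, so a more careful model would weight each base by its observed frequency.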
  • the present invention provides a method for analyzing a known genome, where the genome is represented by a series of characters in a certain sequential order.
  • the method includes the steps of: (i) locating motifs having inverted, everted, direct and/or mirror repeats in the genome; (ii) recording the located motifs and repeats in a data store; and (iii) connecting the database to a computer network and enabling network users to search and browse the motifs from the data store and hypothesize functionality of the same. For each motif recorded in the data store, there are indications of location in the genome series, length of the motif and nearest gene.
  • Computer apparatus of the present invention thus provides a user-interactive computer search tool for analyzing the subject genome and for considering motifs in context of each other across the genome, as well as comparatively to known promoter motifs/regulatory sites of other genomes.
  • the invention apparatus (a) includes a search engine, a browser and drawing member supported by the database, and (b) enables revelation of promoter motifs of the subject genome as a function of motifs rendered from the database.
  • computer apparatus searches the provided genomic information for motifs, stores found motifs and corresponding information about the motifs into a data store, and enables various subsequent displays of the repeats for visual analysis. Before storage, the invention computer apparatus verifies each motif's uniqueness. In other words, the computer apparatus verifies that there are no nested motifs.
  • the data store may be located on a separate computer system connected via a computer network. The data store is accessible by end users throughout the computer network such as the World Wide Web.
  • the invention computer apparatus locates individual repeats, motifs or any other specified sequence upon user command.
  • the user selectable search criteria may include the location, the sequence, or the length of the repeat or motif.
  • the invention computer apparatus enables organizing and arranging of the repeat or motif information as a function of user specified terms and provides screen displays for comparison with the existing genes.
  • the screen displays employ coloring or other visual effects (underlining, windowing, highlighting, etc.), such that the invention computer apparatus allows the analysis of repeats or motifs on the basis of other highlighted or known motifs.
  • the present invention generates custom screen view displays annotating motifs which lead to deciphering gene regulation of the subject genome.
  • Figs. 1A and 1B are block diagrams of a computer system 14 embodying the present invention, including a database builder procedure employed to find desired motifs in a given genome and store repeat and motif data in a database of the present invention.
  • Fig. 2 is a flow diagram illustrating use of the search engine, drawing utility and browser of Fig. 1B in the present invention.
  • Fig. 3 is a schematic representation of one embodiment of the invention database of Figs. 1A and 1B.
  • Figs. 4A and 4B are graphical illustrations of display screen views supported by the present invention.
  • the present invention provides a computer method and apparatus for analyzing genomes and revealing likely promoter motifs as illustrated in Figs. 1A, 1B and 2.
  • Illustrated in Fig. 1A is a computer system embodying the present invention. Included in the computer system are a digital processor 12 and a set of software programs or other digitally executed means 14 for forming a database 23 of repeat and motif information from an input sequence 10. That is, digital processor 12 executes invention software 14 to perform the steps discussed below in Fig. 1B.
  • genomic DNA sequence is downloaded (step 11) from a file such as those at the NCBI web-site into computer memory or stored on disk. The bases are ordered beginning with the 5' end of the DNA sequence to the 3' end. At step 13, an 8 base pair (bp) window is used to analyze one portion of the input sequence at a time.
  • the motif is recorded in database 23 and then the subject window is analyzed (step 15) as to whether it is an inverted repeat, mirror repeat, everted repeat or direct repeat (with or without loops of designated size) subject to allowable tolerance levels. If the windowed sequence portion is one of these types of repeats, it is stored (step 17) in the database 23 along with other information.
  • the 8 bp window is then moved one bp to the right (toward the 3' end of the DNA sequence) and the analysis is repeated (19).
  • steps 13 and 15 are repeated for a larger window of 9 bp and so on, up to a window length of 200 bp (or greater) using loop 21.
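The windowed scan described in steps 13 through 21 can be sketched as two nested loops, one over window lengths and one over start positions. This is a simplified sketch under stated assumptions, not the patent's builder procedure: it tests only perfect repeats without loops, so odd-length windows never match, and it omits everted repeats because telling them apart from inverted repeats needs an orientation convention that the text does not give. The repeat definitions in `classify_window` are the conventional ones and all names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def classify_window(window: str):
    """Return 'direct', 'mirror' or 'inverted' for an even-length window
    whose second half matches the first half accordingly, else None."""
    if len(window) % 2:
        return None
    half = len(window) // 2
    left, right = window[:half], window[half:]
    if right == left:
        return "direct"
    if right == left[::-1]:
        return "mirror"
    if right == reverse_complement(left):
        return "inverted"
    return None

def scan(sequence: str, min_len: int = 8, max_len: int = 200):
    """Slide windows of every length from min_len to max_len, one base at a time."""
    hits = []
    for length in range(min_len, min(max_len, len(sequence)) + 1):
        for start in range(len(sequence) - length + 1):
            kind = classify_window(sequence[start:start + length])
            if kind:
                hits.append((start, length, kind))
    return hits

print(scan("TTGAATTCTT", min_len=6, max_len=6))  # [(2, 6, 'inverted')]
```

Because every window length is tried at every position, the work grows with the product of sequence length and the number of window sizes, which is why the results are computed once and stored rather than recomputed per query.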
  • At step 16, the windowed sequence portion determined to be a repeat is checked for uniqueness. In other words, a check is made to ensure that the repeat does not contain nested repeats already in the database 23. If the repeat is unique, then step 17 stores the repeat in the database 23. If the repeat is not unique (step 20), the previous repeat is removed from the database 23 and the new repeat information is added (step 22) to the database 23.
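The uniqueness check can be sketched as follows, assuming that "nested" means a previously stored repeat whose span lies entirely inside the span of the new repeat. The in-memory list stands in for database 23, and the name `add_repeat` and the tuple layout are hypothetical.

```python
def add_repeat(store: list, new: tuple) -> None:
    """Add a repeat given as (start, length, kind), first removing any
    stored repeat that is nested inside the new repeat's span."""
    new_start, new_end = new[0], new[0] + new[1]
    store[:] = [
        r for r in store
        if not (new_start <= r[0] and r[0] + r[1] <= new_end)
    ]
    store.append(new)

store = []
add_repeat(store, (2, 6, "inverted"))
add_repeat(store, (0, 10, "inverted"))   # contains the first repeat
add_repeat(store, (20, 8, "direct"))     # unrelated location
print(store)  # [(0, 10, 'inverted'), (20, 8, 'direct')]
```

Since the scan proceeds from smaller to larger windows, a later repeat can only contain an earlier one, never the reverse, so checking in this one direction is enough for the sketch.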
  • the database 23 stores information for each repeat discovered by invention software routine 14 in what is known as a flat file format. In one embodiment illustrated in Fig.
  • the information may consist of a unique identifier or site name 57, repeat type 66, location with respect to the nearest upstream gene 59, name of the nearest upstream gene 61, location with respect to the 5' end of the DNA sequence 63, the sequence 65 of the repeat, and length 67 of the repeat.
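The fields listed above map naturally onto a small record type. The sketch below assumes a tab-separated layout for the flat file, which the source does not specify; the class name, field names and sample values are hypothetical, and the comments give the reference numerals from the text.

```python
from dataclasses import dataclass, astuple

@dataclass
class RepeatRecord:
    site_name: str      # unique identifier or site name (57)
    repeat_type: str    # repeat type (66)
    gene_offset: int    # location with respect to the nearest gene (59)
    nearest_gene: str   # name of the nearest gene (61)
    position: int       # location from the 5' end of the sequence (63)
    sequence: str       # sequence of the repeat (65)
    length: int         # length of the repeat (67)

    def to_line(self) -> str:
        """Serialize as one tab-separated flat-file line."""
        return "\t".join(str(value) for value in astuple(self))

    @classmethod
    def from_line(cls, line: str) -> "RepeatRecord":
        """Parse a line produced by to_line()."""
        name, kind, offset, gene, pos, seq, length = line.rstrip("\n").split("\t")
        return cls(name, kind, int(offset), gene, int(pos), seq, int(length))

record = RepeatRecord("site0001", "inverted", -150, "geneA", 10432, "GAATTC", 6)
print(record.to_line())
```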
  • the database 23 may be implemented as a Unix flat file or as a relational database such as a Microsoft Access database.
  • the database links to or otherwise provides a data store of known genes located within the input DNA sequence 10. This enables the database 23 to provide the downstream position for a given repeat relative to the known gene. Referring back to Fig. 1B, the database storage 23 is accessed by a search engine 25, drawing utility 27, and browser or other front end applications 29.
  • the front end application 29 queries the database 23 via the Structured Query Language (SQL) or the like.
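A query by length, nearest gene or embedded subsequence might look like the sketch below. It uses Python's built-in `sqlite3` module purely as a stand-in for the database product named in the text; the table name, column names and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE repeats (
    site_name TEXT PRIMARY KEY, repeat_type TEXT, gene_offset INTEGER,
    nearest_gene TEXT, position INTEGER, sequence TEXT, length INTEGER)""")
conn.executemany("INSERT INTO repeats VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("site0001", "inverted", -150, "geneA", 10432, "GAATTC", 6),
    ("site0002", "direct", -90, "geneA", 10492, "ACGTACGT", 8),
    ("site0003", "mirror", -300, "geneB", 55010, "ACGTTGCA", 8),
])

def find_repeats(conn, min_length=0, nearest_gene=None, contains=None):
    """Return site names matching the given criteria, ordered by position."""
    sql = "SELECT site_name FROM repeats WHERE length >= ?"
    params = [min_length]
    if nearest_gene is not None:
        sql += " AND nearest_gene = ?"
        params.append(nearest_gene)
    if contains is not None:
        sql += " AND instr(sequence, ?) > 0"
        params.append(contains)
    sql += " ORDER BY position"
    return [row[0] for row in conn.execute(sql, params)]

print(find_repeats(conn, min_length=8))         # ['site0002', 'site0003']
print(find_repeats(conn, nearest_gene="geneA")) # ['site0001', 'site0002']
```

Passing the user-supplied values as bound parameters rather than formatting them into the SQL string matters for a catalogue exposed over a network.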
  • the front end application 29 may also provide information such as frequency, pattern recognition of the sequence data, and may filter or sort the data. For example, AT rich repeats can be filtered.
  • An AT rich repeat consists exclusively of A's and T's in the character representation of the repeat sequence.
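Under this definition, filtering AT rich repeats reduces to a set comparison on the characters of each sequence. A minimal sketch with hypothetical function names:

```python
def is_at_rich(sequence: str) -> bool:
    """True if the non-empty sequence consists only of A's and T's."""
    return bool(sequence) and set(sequence.upper()) <= {"A", "T"}

def drop_at_rich(sequences):
    """Filter out AT rich repeats, keeping the original order."""
    return [s for s in sequences if not is_at_rich(s)]

print(drop_at_rich(["ATATTAAT", "GAATTC", "TTTTAAAA"]))  # ['GAATTC']
```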
  • digital processor 12 is a server or network node (i.e., Internet Web site)
  • the search engine 25, drawing utility 27, and browser/front end 29 are available via the Internet (HTTP protocol) and provide in essence a sharable, searchable, browsable catalogue of repeats for a particular genome.
  • a typical procedure utilizing the invention system of Figs. 1A and 1B for promoter discovery might include the following user activities shown in Fig. 2.
  • An end user queries the database/catalogue of repeats 23 and extracts repeat information.
  • the browser 31 is used to note any anomalous abundances or absences of repeats (especially in promoter regions), as well as which genes and other repeats are associated and in which configurations (steps 33, 35).
  • the browser 31 is used to search in the vicinity of genes of interest (step 37) to see what other genes or repeats are associated.
  • “Genes of interest” might include those that appear to be up- or down-regulated in transcription profiling studies as well as those known to be part of co-regulated pathways. Indeed, transcription profiling is an important complementary tool to this invention.
  • A search function 39 (such as implemented by the search engine 25 of Fig. 1B) may be used in conjunction with the browser 31 to generate and note frequency lists of repeats 41, 47, to search published or putative motifs (step 43) and to search in the vicinity of particular genes (step 45). Whether the browser 31 or the search function 39 is used, the subsequent steps include the comparison of intergenic regions, either intraspecifically or interspecifically, to find potential new motifs and patterns of repeats 49. In these cases, the browser 31 and especially the search function 39 may be used again to determine frequencies and positions of potential motifs (step 51).
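A frequency list of the kind mentioned above can be built by counting how often each value of a chosen field occurs among the retrieved repeats. The sketch below assumes the records arrive as dictionaries; the function name, keys and sample data are hypothetical.

```python
from collections import Counter

def frequency_list(records, key):
    """Return (value, count) pairs for one field, most frequent first."""
    return Counter(record[key] for record in records).most_common()

records = [
    {"sequence": "GAATTC", "nearest_gene": "geneA"},
    {"sequence": "ACGTACGT", "nearest_gene": "geneA"},
    {"sequence": "GAATTC", "nearest_gene": "geneB"},
]
print(frequency_list(records, "sequence"))      # [('GAATTC', 2), ('ACGTACGT', 1)]
print(frequency_list(records, "nearest_gene"))  # [('geneA', 2), ('geneB', 1)]
```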
  • the accumulated data on repeats in promoters from the invention system may then be used to interpret the results of transcription profiles (step 53) or to focus mutational studies in the lab (step 55) as well as other lab studies.
  • this invention is an in silico approach to guiding promoter research in the laboratory by making possible the rapid, extensive and methodical exploration of promoter regions.
  • the drawing utility 27 provides visual display and graphical illustration of the repeat and gene data retrieved from data store 23.
  • the drawing utility 27 presents a graphical view of repeats clustering near a certain gene. Such a view is supported by data retrieved from database 23 on repeats located near the gene, i.e., data resulting from a search on the nearest gene field 61 (being set to the subject gene) and from location with respect to that gene (field 59 in Fig. 3).
  • Other information (e.g., repeat length, input sequence position/location, etc.) on each displayed repeat is also provided as a function of user selection or action in the screen view of Fig. 4A.
  • the drawing utility 27 supports different colored display of different types of repeats and different regions. Common display coloration techniques are employed.
  • Fig. 4B shows inverted repeats displayed in one color (boxed) and 14 bp overlap regions in another color (square brackets). Motifs GTGAC and TAGGTCA are highlighted (underlined), which visually reveals how numerous they are in this region. With such a display, the end user is able to see certain patterns and make certain assessments, for example that the satellite-like configuration here resembles those reported in other heat shock promoters. Restated, the present invention generates custom screen views annotating motifs, which leads to deciphering gene regulation of the subject genome.
  • the present invention enables end users to view data and graphical illustrations of motifs and repeats and analyze motifs including repeats relative to each other and/or across the given genome.
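The underlining of motifs in a screen view can be imitated in plain text by printing a row of marks beneath the sequence. This sketch is not the drawing utility 27 itself; the function names are hypothetical and the sample sequence is invented so that it contains the two motifs named above.

```python
def find_all(sequence: str, motif: str):
    """Return every start position of `motif`, including overlapping hits."""
    last_start = len(sequence) - len(motif)
    return [i for i in range(last_start + 1) if sequence.startswith(motif, i)]

def underline(sequence: str, motifs) -> str:
    """Return the sequence with a second line marking each motif with carets."""
    marks = [" "] * len(sequence)
    for motif in motifs:
        for start in find_all(sequence, motif):
            marks[start:start + len(motif)] = "^" * len(motif)
    return sequence + "\n" + "".join(marks).rstrip()

sample = "AAGTGACTTTAGGTCAGTGACA"   # invented sequence for illustration
print(find_all(sample, "GTGAC"))    # [2, 16]
print(underline(sample, ["GTGAC", "TAGGTCA"]))
```

A graphical front end would substitute colors or boxes for the caret row, but the position lookup is the same.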
  • the present invention enables comparisons of motifs in the database 23 to known regulatory sites of other known genomes.
  • the present invention in effect provides annotation to repeats and motifs in the given genome. In turn, this allows the end user to make initial determinations about regulatory sites at the motif sites in the subject genome.
  • the present invention provides a novel apparatus and method for discovering or revealing promoter motifs (sites that regulate the expression of genes) in a genome sequence.
  • Other types of repeats may be included. Such other types include mirrored repeats and everted repeats.
  • data store of repeat information may include indications of motifs, regulatory sites or sequences known to have interesting functionality.

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to a computer search tool and a supporting database for the analysis of genomes. The database is built from a search for inverted repeats and other repeats in a given genome, in particular motifs having such repeats. Indications of the repeats, their length, their location in the genome sequence and the nearest gene, as found, are recorded in the database. The computer search tool comprises a search engine, a browser and a drawing member. The search engine responds to user queries for certain repeats (e.g., specific ones by length and/or by embedded subsequence). The browser displays a graphical representation (generated by the drawing member) of the genome with the repeats highlighted. The computer search tool of the present invention enables analysis of the highlighted repeats against known regulatory sites/promoter motifs of other known sequences, or comparatively to each other across the genome.
PCT/US2000/042469 1999-11-30 2000-11-30 Computer method and apparatus for revealing promoter motifs Ceased WO2001043051A2 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US16805099P 1999-11-30 1999-11-30
US60/168,050 1999-11-30

Publications (3)

Publication Number Publication Date
WO2001043051A2 (fr) 2001-06-14
WO2001043051A9 (fr) 2002-08-01
WO2001043051A3 (fr) 2002-10-10

Family

ID=22609894

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/042469 Ceased WO2001043051A2 (fr) 1999-11-30 2000-11-30 Computer method and apparatus for revealing promoter motifs

Country Status (1)

Country Link
WO (1) WO2001043051A2 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110491449B (zh) 2012-07-06 2023-08-08 河谷控股Ip有限责任公司 Management of healthcare analytic streams
CN110993033A (zh) * 2019-11-14 2020-04-10 北京诺禾致源科技股份有限公司 Method, system and apparatus for processing genomic data

Also Published As

Publication number Publication date
WO2001043051A3 (fr) 2002-10-10
WO2001043051A2 (fr) 2001-06-14

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): CA

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
AK Designated states

Kind code of ref document: C2

Designated state(s): CA

AL Designated countries for regional patents

Kind code of ref document: C2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

COP Corrected version of pamphlet

Free format text: PAGES 1/6-6/6, DRAWINGS, REPLACED BY NEW PAGES 1/6-6/6; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

AK Designated states

Kind code of ref document: A3

Designated state(s): CA

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

122 Ep: pct application non-entry in european phase