
WO2002033590A1 - Protein interaction extraction system - Google Patents

Info

Publication number
WO2002033590A1
WO2002033590A1 PCT/SG2001/000217 SG0100217W WO0233590A1 WO 2002033590 A1 WO2002033590 A1 WO 2002033590A1 SG 0100217 W SG0100217 W SG 0100217W WO 0233590 A1 WO0233590 A1 WO 0233590A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
pathways
new
interactions
abstracts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/SG2001/000217
Other languages
English (en)
Other versions
WO2002033590A8 (fr
Inventor
See Kiong Ng
Lim Soon Wong
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Singapore
Original Assignee
National University of Singapore
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Singapore
Priority to AU2002211194A: AU2002211194A1 (en)
Priority to EP01979208A: EP1327208A1 (fr)
Publication of WO2002033590A1 (fr)
Publication of WO2002033590A8 (fr)
Current legal status: Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Definitions

  • the present invention relates to the field of research, particularly the searching, scanning and / or analysis of a voluminous amount of information available in databases where the latest scientific discoveries are often lodged and first reported online and are accessible by scientists worldwide. More particularly, the present invention relates to the research undertaken and reported related to the biotechnology and pharmaceutical industries.
  • the main problems in existing prior art include their inability to perform automatic extraction, their inability to let a user (biologist) specify the specific type of protein interaction/pathways to extract, their inability to let a user (biologist) modify the visual presentation of pathways, their inability to let a user (biologist) attach queries to the visual presentation of pathways, and / or their inability to let a user (biologist) specify a schedule for monitoring new additions/changes to his pathways.
  • an object of the present invention is to provide a research system and method, which addresses prior art problems.
  • the present invention seeks to address at least one of the problems mentioned above.
  • One aspect of the present invention may be referred to as 'PIES', meaning Protein Interaction Extraction System, and is directed to providing a means for automatic discovery and presentation of biological pathways from on-line text abstracts.
  • the present invention is also adapted to (a) perform automatic extraction of protein interaction and other information from online scientific literature, (b) let a user/biologist specify the specific type of protein interaction/pathways to extract, (c) generate pathway maps hyperlinked to supporting research articles on the World Wide Web ; (d) let a user/biologist modify the visual presentation of pathways, (e) let a user/biologist attach queries to the visual presentation of pathways, and / or (f) let a user/biologist specify a schedule for monitoring new additions/changes to his pathways.
  • the PIES combines technologies that (1) retrieve research abstracts from online sources, (2) extract relevant information from the free texts, (3) present the extracted information graphically and intuitively, and optionally (4) allow (possibly customised) queries to be attached and launched from the graphical interface. It can also be set to periodically/routinely scan online scientific literature for automatic discovery of knowledge, giving modern scientists the necessary competitive edge in managing the information explosion in this electronic age.
  • Another aspect of the present invention is directed to a method and apparatus adapted to: a. scanning information related to a selected topic, b. extracting information based on predetermined criteria, c. generating an initial graphical representation of the extracted information, the improvement comprising d. Providing an updated graphical representation by reiterating a, b and c on new information, not yet scanned, and e. alerting a user to the presence of new extracted information based on the new information.
  • the extracted information is obtained using natural language processing as published in SK Ng, M Wong, "Toward routine automatic pathway discovery from on-line scientific text abstracts", Genome Informatics, 10:104-112, December 1999.
  • the extracted information is obtained using a keyword based search.
  • the updated graphical representation is rendered in a visually distinctive manner compared to the initial graphical representation.
  • the manner in which the visual distinctiveness is rendered is not important, merely that it is visually distinctive. For example, a different colour may be used to show each update.
  • Another aspect of the present invention is directed to a more sophisticated method and apparatus to extend and manipulate the extracted protein interaction pathways by a. allowing the user (biologist) to specify additional keywords and then carrying out the following tasks automatically: search the Internet using these additional keywords for more scientific text abstracts, extract additional protein interaction information from these abstracts, and extend the current pathways with these additional protein interaction information. b. allowing the user (biologist) to specify two sets of extracted pathways and then automatically constructing a set of new pathways by merging these two sets of extracted pathways. c. allowing the user (biologist) to specify one or more distinct nodes (which correspond to proteins) in a set of pathways and then automatically constructing a new set of pathways by merging these two nodes and their associated interactions and other information. d.
  • allowing the user (biologist) to specify a node and then automatically constructing a new set of pathways by replicating the specified node and its associated interactions and information h. allowing the user (biologist) to specify an edge and then automatically constructing a new set of pathways by deleting this edge. i. allowing the user (biologist) to specify an edge and then automatically constructing a new set of pathways by reversing the direction of this edge, j. allowing the user (biologist) to specify an edge and then automatically inverting the nature of this edge; in other words, make an inhibition edge into an activation edge and vice versa. k. allowing the user (biologist) to specify a new edge and then automatically constructing a new set of pathways by adding this new edge.
  • Figure 1 illustrates a keyword type search used for scanning.
  • Figure 2 illustrates an abstract of extracted information.
  • Figure 3 illustrates a graphical representation of the extracted information, in which the edge (from p21 to Cdk2) is highlighted in red to indicate that it is a newly discovered protein interaction.
  • the basic idea of the PIES is outlined in the following steps:
  • the user describes the kind of online text abstracts he is interested in by filling in a form such as that shown in Figure 1.
  • This specification (e.g. "universal kinase inhibitor") is recorded in a folder kept for that user.
  • the PIES then retrieves text abstracts satisfying his specification. This retrieval can be repeated automatically by the PIES at regular intervals, to monitor newly published abstracts.
  • Some examples satisfying the specification (e.g. "universal kinase inhibitor") are given in Figure 2.
  • the PIES then identifies important sentences from these abstracts, such as: "p21 effectively inhibits Cdk2, Cdk3, Cdk4, and Cdk6 kinases (Ki 0.5-15 nM) but is much less effective toward Cdc2/cyclin B (Ki approximately 400 nM) and Cdk5/p35 (Ki > 2 microM), and does not associate with Cdk7/cyclin H." It extracts from them precise protein interactions such as:
      p21 --inhibit--> Cdk2
      p21 --inhibit--> Cdk3
      p21 --inhibit--> Cdk4
      p21 --inhibit--> Cdk6
  • the PIES then creates a graphical presentation of these protein interactions.
  • the arcs in the presentation can be hyperlinked to Medline articles from which that particular protein interaction information was extracted. The user can then be notified (by email or other means) and access the presentation stored in his folder.
  • An example presentation is shown in Figure 3. If an old presentation already exists in the user's folder, the newly discovered protein interactions can be highlighted for the user, as shown in red in Figure 3. It is also possible for the user to edit and directly manipulate the graphical presentation. For example, the user points and clicks on two nodes and causes the two nodes and their associated interactions and information to be merged automatically. It is also possible for him to attach some standard or customised queries to the arcs and nodes in the presentation. An example of a standard query is "retrieve the amino acid sequence corresponding to this protein." Implementation details are now described.
  • the PIES is composed of five modules and an underlying logical representation of the extracted pathways. We first provide an overview of the purpose of each of these modules, depicted below.
    [Figure: overview of the modules, starting from "user inputs domain selection keywords"]
  • the BioKleisli-Abstracts module is a software module constructed using the Kleisli query system (SY Chung, L Wong, "Kleisli, a new tool for data integration in biology", Trends in Biotechnology, 17(9):351-355, 1999). It is our query engine for retrieving scientific abstracts from online bibliographic resources on the Internet.
  • the BioNLP module is a software module constructed to process the free texts in scientific abstracts retrieved by the BioKleisli-Abstracts module. It identifies protein names mentioned in the free texts and performs function word pattern matching to discover protein-protein interactions expressed in these abstracts (SK Ng, M Wong, "Toward routine automatic pathway discovery from on-line scientific text abstracts", Genome Informatics, 10:104-112, December 1999).
  • the BioKleisli-Graphs module is a software module also constructed using the Kleisli query system. It is our query engine that provides the logical representation of the extracted pathways. It also provides operations for sophisticated manipulations of the logical representation. These operations include:
    • allowing the user (biologist) to specify additional keywords and then carrying out the following tasks automatically: search the Internet using these additional keywords for more scientific text abstracts, extract additional protein interaction information from these abstracts, and extend the current pathways with this additional protein interaction information.
    • allowing the user (biologist) to specify two sets of extracted pathways and then automatically constructing a set of new pathways by merging these two sets of extracted pathways.
  • the Graph-Layout module is a software module for computing a visually pleasant layout for displaying the logical representation. In particular, this module assigns preliminary x-y co-ordinates that are to be used in the graphical representation of the extracted pathways.
  • the BioJAKE module is a visualisation engine to graphically display the information in the logical representation and to manage the constructed pathway maps in an intuitive manner to the user (Salamonsen et al., "BioJAKE: A tool for the creation, visualisation, and manipulation of metabolic pathways", Proc. Pacific Symposium on Biocomputing, 392-400, 1999).

NOTATIONS

  • the types unit, ..., string are the usual base or atomic types.
  • the type (#l1 : t1, ..., #ln : tn) is a record type having fields l1, ..., ln, and these fields have types t1, ..., tn respectively.
  • the type {t} is a set whose elements have type t.
  • the type {|t|} is a bag whose elements have type t.
  • the type [t] is a list whose elements have type t.
  • a variant type is similar to the concept of a tagged union (R. Hull et al., "The Format model: A theory of database organisation", J. ACM, 31(3):518-537, 1984) or the union type of the C programming language with explicit user-created tags.
  • the main programming constructs are the following:
  • E3 defines a set such that E(x, y) is in it if and only if x is in E1, y is in E2(x), and E3(x, y) is true for this particular x and y.
  • the construct E.#l means the value of the field l of E. The construct (#l1 : E1, ..., #ln : En) builds a record having fields l1, ..., ln with values E1, ..., En respectively. If f is a function, the construct f(E) means the result of applying f to E.
  • the construct <#l : E> builds a variant whose value is E explicitly tagged by the label l.
  • the construct case E of <#l1 : x1> => E1 or ... or <#ln : xn> => En checks to see if E is an l1, ..., or ln variant; if it is a variant <#li : E'>, then it assigns E' to xi and returns the value of Ei.
  • Comparison operations such as equality test, logical connectives such as negation, string operations such as substring test, and speciality bioinformatics operations such as sequence alignment, hidden Markov models, and access to MEDLINE, are provided in the Kleisli query system.
  • BioKleisli-Abstracts retrieves relevant abstracts from MEDLINE and organises them for subsequent analysis by the BioNLP module. It can be implemented on top of the Kleisli query system using the following program script. writefile ⁇ x
  • the Kleisli operation ml-get-uid-general is used to obtain the unique identifiers of MEDLINE abstracts that match the user's specification. Then, for each unique identifier u, the Kleisli operation ml-get-abstract-by-uid is used to obtain the corresponding MEDLINE abstract. Finally, each abstract x is stored in a file called articles.
  • the file articles has the following schema or type: {(#muid : num, #authors : string, #address : string,
  • Each record has a muid field which stores the unique identifier of the article, an authors field which stores names of authors, an address field which stores addresses of authors, a title field which stores the title of the article, an abstract field which stores the abstract of the article, and a journal field which stores the journal issue in which the article was published.

BioNLP MODULE

  • The details of the BioNLP module are given in the paper SK Ng, M Wong, "Toward routine automatic pathway discovery from on-line scientific text abstracts", Genome Informatics, 10:104-112, December 1999. We provide an outline here.
  • the BioNLP module is a rule-based system that performs simple natural language processing on the extracted scientific abstracts using pattern matching. There are two major tasks in extracting protein-protein interaction information from scientific abstracts:
  • Protein name identification: Straightforward use of a dictionary of protein names is inadequate in this domain because new names are continuously being invented and quoted in medical and biological papers. The names of the new proteins must therefore be identified by linguistic means.
  • Information extraction: Co-occurrence of protein names in an article abstract, a sentence, or a phrase generally implies that the proteins are related in some way. Such co-occurrence is a useful heuristic for extracting specific protein-protein interactions from free texts. There are two corresponding sets of rules in BioNLP specifying the patterns for identifying protein names and for extracting specific protein-protein interactions from free texts.
  • protein names can still be difficult to identify, as some of the protein names are long compound words or have multiple variants.
  • (i) Exclusion by standard dictionaries. BioNLP filters out most of the non-proper nouns in the abstracts by looking up the words in a classical dictionary. (ii) Inclusion with semantic clues. Proper nouns that are not recognised, but are linked together by protein-protein interaction function words (e.g., "activate" or
  • BioNLP has a protein dictionary for the rapid identification of common protein names. This dictionary also allows the re-inclusion of protein names that are made up of the nouns excluded by (i).
  • the dictionary may be manually edited by a user, or automatically learned from the protein-protein interactions subsequently extracted from the abstracts.
  • BioNLP can be generalised further to recognise names of small molecules and drugs. Then it would be possible to use BioNLP to extract interactions of proteins, small molecules, and drugs from scientific abstracts, as well as pure protein-protein interactions.
  • the names of small molecules can be recognised by incorporating a dictionary of small molecules and lexical rules of the popular SMILES notations used for denoting names of small molecules.
  • the names of drugs can be recognised by incorporating a dictionary of drug names, which can be obtained from various public and/or government drug registries.
  • BioNLP maintains a set of function words for each interaction type. These function words can be edited by the user. They serve as keys into the literature for seeking out sentences that may contain protein-protein interaction information. For example, some of the key function words for the inhibit-activate relationship are:
      inhibitor: {inhibit, suppress, negatively regulate, ...}
      activator: {activate, induce, upregulate, positively regulate, ...}
  • BioNLP seeks out sentences containing any of the function words and then searches for any protein names mentioned. These protein names are then associated with the function words using a suite of pattern matching rules to determine their actor-patient roles. Some examples of the pattern matching rules are shown below. In these examples, both <A> and <B> can denote an individual protein name or a conjunction of protein names, while <fn> denotes a matched function word:
      (i) <A> ... <fn> ... <B>: This rule models the basic sentence pattern such as "A inhibits B, C, and D".
      (ii) <A> ... <fn> of ... <B>: This rule models sentences such as "A, an activator of B, is found to be lacking in the patient population".
      (iii) <A> ... <fn> by ... <B>: This rule models sentences in the passive voice.
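As a rough illustration of rule (i), the following Python sketch finds the first function word in a sentence and pairs every protein name before it with every protein name after it. It is a simplification under stated assumptions, not the BioNLP implementation: protein names are supplied as a known set instead of being identified linguistically, only single-word function word stems are handled, and the names `FUNCTION_WORDS`, `match_function_word`, and `extract_rule_i` are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple

# Single-word stems only; multi-word function words such as
# "negatively regulate" are left out of this sketch.
FUNCTION_WORDS: Dict[str, Tuple[str, ...]] = {
    "inhibit": ("inhibit", "suppress"),
    "activate": ("activate", "induce", "upregulate"),
}


def match_function_word(token: str) -> str:
    """Return the interaction type whose stem starts `token`, or ""."""
    lowered = token.lower()
    for interaction, stems in FUNCTION_WORDS.items():
        if any(lowered.startswith(stem) for stem in stems):
            return interaction
    return ""


def extract_rule_i(sentence: str, proteins: Set[str]) -> List[Tuple[str, str, str]]:
    """Apply rule (i): names before the function word act on names after it."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    for idx, token in enumerate(tokens):
        interaction = match_function_word(token)
        if interaction:
            actors = [t for t in tokens[:idx] if t in proteins]
            patients = [t for t in tokens[idx + 1:] if t in proteins]
            return [(a, interaction, p) for a in actors for p in patients]
    return []


triples = extract_rule_i("A inhibits B, C, and D", {"A", "B", "C", "D"})
```

Rules (ii) and (iii) would need their own handling of "of" and "by", since in the passive voice the actor and patient positions are swapped.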
  • The BioNLP module is a program that processes the articles file retrieved by the BioKleisli-Abstracts module. It produces a file that we denote by the name interactions here.
  • the file interactions has the following schema or type.
  • the file is a set of records.
  • Each record has the following fields.
  • the field muid stores the unique identifier of the article in which a protein-protein interaction is extracted.
  • the field sentence stores the sentence in the article in which the protein-protein interaction is extracted.
  • the matched field stores the particular rule used to recognise that protein-protein interaction.
  • the interaction field stores the protein-protein interaction extracted.
  • the protein-protein interaction is stored either as an inhibit variant or as an activate variant. In either variant, the name of the protein in the actor role and the name of the protein in the patient role are stored.
  • If BioNLP is generalised to also recognise small molecules and drugs, it would also produce an additional file called molecules.
  • the file molecules is a set of records. Each record stores the name of a molecule and its type (i.e., whether the molecule is a protein, a small molecule, or a drug).

THE LOGICAL REPRESENTATION

  • the files articles and interactions constitute the logical representation of the extracted pathways. It can be thought of as a graph whose nodes are the actors and patients of the interaction extracted, whose directed edges connect up the interacting actors and patients, and these edges are annotated with the type of the interactions and associated information.
  • This conceptual graph is a set of records or edges.
  • Each record or edge has an edge-start field which stores the starting point or node of the edge, an edge-end field which stores the ending point or node of the edge, an edge-type field which stores whether the edge is "inhibit" or "activate", and an edge-anno field which stores associated information of that edge.
  • the associated information is the set of evidence from which the edge is derived. Each piece of evidence comprises the sentence that mentions the interaction, the unique identifier (muid) of the article that contains that sentence, the authors and title of that article, and the BioNLP rule used to match that sentence (matched).
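One way to picture how the conceptual graph is derived from the two files is the Python sketch below. It assumes, for illustration only, that each file is a list of dictionaries and that the interaction field is a `(type, actor, patient)` tuple standing in for the inhibit or activate variant; the function name `build_graph` and the sample data are hypothetical.

```python
from collections import defaultdict


def build_graph(articles, interactions):
    """Group interaction records into edges annotated with their evidence."""
    by_muid = {a["muid"]: a for a in articles}
    evidence = defaultdict(list)
    for rec in interactions:
        kind, actor, patient = rec["interaction"]
        art = by_muid.get(rec["muid"], {})
        evidence[(actor, patient, kind)].append({
            "sentence": rec["sentence"],
            "muid": rec["muid"],
            "authors": art.get("authors"),
            "title": art.get("title"),
            "matched": rec["matched"],
        })
    return [
        {"edge_start": s, "edge_end": e, "edge_type": t, "edge_anno": ev}
        for (s, e, t), ev in evidence.items()
    ]


articles = [
    {"muid": 1, "authors": "A. Author", "title": "T1"},
    {"muid": 2, "authors": "B. Author", "title": "T2"},
]
interactions = [
    {"muid": 1, "sentence": "p21 inhibits Cdk2.", "matched": "rule-i",
     "interaction": ("inhibit", "p21", "Cdk2")},
    {"muid": 2, "sentence": "Cdk2 is inhibited by p21.", "matched": "rule-iii",
     "interaction": ("inhibit", "p21", "Cdk2")},
]
graph = build_graph(articles, interactions)
```

Two records that report the same interaction collapse into a single edge whose annotation lists both pieces of evidence.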
  • The delta graph is just a copy of the conceptual graph, but each record is augmented with two additional fields: an is-new-interaction field which is set to true if and only if the interaction corresponding to that record is an interaction found in delta-interactions, and a has-new-evidence field which is set to true if and only if the edge-anno field of that record contains an muid of an article in delta-articles.
  • the BioKleisli-Graphs module provides sophisticated manipulations on the extracted pathways by operating on the underlying logical representation. We describe each of these manipulations and their embodiment in the BioKleisli-Graphs module. We use the files articles and interactions to denote the logical representation of the current set of extracted pathways. We use the files new-articles and new-interactions to denote the resulting new logical representation of each manipulation. Note that the embodiments given here are chosen to maximise understanding rather than performance, as anyone skilled in the programming art would be able to produce more optimised (but harder to understand) implementations once the purpose of the embodiments is understood.
  • One of the manipulations is to allow the user (biologist) to specify a protein in the extracted pathways and then automatically delete that protein and its associated interactions from the extracted pathways.
  • Let KILL denote the protein specified by the user. Then this manipulation corresponds to iterating through each interaction stored in the interactions file and keeping only those in which neither the actor nor the patient is the protein KILL, as implemented in the Kleisli program script below. writefile ⁇
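For readers unfamiliar with Kleisli, the same filtering step can be sketched in Python. The representation is an assumption made for the example: each interaction is flattened to an `(actor, type, patient)` tuple, and `delete_protein` is a hypothetical name.

```python
def delete_protein(interactions, kill):
    """Keep only interactions in which `kill` is neither actor nor patient."""
    return {
        (actor, kind, patient)
        for (actor, kind, patient) in interactions
        if actor != kill and patient != kill
    }


interactions = {
    ("p21", "inhibit", "Cdk2"),
    ("p53", "activate", "p21"),
    ("X", "activate", "Y"),
}
new_interactions = delete_protein(interactions, "p21")
```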
  • this manipulation is accomplished by taking the union of the files articles and a-articles and the union of the files interactions and a-interactions, eliminating all duplicated records. It is implemented in the Kleisli program script below, where {+} is the set union operator provided by the Kleisli query system.
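In Python terms, the merge amounts to two set unions, which remove duplicated records automatically. The sketch below assumes hashable tuples for both kinds of record (`(muid, title)` for articles and `(actor, type, patient)` for interactions); `merge_pathways` and the sample data are hypothetical.

```python
def merge_pathways(articles, interactions, a_articles, a_interactions):
    """Union the two logical representations; sets drop duplicate records."""
    new_articles = set(articles) | set(a_articles)
    new_interactions = set(interactions) | set(a_interactions)
    return new_articles, new_interactions


new_articles, new_interactions = merge_pathways(
    {(1, "T1"), (2, "T2")},
    {("p21", "inhibit", "Cdk2")},
    {(2, "T2"), (3, "T3")},
    {("p21", "inhibit", "Cdk2"), ("p53", "activate", "p21")},
)
```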
  • One of the manipulations is to allow the user (biologist) to specify several proteins in the current set of pathways and then to automatically extract all interactions up-stream or down-stream of these proteins involving up to a specified number of intermediary proteins. This manipulation is particularly useful in the following two situations:
  • the user wishes to export a specific portion of the current set of pathways that he is working on to another biologist.
  • the user wishes to concentrate on a specific portion of the current set of pathways that he is working on. This manipulation is useful in these situations because as the set of pathways grows, it is likely that a set of proteins that are logically close together (that is, they interact through a small number of intermediary proteins) may be separated by a large distance in the graphical layout.
  • Let NODES denote the subset of proteins specified by the user.
  • Let RADIUS denote the maximum number of intermediary proteins allowed by the user. Then this manipulation corresponds to the process of starting at each node in the conceptual graph that corresponds to a protein in NODES and traversing the conceptual graph up to RADIUS many edges. In terms of the logical representation of the pathways, this manipulation can be accomplished by following these steps:
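A minimal Python sketch of such a bounded traversal is given below. It is one possible reading, not the steps of the application: edges are `(actor, type, patient)` tuples, they are followed in both directions so that up-stream and down-stream interactions are collected, and the function name `neighbourhood` is hypothetical.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str, str]  # (actor, type, patient)


def neighbourhood(edges: Iterable[Edge], nodes: Set[str], radius: int) -> Set[Edge]:
    """Collect the edges reachable within `radius` hops of any node in `nodes`."""
    edges = set(edges)
    visited = set(nodes)
    frontier = set(nodes)
    kept: Set[Edge] = set()
    for _ in range(radius):
        # Edges touching the frontier, regardless of direction.
        touched = {e for e in edges if e[0] in frontier or e[2] in frontier}
        kept |= touched
        reached = {n for e in touched for n in (e[0], e[2])} - visited
        if not reached:
            break
        visited |= reached
        frontier = reached
    return kept


pathway = {
    ("a", "activate", "b"),
    ("b", "inhibit", "c"),
    ("c", "activate", "d"),
}
near_a = neighbourhood(pathway, {"a"}, 1)
```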
  • One of the manipulations is to allow a user to specify two differently-named proteins and to indicate that they are actually the same protein and then to automatically merge them and their associated interactions in the current set of pathways.
  • This manipulation is useful because it is often the case that the same protein is named differently by different biologists.
  • this manipulation implies that edges connected to the two nodes should now be connected to the merged node. Thus, it is equivalent to renaming the first of the two nodes to the second node.
  • Let FIRST and SECOND denote the two specified proteins. Then, in terms of the logical representation of pathways, this manipulation can be accomplished by iterating through the file interactions and renaming each actor and patient that matches FIRST to SECOND.
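The renaming can be sketched in Python as follows, again assuming `(actor, type, patient)` tuples. The function `merge_proteins` is a hypothetical name, and the protein names in the example are chosen only for illustration.

```python
def merge_proteins(interactions, first, second):
    """Rename every occurrence of `first` to `second`, as actor or patient."""
    def rename(name):
        return second if name == first else name

    return {(rename(a), kind, rename(p)) for (a, kind, p) in interactions}


interactions = {
    ("p21", "inhibit", "Cdk2"),
    ("WAF1", "inhibit", "Cdk4"),
    ("p53", "activate", "WAF1"),
}
merged = merge_proteins(interactions, "WAF1", "p21")
```

Because the result is a set, interactions that become identical after renaming are kept only once.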
  • One of the manipulations is to allow the user to specify a previously obtained set of pathways and then to automatically extract its differences from the current set of pathways.
  • PIES regularly re-executes the search specified by the user.
  • a use for this manipulation is for identifying what is new in the current set of pathways (obtained after a re-execution of the search) relative to an older set of pathways (obtained before the re-execution of the search).
  • Let the files a-articles and a-interactions denote the logical representation of the specified previously obtained set of pathways. Then this manipulation is implemented in the Kleisli program script below by taking the set difference between the respective logical representations.
  • the delta-articles and delta-interactions can be used to highlight the new interactions found in the current set of pathways.
  • edges that correspond to delta-interactions can be highlighted in red to indicate that they are new discoveries.
  • edges whose annotations reference some articles in delta-articles (and thus the has-new-evidence field of the corresponding record in the delta graph is true) can be highlighted in green to indicate that there is new evidence for these known interactions.
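The difference and the resulting highlighting can be sketched together in Python. The data layout is assumed for the example: articles are reduced to sets of muids, interactions are `(actor, type, patient)` tuples, and `evidence` maps each current interaction to the muids that support it. The function names are hypothetical, and black is assumed as the colour for unchanged edges.

```python
def diff_pathways(articles, interactions, a_articles, a_interactions):
    """Set difference of the current representation against an older one."""
    return articles - a_articles, interactions - a_interactions


def delta_graph(evidence, delta_articles, delta_interactions):
    """Attach the two delta flags to every edge of the current graph."""
    return {
        edge: {
            "is_new_interaction": edge in delta_interactions,
            "has_new_evidence": bool(muids & delta_articles),
        }
        for edge, muids in evidence.items()
    }


def edge_colour(flags):
    """Red for new interactions, green for new evidence, black otherwise."""
    if flags["is_new_interaction"]:
        return "red"
    if flags["has_new_evidence"]:
        return "green"
    return "black"


evidence = {
    ("p21", "inhibit", "Cdk2"): {1, 2},
    ("p21", "inhibit", "Cdk4"): {3},
    ("X", "activate", "Y"): {1},
}
delta_articles, delta_interactions = diff_pathways(
    {1, 2, 3}, set(evidence), {1}, {("p21", "inhibit", "Cdk2"), ("X", "activate", "Y")}
)
colours = {
    edge: edge_colour(flags)
    for edge, flags in delta_graph(evidence, delta_articles, delta_interactions).items()
}
```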
  • One of the manipulations is to allow the user to specify a protein and a new name and then to automatically introduce that new name into the current set of pathways as a new protein and to automatically replicate the interactions and other information of the specified protein for this new protein.
  • this manipulation gives the newly named protein exactly the same interactions and information as the specified protein.
  • two different groups of biologists may use the same name for two different proteins.
  • Let NODE-A denote this node. Then this manipulation can be used to rectify this problem as follows.
  • The user uses this manipulation to replicate NODE-A and its interactions and other information to a new node, which we denote NODE-B here.
  • The user then wants NODE-A to represent the protein of the first group of biologists and NODE-B the protein of the second group of biologists. So he uses the manipulation for deleting individual interactions to delete the appropriate interactions from NODE-A and NODE-B.
  • the implementation of this manipulation in terms of the logical representation is an iteration through the file interactions and for each record that mentions NODE-A as the actor or the patient, a new copy of that record is made with NODE-A replaced by NODE-B.
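A small Python sketch of this replication, assuming `(actor, type, patient)` tuples and a hypothetical function name `replicate_protein`:

```python
def replicate_protein(interactions, node_a, node_b):
    """Copy every interaction that mentions `node_a`, substituting `node_b`."""
    copies = set()
    for actor, kind, patient in interactions:
        if node_a in (actor, patient):
            copies.add((
                node_b if actor == node_a else actor,
                kind,
                node_b if patient == node_a else patient,
            ))
    return set(interactions) | copies


interactions = {("X", "activate", "Y"), ("Z", "inhibit", "X")}
replicated = replicate_protein(interactions, "X", "X2")
```

The original records are kept, so the user can afterwards prune each node's interactions separately.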
  • The remaining manipulations are the "small-scale" manipulations that affect one interaction at a time.
  • One of the manipulations is to allow the user to specify one interaction and to delete it from the current set of pathways. Let KILL denote the interaction to be deleted. Then it can be realised on the logical representation by an iteration through the file interactions and deleting each record whose interaction field matches KILL.
  • Let INVERT denote the specified interaction. This manipulation can be implemented in terms of the logical representation by iterating through the interactions file and modifying each record whose interaction field matches INVERT, changing that field from an inhibit variant to an activate variant or vice versa.
  • The following Kleisli program script is a possible implementation. writefile
  • One of the manipulations is to allow the user to specify a new interaction and to insert it into the current set of pathways.
  • this manipulation is a straightforward addition of a new record into the interactions file.
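The three single-interaction manipulations (delete, invert, insert) are simple enough to sketch side by side in Python. As before, the `(actor, type, patient)` tuple layout and the function names are assumptions made for illustration.

```python
FLIPPED = {"inhibit": "activate", "activate": "inhibit"}


def delete_interaction(interactions, kill):
    """Remove the single interaction `kill`."""
    return set(interactions) - {kill}


def invert_interaction(interactions, invert):
    """Swap inhibit and activate on the single interaction `invert`."""
    result = set()
    for edge in interactions:
        if edge == invert:
            actor, kind, patient = edge
            result.add((actor, FLIPPED[kind], patient))
        else:
            result.add(edge)
    return result


def add_interaction(interactions, new):
    """Insert the single interaction `new`."""
    return set(interactions) | {new}


pathway = {("p21", "inhibit", "Cdk2"), ("p53", "activate", "p21")}
inverted = invert_interaction(pathway, ("p21", "inhibit", "Cdk2"))
```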
  • the "graph" corresponding to the logical representation is a conceptual one. It has nodes and edges that connect these nodes. To make it a physical one that is visible on a two-dimensional screen, we need to assign to each node an x-y coordinate that indicates its position on the screen, so that directed lines or arcs connecting the nodes can be drawn to represent the edges.
  • the Graph Layout module is a program for computing suitable x-y co-ordinates to assign to the nodes so that when displayed, the graph will be visually pleasant.
  • the Graph Layout module is based on an algorithm for drawing directed graphs previously disclosed in the paper, ER Gansner et al., "A Technique for Drawing Directed Graphs", IEEE Trans. Software Engineering, 19(3):214-230, 1993. We describe the basic idea here. For convenience of description, we lay out the graph in a top-to-bottom manner on the screen. The method can be easily adapted for a left-to-right layout.
  • the graph is divided into connected components and each connected component is laid out separately. Given a connected component, a depth-first traversal is used to obtain a tree from it. Each node in the tree is assigned a rank. The node at the root of the tree is assigned rank 0, its children rank 1, and so on.
  • the screen is divided into horizontal strips. Nodes at rank k are assigned to strip k. Each strip is divided into as many vertical regions as there are nodes assigned to this strip. For each strip, the nodes assigned to it are sorted to minimise crossing of connecting arcs. So a node of rank k that is in position j in the sorted ordering among all nodes at that rank is assigned to region j of strip k. The initial co-ordinates for that node are the centre of that region.
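The ranking and region assignment can be sketched in Python for a single connected component. Several simplifications are assumed here and are not taken from the application: every node is reachable from the chosen root along directed edges, nodes within a strip are ordered alphabetically as a placeholder for the crossing-minimisation sort, and the screen width, strip height, and the name `layout_component` are hypothetical.

```python
from collections import defaultdict


def layout_component(edges, root, width=800.0, strip_height=100.0):
    """Assign each node the centre of its region in a top-to-bottom layout."""
    children = defaultdict(list)
    for start, end in edges:
        children[start].append(end)

    # Depth-first traversal: a node's rank is its depth in the DFS tree.
    rank = {}

    def visit(node, depth):
        rank[node] = depth
        for child in sorted(children[node]):
            if child not in rank:
                visit(child, depth + 1)

    visit(root, 0)

    # Group nodes by rank; each rank is one horizontal strip.
    strips = defaultdict(list)
    for node, k in rank.items():
        strips[k].append(node)

    coords = {}
    for k, nodes in strips.items():
        nodes.sort()  # placeholder for the crossing-minimisation ordering
        region_width = width / len(nodes)
        for j, node in enumerate(nodes):
            coords[node] = ((j + 0.5) * region_width, (k + 0.5) * strip_height)
    return coords


coords = layout_component(
    [("p53", "p21"), ("p21", "Cdk2"), ("p21", "Cdk4")], root="p53"
)
```

These are only initial co-ordinates; the user can still reposition nodes afterwards in the display.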
  • the BioJAKE module is based on the BioJAKE system previously disclosed in
  • red colour is used for that line, as a means to indicate that this line is a new interaction. If the is-new-interaction field is false and the has-new-evidence field is true, green colour is used for that line, as a means to indicate that this line is an existing interaction for which there is new evidence. Otherwise, black colour is used for the line.
  • the molecules file is provided, the nodes are displayed using different graphical icons so that nodes denoting proteins, small molecules, and drugs can be differentiated visually. After displaying the graph as above, the user is allowed to freely modify the display by the means of drag-and-drop and point-and-click to re-position the nodes and connecting arrowed lines as desired. He is also allowed to carry out other BioJAKE operations disclosed in Salamonsen et al., "BioJAKE: A tool for the creation, visualisation, and manipulation of metabolic pathways", Proc. Pacific Symposium on Biocomputing, 392-400, 1999.
  • A menu is also added to BioJAKE to allow the user to invoke the additional ("large-scale") manipulation operations provided by the BioKleisli-Graphs module. Note that the "small-scale" manipulations of the BioKleisli-Graphs module are already available in the original implementation of BioJAKE disclosed in Salamonsen et al., "BioJAKE: A tool for the creation, visualisation, and manipulation of metabolic pathways", Proc. Pacific Symposium on Biocomputing, 392-400, 1999.
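
The initial placement described in the Graph Layout bullets (connected components, depth-first ranks, strips and regions) can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation. The function names, the screen `width`, the `strip_height`, the choice of root, and the treatment of edges as undirected during the traversal are all assumptions. Alphabetical ordering within a strip stands in for the crossing-minimising sort, which is not reproduced here.

```python
from collections import defaultdict


def dfs_ranks(root, adj):
    """Rank every node reachable from root by its depth in a depth-first tree."""
    ranks = {}

    def visit(node, depth):
        ranks[node] = depth
        for neighbour in sorted(adj[node]):
            if neighbour not in ranks:
                visit(neighbour, depth + 1)

    visit(root, 0)
    return ranks


def initial_layout(nodes, edges, width=800.0, strip_height=100.0):
    """Return one {node: (x, y)} dict per connected component.

    Assumption: edge direction is ignored while traversing, so that every
    node of a component is reached from the (arbitrarily chosen) root.
    """
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    placed = set()
    layouts = []
    for node in nodes:
        if node in placed:
            continue
        ranks = dfs_ranks(node, adj)      # covers exactly one component
        placed.update(ranks)

        strips = defaultdict(list)        # rank k -> nodes in strip k
        for member, rank in ranks.items():
            strips[rank].append(member)

        coords = {}
        for rank, members in strips.items():
            members.sort()                # placeholder for the crossing-minimising sort
            region_width = width / len(members)
            for j, member in enumerate(members):
                # Centre of region j in strip `rank` (top-to-bottom layout).
                coords[member] = ((j + 0.5) * region_width,
                                  (rank + 0.5) * strip_height)
        layouts.append(coords)
    return layouts
```

Each component gets its own coordinate space here, reflecting the statement that components are laid out separately. How the components are then arranged on one screen is left open.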
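
The record insertion and line-colouring rules mentioned in the bullets above might look like this in outline. It is a sketch under stated assumptions: the `Interaction` record layout and the function names are hypothetical, and the interactions file is modelled as an in-memory list rather than the actual file format, which this text does not specify. The red case is assumed to apply when the is-new-interaction field is true, since the start of that sentence is cut off in this text.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    # Hypothetical record layout; only the two boolean fields are named in the text.
    source: str
    target: str
    is_new_interaction: bool = False
    has_new_evidence: bool = False


def add_interaction(interactions, record):
    """Insert a user-specified interaction by appending one new record."""
    interactions.append(record)
    return record


def line_colour(record):
    """Pick the colour of the line drawn for an interaction."""
    if record.is_new_interaction:      # assumed condition for red
        return "red"
    if record.has_new_evidence:        # existing interaction with new evidence
        return "green"
    return "black"
```

For example, with placeholder molecule names, `line_colour(Interaction("P1", "P2", has_new_evidence=True))` selects green, and a record with both fields false is drawn in black.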

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Bioethics (AREA)
  • Physiology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

This protein interaction extraction system is a means for the automatic discovery of pathways from on-line text abstracts. It combines technologies that: (a) retrieve abstracts from on-line sources, (b) extract relevant information about protein interactions from free text, (c) present the extracted information graphically and intuitively and, optionally, (d) allow queries (possibly customised) to be attached to and launched from the free interface. In addition, the system can support sophisticated manipulations of the extracted pathways, and it can also be set to scan the on-line scientific literature periodically or routinely so that new protein interactions can be discovered, giving today's scientists a competitive edge in managing the information explosion of this electronic age.
PCT/SG2001/000217 2000-10-18 2001-10-18 A protein interaction extraction system Ceased WO2002033590A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
AU2002211194A AU2002211194A1 (en) 2000-10-18 2001-10-18 A protein interaction extraction system
EP01979208A EP1327208A1 (fr) 2000-10-18 2001-10-18 A protein interaction extraction system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US69073800A 2000-10-18 2000-10-18
US09/690,738 2000-10-18

Publications (2)

Publication Number Publication Date
WO2002033590A1 (fr) 2002-04-25
WO2002033590A8 WO2002033590A8 (fr) 2002-07-18

Family

ID=24773741

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2001/000217 Ceased WO2002033590A1 (fr) A protein interaction extraction system 2000-10-18 2001-10-18

Country Status (3)

Country Link
EP (1) EP1327208A1 (fr)
AU (1) AU2002211194A1 (fr)
WO (1) WO2002033590A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1739584A4 (fr) * 2004-03-30 2008-07-23 Shigeo Ihara Document information processing system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000049540A1 (fr) * 1999-02-19 2000-08-24 Cellomics, Inc. Method and system for dynamic storage retrieval and analysis of experimental data with determined relationships

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000049540A1 (fr) * 1999-02-19 2000-08-24 Cellomics, Inc. Method and system for dynamic storage retrieval and analysis of experimental data with determined relationships

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
P.D. KARP: "Metabolic databases", TRENDS IN BIOCHEMICAL SCIENCES, vol. 23, no. 2, 1 March 1998 (1998-03-01), pages 114 - 116, XP004111317 *
S.M. PALEY AND P.D. KARP: "Adapting EcoCyc for use on world wide web", GENE, vol. 172, no. 1, 1 June 1996 (1996-06-01), pages GC43 - GC50, XP004042696 *

Also Published As

Publication number Publication date
EP1327208A1 (fr) 2003-07-16
WO2002033590A8 (fr) 2002-07-18
AU2002211194A1 (en) 2002-04-29

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

AK Designated states

Kind code of ref document: C1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: C1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

CFP Corrected version of a pamphlet front page
CR1 Correction of entry in section i
WWE Wipo information: entry into national phase

Ref document number: IN/PCT/2002/00741/DE

Country of ref document: IN

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2001979208

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 2001979208

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

NENP Non-entry into the national phase

Ref country code: JP