WO2003019417A1

WO2003019417A1 - System and method for proteome analysis and data management

Info

Publication number: WO2003019417A1
Application number: PCT/KR2002/001624
Authority: WO
Inventors: Yong-Ho In; Jee-Hyub Kim; Yong-Wook Kim; Hyung-Yong Kim; Jin-Hee Kim; Su-Jin Chae; Jae-Eun Chung; Tae-Jin Eom
Original assignee: BIOINFOMATIX Inc
Current assignee: BIOINFOMATIX Inc
Priority date: 2001-08-29
Filing date: 2002-08-29
Publication date: 2003-03-06
Anticipated expiration: 2004-02-29

Abstract

A system and method for storing, searching for, and analyzing proteome-related information obtained through 2-dimensional electrophoresis, which is a common biological research technique, are provided. The system provides a proteome search and analysis function in a client/server environment using an experimental database storing experimental proteome data and the results of identifying unknown experimental proteins and a reference proteome database storing a large amount of validated reference proteome data. The system can search for desired data using keywords, 2-dimensional gel images, protein expression profiles, isoelectric point, molecular weight, peptide mass fingerprinting (PMF) data, protein sequence information, etc., and characterizes an unknown protein based on the searched data.

Description

SYSTEM AND METHOD FOR PROTEOME ANALYSIS AND DATA MANAGEMENT

Technical Field

The present invention relates to a proteome analysis system, and more particularly, to a system and method for storing, searching for, and analyzing proteome-related information collected through 2-dimensional electrophoresis, which is a common research technique applied in the biological field.

Background Art

"Proteome" is a compound word of "protein" and "ome" that collectively denotes all kinds of proteins. In general, single-cellular organisms have a consistent proteome pattern in a cell, whereas multi-cellular organisms have different proteome patterns in individual cells with an identical genome. In other words, although multi-cellular organisms have their original genome, the genome is expressed into different proteome patterns in particular cells or under particular conditions. Identifying the kinds of proteins in cells, the degrees of protein expression, any modifications and their sites, and the interactions of proteins is collectively referred to as "proteomics". Identifying all kinds of proteins expressed in cells, and the network of those proteins, through proteomic methods leads to a better understanding of life phenomena, which originate from genes and are expressed through proteins.

Current proteomic research can be classified into two categories: one involving isolation of a group of proteins and then individual constituent proteins from a cell; and the other involving assaying the isolated proteins, for example, through 2-dimensional polyacrylamide gel electrophoresis (2D-PAGE). 2D-PAGE includes processes of staining proteins of interest and an enzymatic cleavage using proteases, which are performed using an automated assay device and computer. Scientists often use high-performance mass spectrometry (MS) in order to accurately identify individual proteins isolated through 2D-PAGE.

A huge amount of proteome information collected through such an assay process is stored in a medium for analysis by researchers. An effective analysis of proteome data requires a database storing the results of proteomic experiments and an integrated search using both the experimental database and a reference database. However, experimental proteome data, reference databases, and a variety of proteome analysis programs use different data formats. Thus, it was inconvenient for users to store, search for, and analyze data under different environments.

Disclosure of the Invention

The present invention provides a system and method for proteome analysis and data management, in which proteins are isolated and identified, and the results are organized on a project basis in a collaborative research environment over a network.

The present invention provides an efficient integrated system and method for proteome analysis and data management, in which an experimental proteome database and a previously established reference database, which are physically separated from one another, are efficiently integrated in a client/server environment for data storage, search, and analysis.

The present invention provides a system and method for proteome analysis and data management, in which desired proteome data can be searched for using keywords, 2-dimensional gel images, protein expression profiles, isoelectric point, molecular weight, peptide mass fingerprinting (PMF), and protein sequence information. Also, proteins can be characterized based on the searched data.

The present invention provides a system and method for proteome analysis and data management, in which proteome data in an experimental database and a reference database can be exchanged and integrated with data from another system.

According to an aspect of the present invention, there is provided a system for proteome analysis and data management, the system comprising a first database, a second database, a proteome identification unit, a data management unit, an interface, a proteome search unit, and a proteome analysis unit. The first database stores a large amount of validated reference proteome data. The second database stores experimental proteome data obtained through experiments. The proteome identification unit identifies experimental proteome data using the reference proteome data stored in the first database. The data management unit controls an input and output of data to and from the first and second databases. The interface receives one of experimental proteome data and a search parameter input from a user. The proteome search unit searches for experimental proteome data throughout the second database corresponding to one of the experimental proteome data and the search parameter from a user, and extracts detailed information on the experimental proteome data identified through searching from the first proteome database. The proteome analysis unit analyzes the searched results from the proteome search unit to characterize the identified experimental proteome data.

According to another aspect of the present invention, there is provided a method for establishing an experimental proteome database, the method comprising: (a) inputting experimental proteome data; (b) searching throughout a first database storing a large amount of validated reference proteome data for proteome data similar to the experimental proteome data using PMF (peptide mass fingerprinting) data, a ratio of isoelectric point and molecular weight, and protein sequence information among the input experimental proteome data; (c) performing proteome identification based on the searched result; and (d) storing the experimental proteome data and the identified result in a second database.

According to another aspect of the present invention, there is provided a method for storing and editing experimental proteome data, the method comprising: (a) determining whether to create a new project file; (b) if it is determined in step (a) to create a new project file, inputting project management information required to create the new project file; (c) determining whether to retrieve data of a previous project file for the new project file; (d) if it is determined in step (c) to retrieve data from the previous project file, loading the data of the previous project file that are common to the new project; (e) if it is determined in step (a) not to create a new project file, determining whether to edit a previous project file stored in an experimental proteome database; and (f) if it is determined in step (e) to edit a previous project file, selecting a project file to be edited among a number of previous project files stored in the experimental proteome database and editing the data of the selected project file.

According to another aspect of the present invention, there is provided a proteome analysis method comprising: (a) selecting a search method; (b) if a keyword search option is selected as the search method, inputting a keyword for searching; (c) searching a first database storing experimental proteome data and a second database storing a large amount of validated reference proteome data for proteome data corresponding to the input keyword; (d) if an image search option is selected as the search method, loading a 2-D gel image for searching; (e) designating a spot of protein on the 2-D gel image; (f) searching the first and second databases for proteome data corresponding to the position of the designated spot; (g) if an advanced search option is selected as the search method, searching the first and second databases for similar proteome data by using at least one of peptide mass fingerprinting (PMF) data, a ratio of isoelectric point and molecular weight, and protein sequence information among the proteome data as a search parameter; and (h) displaying the search result obtained in step (c), (f), or (g).

According to another aspect of the present invention, there is provided a similar protein search method comprising: (a) determining whether to use a 2-D gel image obtained through electrophoresis for a similar protein search; (b) if it is determined to use a 2-D gel image for a similar protein search, designating a spot of protein of interest on the 2-D gel image; (c) obtaining the experimental isoelectric point and molecular weight of the protein from the coordinate value of the spot; (d) if it is determined not to use a 2-D gel image for a similar protein search, directly inputting the experimental isoelectric point and molecular weight of a protein of interest, a search range, name of species, and a ratio of isoelectric point and molecular weight; (e) adjusting the ratio of the x-axis and y-axis of the 2-D gel image by the ratio of isoelectric point and molecular weight and calculating the Euclidian distance between the protein of interest and each identified protein stored in a reference proteome database using the experimental isoelectric point and molecular weight obtained in step (c) or (d) and the theoretical isoelectric points and molecular weights extracted from the reference proteome database storing a large amount of validated reference proteome data; and (f) sorting and outputting the searched proteins in order of increasing Euclidian distance.
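The distance calculation of steps (e) and (f) can be sketched as follows. This is a minimal illustration in Python and not the implementation of the invention: the names `ReferenceProtein`, `pi_mw_search`, `pi_weight`, and `mw_weight`, the treatment of the search range as a distance cutoff, and the sample entries are all assumptions introduced here. The two weights stand in for the ratio of isoelectric point and molecular weight by which the axes are rescaled before the Euclidian distance is taken.

```python
import math
from dataclasses import dataclass


@dataclass
class ReferenceProtein:
    entry: str      # hypothetical entry number in the reference database
    species: str
    pi: float       # theoretical isoelectric point
    mw: float       # theoretical molecular weight in Da


def pi_mw_search(exp_pi, exp_mw, references, species=None,
                 pi_weight=1.0, mw_weight=1.0, search_range=None):
    """Rank reference proteins by weighted Euclidean distance in (pI, MW) space.

    pi_weight and mw_weight are assumed scale factors for the two axes;
    search_range is assumed to be an optional cutoff on the distance.
    """
    hits = []
    for ref in references:
        if species is not None and ref.species != species:
            continue
        d_pi = pi_weight * (ref.pi - exp_pi)
        d_mw = mw_weight * (ref.mw - exp_mw)
        distance = math.hypot(d_pi, d_mw)
        if search_range is None or distance <= search_range:
            hits.append((distance, ref))
    # Smaller distance means a more similar protein, so sort ascending.
    hits.sort(key=lambda pair: pair[0])
    return hits


# Invented sample data, for illustration only.
DEMO_REFERENCES = [
    ReferenceProtein("REF001", "rat", 5.0, 30000.0),
    ReferenceProtein("REF002", "rat", 6.0, 30500.0),
    ReferenceProtein("REF003", "rat", 5.1, 45000.0),
    ReferenceProtein("REF004", "mouse", 5.2, 30010.0),
]

if __name__ == "__main__":
    for distance, ref in pi_mw_search(5.2, 30000.0, DEMO_REFERENCES,
                                      species="rat", pi_weight=1000.0,
                                      search_range=1000.0):
        print(f"{ref.entry}: {distance:.1f}")
```

With `pi_weight=1000.0`, a difference of one pI unit counts as much as a difference of 1000 Da, so that neither axis dominates merely because of its units.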

According to another aspect of the present invention, there is provided a protein expression ratio variation analysis method comprising: (a) defining at least two different experimental conditions and a protein expression variation range; (b) extracting quantitative protein information for the defined experimental conditions from a first database storing experimental proteome data obtained through experiments; (c) calculating protein expression ratio variations of the extracted quantitative protein information for the defined experimental conditions; and (d) screening proteins having a protein expression ratio variation within the defined protein expression variation range.
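Steps (a) through (d) of this analysis can be sketched as follows. The sketch assumes that the expression ratio variation is the quotient of the spot quantities under two conditions; the text does not fix the formula, so this is only one plausible reading. The function name `screen_by_expression_ratio`, the dictionary layout, and the sample quantities are hypothetical.

```python
def screen_by_expression_ratio(quantities, cond_a, cond_b, min_ratio, max_ratio):
    """Return {spot_id: ratio} for spots whose cond_b/cond_a ratio is in range.

    quantities maps a spot identifier to {condition name: quantity}.
    Spots missing either condition, or with a non-positive baseline,
    are skipped because no ratio can be formed for them.
    """
    selected = {}
    for spot_id, by_condition in quantities.items():
        base = by_condition.get(cond_a)
        other = by_condition.get(cond_b)
        if base is None or other is None or base <= 0:
            continue
        ratio = other / base
        if min_ratio <= ratio <= max_ratio:
            selected[spot_id] = ratio
    return selected


# Invented quantities for two hypothetical conditions.
DEMO_QUANTITIES = {
    "spot1": {"young": 100.0, "old": 250.0},
    "spot2": {"young": 80.0, "old": 88.0},
    "spot3": {"young": 0.0, "old": 40.0},
}

if __name__ == "__main__":
    print(screen_by_expression_ratio(DEMO_QUANTITIES, "young", "old", 2.0, 10.0))
```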

According to another aspect of the present invention, there is provided a similar expression pattern search method comprising: (a) selecting a protein of interest for a similar expression pattern search; (b) extracting quantitative protein profile information on a plurality of proteins from a first database storing experimental proteome data obtained through experiments; (c) calculating the Euclidian distance between the protein of interest and each of the proteins stored in the first database using the quantitative protein profile information of the protein of interest and the extracted quantitative protein profile information; and (d) sorting and outputting the proteins stored in the first database in order of increasing Euclidian distance.

According to another aspect of the present invention, there is provided a method for hierarchically clustering proteins by similarity in protein expression profile, the method comprising: (a) defining an experimental condition for clustering; (b) extracting the quantitative protein information of proteins for the experimental condition from a first database storing experimental proteome data obtained through experiments; (c) calculating the Euclidian distance between all pairs of proteins using the extracted quantitative protein information; (d) hierarchically clustering the proteins using the calculated Euclidian distances, wherein a smaller Euclidian distance indicates a higher similarity in expression profile; and (e) displaying the clustered result.

Brief Description of the Drawings

FIG. 1 is a block diagram of a system for proteome analysis and data management according to an embodiment of the present invention;

FIG. 2 systematically illustrates the functions of the system for proteome analysis and data management shown in FIG. 1;

FIG. 3 illustrates the kinds of information stored and the correlation thereof in a reference proteome database and an experimental proteome database shown in FIG. 1;

FIG. 4 is a flowchart illustrating a data storage/edition/deletion function performed in a data management unit shown in FIG. 1;

FIGS. 5 through 7 show program execution windows for performing a data storage/edition/deletion function on a project basis according to the method illustrated in FIG. 4;

FIG. 8 is a flowchart illustrating a method for identifying proteins in a proteome identification unit shown in FIG. 1;

FIG. 9 shows an example of an initial search window for proteome search according to an embodiment of the present invention;

FIG. 10 is a flowchart illustrating a proteome search function performed in a proteome search unit shown in FIG. 1;

FIG. 11 shows a keyword search window opened upon selection of a keyword search option 310 shown in FIG. 9, and FIG. 12 shows a window displaying the results of the keyword search performed according to the procedure illustrated in FIG. 10;

FIGS. 13 and 14 show windows displaying the results of an image search performed according to the procedure illustrated in FIG. 10 by selecting an image search option 320 shown in FIG. 9;

FIG. 15 shows a pI/MW similarity search window opened upon selection of a pI/MW similarity search 342 shown in FIG. 9, and FIG. 16 shows a window displaying the results of the pI/MW similarity search performed according to the procedure illustrated in FIG. 10;

FIG. 17 is a flowchart illustrating a pI/MW similarity search method according to an embodiment of the present invention;

FIG. 18 is a diagram illustrating a Euclidean distance calculating method used in a pI/MW similarity search according to the present invention;

FIG. 19 shows a PMF similarity search window opened upon selection of a PMF similarity search shown in FIG. 9, and FIG. 20 shows a window displaying the results of the PMF similarity search;

FIG. 21 shows a sequence similarity search window opened upon selection of a sequence similarity search 346 shown in FIG. 9, and FIG. 22 shows a window displaying the results of the sequence similarity search;

FIG. 23 illustrates a protein expression profile analysis function performed in a proteome analysis unit shown in FIG. 1;

FIG. 24 is a flowchart illustrating a protein expression ratio variation analysis function shown in FIGS. 2 and 23;

FIG. 25 shows a window for inputting data for the protein expression ratio variation analysis illustrated in FIG. 24 and displaying the results of the protein expression ratio variation analysis;

FIG. 26 shows an example of quantitative protein information used for the protein expression ratio variation analysis as illustrated in FIG. 23;

FIG. 27 is a flowchart illustrating a similar expression pattern search function illustrated in FIGS. 2 and 23;

FIG. 28 is a window displaying the results of the similar expression pattern search performed according to the method illustrated as an embodiment of the present invention in FIG. 27;

FIG. 29 is a flowchart illustrating a hierarchical clustering method using protein expression profiles illustrated in FIGS. 2 and 23; and

FIG. 30 is a window showing the results of the hierarchical clustering using protein expression profiles according to an embodiment of the present invention.

Best mode for carrying out the Invention

FIG. 1 is a block diagram of a system for proteome analysis and data management according to an embodiment of the present invention. Referring to FIG. 1, a proteome analysis and data management system according to the present invention includes at least one client 100a, 100b, ..., and 100z connected to a network 10 and a proteome analysis and data management server 200 providing the clients 100a, 100b, ..., and 100z with proteome analysis services. The proteome analysis and data management server 200 includes an interface 210, a proteome identification unit 220, a proteome search unit 230, a proteome analysis unit 240, a reference proteome database (DB) 250, an experimental proteome DB 260, a data management unit 270, a flat file conversion unit 280, and an extensible markup language (XML) formation unit 290. Before the functions of these constituent elements are described, the functions provided by the proteome analysis and data management system will be described with reference to FIG. 2.

FIG. 2 systematically illustrates the functions of the system for proteome analysis and data management shown in FIG. 1. Referring to FIG. 2, a proteome analysis and data management function 2000 of the system according to the present invention is roughly classified into a proteome identification function 2200, a proteome search function 2300, a proteome analysis function 2400, and a data management function 2700.

The proteome search function 2300 performed in the proteome search unit 230 is further classified into a keyword search function 3100, an image search function 3200, and an advanced search function 3400. The advanced search function 3400 is further classified into an isoelectric point (pI)/molecular weight (MW) similarity search function 3420, a peptide mass fingerprinting (PMF) similarity search function 3440, and a sequence similarity search function 3460. The pI/MW similarity search function 3420 searches for proteins similar to an unknown protein using its isoelectric point and molecular weight in order to identify the unknown protein, without the need for costly mass spectrometry equipment or skilled operators.

The proteome analysis function 2400 performed in the proteome analysis unit 240 is further classified into a protein expression profile analysis function 2410, a comparative image analysis function 2420, an advanced search result analysis function 2430, a post-translational modification analysis function 2440, and a protein interaction analysis function 2450. The protein expression profile analysis function 2410 is further classified into a protein expression ratio variation analysis function 2411, a similar expression pattern search function 2412, and a clustering function 2413.

The data management function 2700 performed in the data management unit 270 is classified into an image analysis result loading function 2710, a data storage/edition/deletion function 2720, a flat file conversion function 2800, and an XML construction function 2900. The data management function 2700 is performed on a project basis, as indicated by reference numerals 2701, 2702, and 270n in FIG. 2. In particular, the image analysis result loading function 2710 loads a gel image analysis result file produced, for example, by an external program such as Bio-Rad's PDQuest software, or by the proteome analysis and data management system itself.

The gel image analysis result is loaded in a predetermined file format, for example, as a Microsoft Excel file. Gel image data common to separate projects is not stored in duplicate; instead, the single stored copy is referred to by the other projects. As a result, storage space in the database is conserved.
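One way in which common gel image data might be stored once and referred to from several projects is sketched below. The text does not state how common data is recognized; the use of a SHA-256 content digest as the shared key is an assumption made here, and the class `GelImageStore` and its method names are hypothetical.

```python
import hashlib


class GelImageStore:
    """Stores each distinct gel image once; projects hold references to it."""

    def __init__(self):
        self._images = {}        # content digest -> image bytes (one copy)
        self._project_refs = {}  # (project_id, gel_id) -> content digest

    def add(self, project_id, gel_id, image_bytes):
        # Assumption: identical bytes identify a common gel image.
        digest = hashlib.sha256(image_bytes).hexdigest()
        self._images.setdefault(digest, image_bytes)
        self._project_refs[(project_id, gel_id)] = digest
        return digest

    def get(self, project_id, gel_id):
        return self._images[self._project_refs[(project_id, gel_id)]]

    def stored_image_count(self):
        return len(self._images)


if __name__ == "__main__":
    store = GelImageStore()
    store.add("project_a", "gel1", b"placeholder image bytes")
    store.add("project_b", "gel9", b"placeholder image bytes")
    print(store.stored_image_count())  # two references, one stored copy
```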

The functions of the constituent elements in the client/server-based proteome analysis and data management system according to the present invention will be described with reference to FIGS. 1 and 2.

The interface 210 receives one of experimental proteome data and a search parameter from the clients 100a, 100b, ..., and 100z to transmit the received data to the proteome analysis and data management server 200 and receives a proteome analysis result and a proteome search result from the proteome analysis and data management server 200 to transmit the received results to the clients 100a, 100b, ..., and 100z. The data management unit 270 connected to the interface 210 provides the image analysis result loading function 2710 to load a gel image analysis result, the data storage/edition/deletion function 2720 to store, edit, or delete data received from the clients 100a, 100b, ..., and 100z on a project basis, the flat file conversion function 2800, and the XML construction function 2900.

The flat file conversion unit 280 converts the flat file format of the reference proteome DB 250 under the control of the data management unit 270, so that reference proteome data can be input in a format compatible with the format of the experimental proteome DB 260. The XML formation unit 290 formats the proteome data stored in the reference proteome DB 250 and the experimental proteome DB 260 into an XML format under the control of the data management unit 270, so that the proteome data in the two DBs 250 and 260 can be exchanged with and integrated into data stored in another system. As such, the data management unit 270 manages the reference proteome DB 250 and the experimental proteome DB 260.
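The conversion of a flat-file entry and its formatting as XML can be sketched as follows. The parser handles only a simplified subset of a SWISS-PROT-style record (the `ID`, `AC`, `DE`, `OS`, and `SQ` line codes and the `//` terminator). The XML element names, the function names `parse_flat_record` and `record_to_xml`, and the sample record are invented for illustration and are not taken from the system described here.

```python
import xml.etree.ElementTree as ET


def parse_flat_record(text):
    """Parse one simplified flat-file record into a dictionary."""
    record = {"id": "", "accessions": [], "description": "",
              "species": "", "sequence": ""}
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        code, data = line[:2], line[5:].strip()
        if in_sequence:
            # Sequence lines contain residues in space-separated blocks.
            record["sequence"] += "".join(line.split())
        elif code == "ID":
            record["id"] = data.split()[0]
        elif code == "AC":
            record["accessions"] += [a.strip() for a in data.split(";") if a.strip()]
        elif code == "DE":
            record["description"] = (record["description"] + " " + data).strip()
        elif code == "OS":
            record["species"] = (record["species"] + " " + data).strip()
        elif code == "SQ":
            in_sequence = True             # following lines hold the sequence
    return record


def record_to_xml(record):
    """Format a parsed record as an XML string (hypothetical element names)."""
    protein = ET.Element("protein", {"id": record["id"]})
    for accession in record["accessions"]:
        ET.SubElement(protein, "accession").text = accession
    ET.SubElement(protein, "description").text = record["description"]
    ET.SubElement(protein, "species").text = record["species"]
    ET.SubElement(protein, "sequence").text = record["sequence"]
    return ET.tostring(protein, encoding="unicode")


# Invented record; not a real database entry.
SAMPLE_RECORD = """ID   EXAMPLE_RAT   STANDARD;   PRT;   20 AA.
AC   X00001;
DE   Hypothetical example protein.
OS   Rattus norvegicus (Rat).
SQ   SEQUENCE   20 AA;
     MKTAYIAKQR QISFVKSHFS
//
"""

if __name__ == "__main__":
    print(record_to_xml(parse_flat_record(SAMPLE_RECORD)))
```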

The reference proteome DB 250 is a protein sequence DB storing a huge amount of validated proteome-related data. Examples of databases that can be used for the reference proteome DB 250 include the SWISS-PROT database, which stores protein-related information on, for example, protein sequence, function, structure, domains, and post-translational modification, as well as the PIR (Protein Information Resource), InterPro, and TrEMBL databases.

The experimental proteome DB 260 is a database storing proteome data obtained through experiments. For example, the experimental proteome DB 260 may be a rat liver aging proteome database. The experimental proteome DB 260 is set up as follows.

When a user inputs experimental protein data, the input experimental protein data are transmitted to the proteome identification unit 220 via the data management unit 270. The proteome identification unit 220 identifies an experimental protein of interest based on 2-D gel images, PMF, molecular weight, isoelectric point, and other protein-related information input from the data management unit 270, through a comparison with the reference proteome data stored in the reference proteome DB 250. When the experimental protein is identified as one of the reference proteins based on the reference proteome data stored in the reference proteome DB 250, the data management unit 270 stores data on the experimental protein in the experimental proteome DB 260 together with the entry numbers of the corresponding reference data stored in the reference proteome DB 250. Once the experimental proteome DB 260 has been set up, the experimental proteome DB 260 and the reference proteome DB 250 are connected to each other. As a result, the proteome search unit 230 can perform a search using the data stored in the experimental proteome DB 260, and the proteome analysis unit 240 can analyze the data. As described later, the experimental proteome DB 260 is constructed and managed on a project basis.

The proteome search unit 230 performs the keyword search function 3100, the image search function 3200, and the advanced search function 3400 using search parameters, such as 2-D gel images, PMF, isoelectric point, molecular weight, and protein sequence information, input from the user, and searches the experimental proteome DB 260 and the reference proteome DB 250 to retrieve the corresponding data therefrom. For example, when an arbitrary search parameter is input by a user, the proteome search unit 230 responds by searching the experimental proteome DB 260 for particular proteome data and extracts detailed information on the corresponding proteome data from the reference proteome DB 250 with reference to the entry numbers of reference proteome data.

The proteome analysis unit 240 performs a proteome analysis function 2400, such as the protein expression profile analysis function 2410, the comparative image analysis function 2420, the advanced search result analysis function 2430, the post-translational modification analysis function 2440, and the protein interaction analysis function 2450. The protein expression profile analysis function 2410, a function of analyzing protein expression patterns for different experimental conditions, which are obtained through 2-D electrophoresis, and similar expression patterns, is further classified into the protein expression ratio variation analysis function 2411, the similar expression pattern search function 2412, and the clustering function 2413. Through the protein expression profile analysis function 2410, protein expression variations through 2-D electrophoresis between different experimental conditions can be analyzed, and similar expression profile patterns can be searched for and then hierarchically clustered. The protein expression profile analysis function 2410 will be described later with reference to FIGS. 23 through 30.
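The similar expression pattern search and the hierarchical clustering mentioned above both rest on Euclidean distances between expression profiles, and can be sketched as follows. The sketch is illustrative only. Average linkage is assumed for the clustering because no linkage rule is stated here, and the function names, the cluster labels, and the sample profiles are hypothetical.

```python
import math


def profile_distance(a, b):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similar_expression_search(target_id, profiles):
    """Rank all other proteins by increasing distance to the target profile."""
    target = profiles[target_id]
    ranked = [(profile_distance(target, profile), protein_id)
              for protein_id, profile in profiles.items()
              if protein_id != target_id]
    ranked.sort()
    return ranked


def hierarchical_clusters(profiles):
    """Agglomerative clustering; returns the merge history.

    Assumption: average linkage (mean pairwise distance between members).
    Each history item is (label_a, label_b, distance_at_merge).
    """
    clusters = {protein_id: [protein_id] for protein_id in profiles}
    merges = []
    while len(clusters) > 1:
        best = None
        labels = sorted(clusters)
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                distances = [profile_distance(profiles[m], profiles[n])
                             for m in clusters[a] for n in clusters[b]]
                average = sum(distances) / len(distances)
                if best is None or average < best[0]:
                    best = (average, a, b)
        distance, a, b = best
        # Replace the two closest clusters with their union.
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, distance))
    return merges


# Invented profiles: quantities of three spots under three conditions.
DEMO_PROFILES = {
    "s1": [1.0, 2.0, 3.0],
    "s2": [1.0, 2.0, 4.0],
    "s3": [10.0, 10.0, 10.0],
}

if __name__ == "__main__":
    print(similar_expression_search("s1", DEMO_PROFILES))
    print(hierarchical_clusters(DEMO_PROFILES))
```

The merge history is sufficient to draw a dendrogram: the proteins merged first, at the smallest distance, have the most similar expression profiles.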

The comparative image analysis function 2420 is a function of comparing at least two 2-D gel images and analyzing the difference between the 2-D gel images. The advanced search result analysis function 2430 is a function of analytically characterizing a protein of interest using the advanced search result. The post-translational modification analysis function 2440 is a function of analyzing the difference between the experimental data of proteins and the theoretical data of reference proteins, which are considered to be similar to the experimental protein, stored in the reference proteome DB 250, for example, a difference between the experimental and theoretical isoelectric points and a difference between the experimental and theoretical molecular weights in order to provide basic information on a post-translational modification. The protein interaction analysis function 2450 is a function of analyzing the interaction between at least two proteins to characterize a protein of interest. Through the above-described various kinds of analysis functions, all kinds of proteins expressed in cells can be integrally identified and characterized.
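The comparison underlying the post-translational modification analysis function 2440 can be sketched as a pair of differences between experimental and theoretical values. The function name `ptm_indicators` and the two tolerance values are hypothetical; they are placeholders chosen for this example and are not values taken from the system.

```python
def ptm_indicators(exp_pi, exp_mw, theo_pi, theo_mw,
                   pi_tolerance=0.2, mw_tolerance=50.0):
    """Report pI and MW differences that may hint at a modification.

    The tolerances are assumed placeholder thresholds, not specified values.
    """
    delta_pi = exp_pi - theo_pi
    delta_mw = exp_mw - theo_mw
    return {
        "delta_pi": delta_pi,
        "delta_mw": delta_mw,
        "pi_shifted": abs(delta_pi) > pi_tolerance,
        "mw_shifted": abs(delta_mw) > mw_tolerance,
    }


if __name__ == "__main__":
    # Invented values for one spot and its matched reference protein.
    print(ptm_indicators(5.5, 30080.0, 6.0, 30000.0))
```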

As described above, the experimental proteome DB 260 is constructed by storing the experimental data and the entry numbers of the corresponding reference data stored in the reference proteome DB 250, rather than the entire corresponding reference data. Data models of the experimental proteome DB 260 and the reference proteome DB 250 are as follows.

FIG. 3 illustrates the kinds of information stored and the correlation thereof in the reference proteome DB 250 and the experimental proteome DB 260 shown in FIG. 1. In FIG. 3, arrows indicate the information tables that are being referred to. Referring to FIG. 3, the reference proteome DB 250 includes a reference database information table (DB_REFERENCE) 2501, a reference literature information table (REFERENCE) 2502, a protein annotation table (PROTEIN_ANT) 2503, a comment information table (COMMENTS) 2504, a proteome sequence information table (SEQ_VALUE) 2505, and a proteome feature information table (FEATURE) 2506. These kinds of information are categorized from a huge amount of validated reference data and stored in the reference proteome DB 250. In addition, name of protein, name of gene, name of species, individual classification, name of organelle, keyword information, protein sequence number, and other information may be included. The proteome sequence information table 2505 stores the proteome sequence information. The theoretical isoelectric point and molecular weight of all reference proteins are calculated using the proteome sequence information and stored in the protein annotation information table 2503. The theoretical isoelectric points and molecular weights stored in the protein annotation information table 2503 are used in a pI/MW similarity search described later to search for proteins similar to a particular protein having a predetermined isoelectric point and molecular weight.
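How a theoretical molecular weight might be derived from stored sequence information can be sketched as follows. The calculation method is not specified here, so the sketch assumes the common approach of summing average residue masses and adding one water molecule. The mass values are standard average residue masses rounded to two decimals, and the function name `theoretical_mw` is hypothetical. A theoretical isoelectric point would additionally require a set of pKa values and is not shown.

```python
# Average residue masses in Da (amino acid minus water), rounded.
AVERAGE_RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER_MASS = 18.02  # one water for the free N- and C-termini


def theoretical_mw(sequence):
    """Approximate molecular weight of an unmodified protein sequence."""
    residues = sequence.upper()
    unknown = set(residues) - set(AVERAGE_RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    return sum(AVERAGE_RESIDUE_MASS[r] for r in residues) + WATER_MASS


if __name__ == "__main__":
    # Invented 20-residue sequence, for illustration only.
    print(round(theoretical_mw("MKTAYIAKQRQISFVKSHFS"), 2))
```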

The experimental proteome DB 260 includes a protein information table (PROTEIN_INFO) 2601, a project information table (PROJECT_INFO) 2602, a project user information table (PROJECT_USER_INFO) 2603, a user information table (USER_INFO) 2604, a normal gel information table (NORM_GEL_INFO) 2605, a normal gel image information table (NORM_GEL_IMAGE) 2606, a standard gel information table (STD_GEL_INFO) 2607, a standard gel image information table (STD_GEL_IMAGE) 2608, a normal spot information table (NORM_SPOT_INFO) 2609, and a standard spot information table (STD_SPOT_INFO) 2610.

In particular, the protein information table (PROTEIN_INFO) 2601 links the experimental proteome DB 260 to the reference proteome DB 250 and is used to manage detailed information on spots in 2-D gel images and identified (reference) proteins. The protein information table (PROTEIN_INFO) 2601 stores the entry numbers of identified protein data stored in the reference proteome DB 250, rather than storing the entire corresponding protein related reference data. To this end, the protein information table (PROTEIN_INFO) 2601 includes a project identifier, a standard spot identifier, an identified protein annotation identifier, an annotation identifier for identified proteins, and other information.

The project information table (PROJECT_INFO) 2602 is used to manage information on a plurality of projects on different research subjects, and particularly, to manage, search, and analyze the experimental data for each research subject based on the project. The project information table (PROJECT_INFO) 2602 includes a project identifier; name of project; project start date; project end date; names of researchers (members) involved; status of project; experimental parameters, such as time, diet therapies, etc.; comments; whether or not to open the project to the public; name of species used; name of genus used; experimental methods; and other information.

The project user information table (PROJECT_USER_INFO) 2603 is used to manage information on users who are authorized or are directly involved in a project. The project user information table (PROJECT_USER_INFO) 2603 includes a project identifier, a user identifier, duties of user, and other information.

The user information table (USER_INFO) 2604 is used to manage information on the users involved in a project. The user information table (USER_INFO) 2604 includes a user identifier; a password; name of user; position of user; degree of user's authority ranging, for example, from level 1 to level 5; descriptions on user; and other information.

The normal gel information table (NORM_GEL_INFO) 2605 is used to manage detailed information on a normal gel image integrated from a plurality of 2-D gel images obtained through electrophoresis. The normal gel information table (NORM_GEL_INFO) 2605 includes a project identifier; a gel identifier; experimental parameters, such as time, diet therapies, etc.; description on gel used; data unload date; last gel image data process date; a flag indicating whether the image has been processed or not; and other information.

The normal gel image information table (NORM_GEL_IMAGE) 2606 is used to manage the normal gel image. The normal gel image information table (NORM_GEL_IMAGE) 2606 includes a gel identifier, a project identifier, a gel image file, a gel image file format, such as TIFF, GIF, or JPG, and other information.

The standard gel information table (STD_GEL_INFO) 2607 is used to manage detailed information on individual gel images constituting the normal gel image. The standard gel information table (STD_GEL_INFO) 2607 includes a standard gel identifier; a project identifier; the largest and smallest molecular weights in a gel of interest; the largest and smallest isoelectric points; the slope between isoelectric points; gel formation date; and other information.

The standard gel image information table (STD_GEL_IMAGE) 2608 is used to manage individual gel images. The standard gel image information table (STD_GEL_IMAGE) 2608 includes a standard gel identifier, a project identifier, a gel image file, a gel image file format, such as TIFF, GIF, or JPG, and other information.

The normal spot information table (NORM_SPOT_INFO) 2609 is used to manage detailed information on a normal spot image of a plurality of spots that is integrated from individual gel images. The normal spot information table (NORM_SPOT_INFO) 2609 includes a spot identifier; a gel identifier; a project identifier; a standard spot identifier; x- and y-coordinates of spot; spot intensity information; molecular weight and isoelectric point of spot; spotting date; PMF information on spot; and other information.

The standard spot information table (STD_SPOT_INFO) 2610 is used to integrally manage detailed information on individual spots constituting the normal spot image. The standard spot information table (STD_SPOT_INFO) 2610 includes a standard spot identifier; a standard gel identifier; a project identifier; average of the x-coordinates of individual spots; average of the y-coordinates of individual spots; quantitative information on individual spots; the molecular weights and isoelectric points of individual spots; and other information.

The experimental proteome DB 260 having the above-described configuration manages the experimental data on a project basis. Therefore, a user can store and edit data to comply with a current project, and only a portion of the database that relates to the current project can be searched during an analysis of particular data. Common gel image data between different projects are not stored in duplicate in the experimental proteome DB 260, and the one stored common gel image is referred to for another project. As a result, the amount of used memory space of the database can be conserved. A method for building up the experimental proteome DB 260 having the configuration as described above will be described below.

FIG. 4 is a flowchart illustrating a data storage/edition/deletion function 2720 (see FIG. 2) performed in the data management unit 270 shown in FIG. 1. In particular, FIG. 4 illustrates a method for storing data in the experimental proteome DB 260 and a method for editing the protein data previously stored in the experimental proteome DB 260. FIGS. 5 through 7 show program execution windows for performing a data storage/edition/deletion function on a project basis according to the method illustrated in FIG. 4.

Referring to FIG. 4, it is determined whether to create a new project file in the experimental proteome DB 260 (step 2721). If it is determined in step 2721 to create a new project file, a window for the new project is opened, as shown in FIG. 5, and information on the new project, i.e., for the information tables 2602 through 2604 shown in FIG. 3, is input (step 2722). Information required to create a new project file includes a project identifier; name of project; project start date; project end date; names of researchers (members) involved; status of project; experimental parameters, such as time, diet therapies, etc.; comments; whether or not to open the project to the public; name of species used; name of genus used; experimental methods; and other information.

Next, it is determined whether to retrieve the experimental data of a previous project file that are common to the new project (step 2723). If it is determined in step 2723 to retrieve the common experimental data of the previous project file, the common experimental data are loaded onto the new project file (step 2724). For example, when there is a common predetermined image file between two projects, instead of storing the common image file for each of the projects in the experimental proteome DB 260, the image file used in the previous project is loaded for the new project. As a result, the amount of used memory space of the experimental proteome DB 260 can be conserved.
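The sharing of common gel images between projects (steps 2723 and 2724) can be sketched as follows. This is a minimal illustration only: the class `GelImageStore`, its method names, and the use of a content hash as the image key are assumptions made for the example and are not taken from the described system, which may key images differently.

```python
import hashlib


class GelImageStore:
    """Hypothetical in-memory stand-in for the gel image tables."""

    def __init__(self):
        self.images = {}          # image_id -> image bytes, stored once
        self.project_images = {}  # project_id -> set of referenced image_ids

    def add_image(self, project_id, data):
        """Store an image once and reference it from the given project."""
        # Assumption: a SHA-256 content hash identifies identical images.
        image_id = hashlib.sha256(data).hexdigest()
        self.images.setdefault(image_id, data)
        self.project_images.setdefault(project_id, set()).add(image_id)
        return image_id

    def load_common(self, src_project, dst_project):
        """Reference the previous project's images from the new project
        (cf. step 2724) without copying the image data."""
        shared = self.project_images.get(src_project, set())
        self.project_images.setdefault(dst_project, set()).update(shared)
        return len(shared)
```

With this arrangement, loading the images of a previous project into a new project adds only references, so the stored image data does not grow.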

If it is determined in step 2721 not to create a new project file, it is determined whether to edit a previous project file stored in the experimental proteome DB 260 (step 2725). If it is determined in step 2725 to edit a previous project file, a project file to be edited is selected among a plurality of previous project files displayed on a project list window, as shown in FIG. 6 (step 2726). Once an arbitrary previous project file has been selected, a window showing information on the selected previous project file is displayed, as shown in FIG. 7, to allow a user to edit desired data, for example, for the information tables 2606 through 2610 of FIG. 3 (step 2727). Such editing of data can be performed regardless of the type of data, including numeric data, symbolic data, and image file data.

The experimental protein data stored in the experimental proteome DB 260 on a project basis according to the method as described above are input to the proteome identification unit 220 via the data management unit 270. The proteome identification unit 220 identifies an experimental protein of interest on a project basis and stores the identified result in the experimental proteome DB 260, i.e., the protein information table 2601 shown in FIG. 3. These processes of identifying a protein of interest and storing the identified result will be described in detail below.

FIG. 8 is a flowchart illustrating a method for identifying proteins in the proteome identification unit 220 shown in FIG. 1. Referring to FIG. 8, the proteome identification unit 220 initially receives experimental data from the data management unit 270 (step 2210). When data required to identify the experimental proteins are retrieved through an advanced search (step 2220), the proteome identification unit 220 identifies proteins of interest based on the searched result (step 2230). Next, it is determined whether the identification of the proteins of interest has been completed (step 2240).

Once the experimental proteins of interest have been completely identified, the data management unit 270 stores the experimental protein data and the identified result in the experimental proteome DB 260, i.e., the protein information table 2601 of FIG. 3 (step 2250).

Through these processes the experimental proteome DB 260 is built up.

The experimental proteome DB 260 built up as described above can be applied for a keyword search, an image search, a protein expression pattern search, and an advanced search, which is a kind of combination search of the foregoing search techniques. The experimental proteome DB 260 can be used for a proteome (reference) data search in connection with the reference proteome DB 250. According to the present invention, a proteome search is performed as follows.

FIG. 9 shows an example of an initial search window 300 for proteome analysis according to an embodiment of the present invention. Referring to FIG. 9, the initial search window 300 providing a graphic user interface for data search includes a search menu for a keyword search option 310, an image search option 320, and an advanced search option 340. The advanced search option 340 provides a list of choices for pI/MW similarity search 342, PMF similarity search 344, and sequence similarity search 346.

FIG. 10 is a flowchart illustrating a proteome search function 2300 performed in the proteome search unit 230 shown in FIG. 1. Referring to FIG. 10, a search method is selected among a keyword search option, an image search option, and an advanced search option (step 3010).

If a keyword search option is selected in step 3010, a keyword for searching is received from a user (step 3110). Next, protein information relating to the received keyword is searched for (step 3120), and the searched result is displayed (step 3500).

If an image search option is selected in step 3010, a desired 2-D gel image is selected among a plurality of reference images stored in the reference proteome DB and is loaded for search (step 3210). Next, a spot of protein of interest is designated on the selected 2-D gel image by the user (step 3220). For a more accurate designation of the spot of protein, the present invention provides a zoom-in function of magnifying and displaying a region around the protein spot. This function allows a user to more accurately designate a spot of protein which is to be searched for. Next, information on the designated protein spot is searched for (step 3230), and the searched result is displayed (step 3500).

If an advanced search option is selected in step 3010, the process goes to step 3410 for selecting a detailed advanced search parameter. If a pI/MW similarity search is selected, the pI/MW data of an experimental protein of interest are received from the user (step 3420), proteins having a similar pI/MW to the experimental protein are searched for (step 3422), and the searched result is displayed (step 3500). If a PMF similarity search is selected in step 3410, the PMF data of an experimental protein of interest are received from the user (step 3441), proteins having a similar PMF to the experimental protein are searched for (step 3442), and the searched result is displayed (step 3500). If a sequence similarity search is selected in step 3410, the protein sequence information of an experimental protein of interest is received from the user (step 3461), proteins having a similar sequence to the experimental protein are searched for (step 3462), and the searched result is displayed (step 3500).

As described above, the proteome analysis and data management system according to the present invention provides detailed data on experimental proteins through keyword and image searches and identifies experimental proteins by analyzing the isoelectric point, molecular weight, PMF, or protein sequence information thereof through an advanced search.

FIG. 11 shows a keyword search window opened upon selection of the keyword search option 310 shown in FIG. 9, and FIG. 12 shows a window displaying the results of the keyword search performed according to the procedure illustrated in FIG. 10. If a user designates the isoelectric point and molecular weight of a protein of interest, the name of the protein, and a search range, the searched results are displayed as a tree view, as shown in FIG. 12. Then, if the user clicks on an arbitrary one of the searched proteins displayed on the screen, proteome information on the selected protein, including a 2-D gel image, the detailed features and sequence of the protein, comments, literature references, and species information, is searched for throughout the reference proteome DB 250 and displayed.

FIGS. 13 and 14 show windows displaying the results of an image search performed according to the procedure illustrated in FIG. 10 by selecting the image search option 320 shown in FIG. 9. As shown in FIG. 13, when a user designates an arbitrary protein (expressed as a spot) on a displayed gel image, a proteome search is performed on the spot designated by the user, and detailed information on the spot, including its x- and y-coordinates, is displayed, as shown in FIG. 14. Although not illustrated, if the user clicks on an arbitrary one of the searched proteins displayed on the screen, detailed proteome information on the selected protein, including a 2-D gel image, is searched for throughout the reference proteome DB 250 and displayed.

FIGS. 15 through 22 illustrate advanced searches performed according to the procedure illustrated in FIG. 10 by selecting the advanced search option 340 shown in FIG. 9. According to the present invention, advanced searches are classified into the pI/MW similarity search 342, the PMF similarity search 344, and the sequence similarity search 346 according to the search parameter input.

In particular, FIG. 15 shows a pI/MW similarity search window opened upon selection of the pI/MW similarity search 342 shown in FIG. 9, and FIG. 16 shows a window displaying the results of the pI/MW similarity search performed according to the procedure illustrated in FIG. 10. Referring to FIG. 15, when a user inputs the measured molecular weight (MW) and isoelectric point (pI) of an unknown experimental protein, proteins having a similar molecular weight and isoelectric point to the unknown protein are displayed in order, as shown in FIG. 16. Although not illustrated, if the user clicks on an arbitrary one of the searched proteins displayed on the screen, detailed proteome information on the selected protein, including a 2-D gel image, is searched for throughout the reference proteome DB 250 and displayed. As described above, the pI/MW similarity search for looking for a protein similar to an unknown experimental protein can be achieved by directly inputting the measured molecular weight and isoelectric point of the unknown experimental protein and, for example, an isoelectric point range, a molecular weight range, a ratio of molecular weight and isoelectric point, and name of species to be searched for. Alternatively, the pI/MW similarity search may be performed using a 2-D gel image.

Therefore, two types of user input interfaces are provided for the pI/MW similarity search according to the present invention. One allows a user to directly input the isoelectric point and its range of reference proteins to be searched for, the molecular weight and its range of reference proteins to be searched for, name of species to be searched for, and a ratio of isoelectric point and the logarithm of molecular weight (pI/log(MW)). The other one allows a user to directly click on a spot on a 2-D gel image that is of interest to be searched for. For example, when a user clicks on an arbitrary spot in the image search window as shown in FIG. 13, the isoelectric point and the molecular weight of the selected spot corresponding to its x- and y-coordinates are obtained. Next, proteins having a similar isoelectric point and molecular weight to the spot are searched for based on the isoelectric point and molecular weight of the spot, and displayed in order, as shown in FIG. 16.

According to the spot designating method, the coordinate values of the designated protein spot are transformed into experimental isoelectric point and molecular weight values. Unlike the type of user interface illustrated in FIG. 15, which allows a user to directly type data, an isoelectric point range, a molecular weight range, name of species, and a ratio of isoelectric point and molecular weight are not input by the user. However, the ratio of isoelectric point and molecular weight can be calculated from the experimental isoelectric point and molecular weight of the protein spot. A pI/MW similarity search according to the present invention can be performed using an arbitrary default value and name of species by designating a particular database established through experiments, for example, a rat liver aging database. Through such pI/MW similarity searches, the user can identify unknown proteins to a certain extent.

FIG. 17 is a flowchart illustrating a pI/MW similarity search method according to an embodiment of the present invention. FIG. 18 is a diagram illustrating a Euclidian distance calculating method used in the pI/MW similarity search according to the present invention. Referring to FIG. 17, a pI/MW similarity search method according to an embodiment of the present invention involves determining whether to use a 2-D gel image for the pI/MW similarity search (step 3421). If it is determined to use a 2-D gel image for the pI/MW similarity search, a spot of protein to be searched for is designated on the 2-D gel image (step 3422). Based on the position (x- and y-coordinates) of the selected protein spot, the experimental isoelectric point and molecular weight of the protein spot are obtained (step 3423).

Referring to FIG. 18, the x-axis of the 2-D gel image obtained through electrophoresis represents isoelectric point (pI), whereas the y-axis thereof represents the logarithm of molecular weight. The ratio of x-axis and y-axis is controlled by the ratio of isoelectric point and the logarithm of molecular weight. This will be described in detail later.
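The conversion of a spot position into experimental values (step 3423) might be sketched as below. The text does not give the transform itself, so this example assumes a linear pI gradient along the x-axis, a linear log10(MW) scale along the y-axis with the largest molecular weight at the top (y = 0), and gel bounds such as those held in the standard gel information table. The function name and all parameter names are hypothetical.

```python
import math


def spot_to_pi_mw(x, y, width, height, pi_min, pi_max, mw_min, mw_max):
    """Map pixel coordinates of a spot to (pI, MW).

    Assumptions (not from the source): pI varies linearly with x,
    log10(MW) varies linearly with y, and y grows downward so that
    the top edge of the image corresponds to mw_max.
    """
    pi = pi_min + (x / width) * (pi_max - pi_min)
    log_top = math.log10(mw_max)
    log_bottom = math.log10(mw_min)
    log_mw = log_top - (y / height) * (log_top - log_bottom)
    return pi, 10 ** log_mw


# Example: the centre of a 100 x 100 image spanning pI 3-10 and 10-100 kDa.
pi, mw = spot_to_pi_mw(50, 50, 100, 100, 3.0, 10.0, 10_000, 100_000)
```

A real gel may need a calibrated, non-linear mapping based on marker proteins; the linear form is used here only to make the idea concrete.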

Referring back to FIG. 17, once the isoelectric point and the molecular weight of the selected spot have been obtained, the proteome search unit 230 extracts the theoretical isoelectric points and molecular weights of identified proteins stored in the protein annotation information table 2503 of the reference proteome DB 250 and calculates the Euclidian distance between the protein spot designated by the user and the position of each of the identified proteins on a plane of the logarithm of molecular weight vs. isoelectric point (step 3424). Here, the isoelectric point and the molecular weight of the protein spot are experimental values, whereas those of the identified proteins are theoretical values. The calculated Euclidian distances are sorted in order of increasing Euclidian distance (step 3425), and the searched proteins are displayed in order of increasing Euclidian distance, i.e., in order of decreasing similarity as a result of the pI/MW similarity search (step 3426). Sorting the searched results in step 3425 is performed using a sort function provided for a relational database.

If it is determined in step 3421 not to use a 2-D gel image for the pI/MW similarity search, information for the pI/MW similarity search, such as the isoelectric point and molecular weight of a protein of interest, a search range, name of species, and a ratio of isoelectric point and molecular weight, is directly input by the user (step 3427). Once the information for the pI/MW similarity search has been input by the user, the proteome search unit 230 extracts the theoretical isoelectric points and molecular weights of identified proteins stored in the protein annotation information table 2503 of the reference proteome DB 250 and calculates the Euclidian distance between the protein designated by the user and each of the identified proteins using their isoelectric points and molecular weights. The calculated Euclidian distances are sorted in order of increasing Euclidian distance (step 3425), and the searched proteins are displayed in order of increasing Euclidian distance, i.e., in order of decreasing similarity as a result of the pI/MW similarity search (step 3426). A method for calculating the Euclidian distance applied for the pI/MW similarity search according to the present invention will be described with reference to FIG. 18.

The Euclidian distance means the shortest distance between two points in N-dimensional space. Therefore, the Euclidian distance between two points, (P1, M1) and (P2, M2), on a 2-D gel image can be expressed as equation (1) below, and the Euclidian distance between two points, (P1, M1) and (P3, M3), on the 2-D gel image can be expressed as equation (2) below.

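$$\mathrm{dist}_1 = \sqrt{(P_1 - P_2)^2 + \{\mathrm{Ratio} \times \log(M_1) - \mathrm{Ratio} \times \log(M_2)\}^2} \quad \ldots (1)$$

$$\mathrm{dist}_2 = \sqrt{(P_1 - P_3)^2 + \{\mathrm{Ratio} \times \log(M_1) - \mathrm{Ratio} \times \log(M_3)\}^2} \quad \ldots (2)$$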

In equations (1) and (2) above, "Ratio" indicates a ratio of isoelectric point and molecular weight, pI/log(MW), used to adjust the ratio of the x-axis and the y-axis of the 2-D gel image. Alternatively, the reciprocal of the ratio of isoelectric point and molecular weight, i.e., 1/Ratio = log(MW)/pI, may be applied to equations (1) and (2), according to the programming scheme used. As is apparent in FIG. 18, point (P2, M2) is closer to point (P1, M1) than point (P3, M3) is. Accordingly, the Euclidian distance dist1 between two points (P1, M1) and (P2, M2) is smaller than the Euclidian distance dist2 between two points (P1, M1) and (P3, M3). Therefore, point (P2, M2) is determined to have a position of higher similarity to that of point (P1, M1).

In the pI/MW similarity search method according to the present invention, the Euclidian distance between the position of a particular protein which a user wishes to search for, on a plane of the logarithm of molecular weight vs. isoelectric point, and the position of each identified protein stored in the protein annotation information table 2504 of the reference proteome DB 250, is calculated using their isoelectric point and molecular weight based on the above principles. The searched proteins are sorted in order of increasing Euclidian distance. Therefore, the proteome analysis and data management system according to the present invention can display proteins similar to a particular unknown protein of interest in order of decreasing similarity, so that the unknown protein can be identified without the need for costly mass spectrometry equipment and skilled personnel capable of operating the same.

The proteome analysis and data management system according to the present invention can provide a user with basic information on a post-translational modification by comparing the theoretical isoelectric point and theoretical molecular weight of an identified protein with the experimental isoelectric point and molecular weight of a protein which are input by the user. For example, with the assumption that no post-translational modification occurs, the experimental isoelectric point and molecular weight almost match the theoretical isoelectric point and molecular weight, respectively. However, when the experimental isoelectric point and molecular weight are greatly different from the theoretical isoelectric point and molecular weight, respectively, a higher likelihood of post-translational modification is expected. Based on the likelihood of post-translational modification revealed through the pI/MW similarity search according to the present invention, a user can continue to research in depth the problem of post-translational modification that is crucial in the proteome research field.
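The distance calculation and ranking used in the pI/MW similarity search (steps 3424 through 3426) can be illustrated with the following sketch. It assumes a base-10 logarithm, which the text does not specify, and represents the reference proteins as a plain list of `(name, pI, MW)` tuples rather than rows of the protein annotation information table. The function names and sample values are hypothetical.

```python
import math


def pi_mw_distance(pi_a, mw_a, pi_b, mw_b, ratio=1.0):
    """Euclidian distance on the pI vs. log(MW) plane, with the log(MW)
    axis scaled by `ratio` as in equations (1) and (2).
    Assumption: log is taken to base 10."""
    d_pi = pi_a - pi_b
    d_mw = ratio * math.log10(mw_a) - ratio * math.log10(mw_b)
    return math.sqrt(d_pi ** 2 + d_mw ** 2)


def rank_by_similarity(query_pi, query_mw, references, ratio=1.0):
    """Return (distance, name) pairs in order of increasing distance,
    i.e. in order of decreasing similarity."""
    scored = [
        (pi_mw_distance(query_pi, query_mw, pi, mw, ratio), name)
        for name, pi, mw in references
    ]
    return sorted(scored)


# Illustrative values only; not taken from any real database.
references = [("A", 5.0, 20_000), ("B", 7.0, 50_000), ("C", 5.1, 21_000)]
ranking = rank_by_similarity(5.0, 20_000, references)
```

In the described system the sorting is delegated to the relational database; sorting in application code, as here, is simply the shortest way to show the ordering.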

FIG. 19 shows a PMF similarity search window opened upon selection of the PMF similarity search 344 shown in FIG. 9, and FIG. 20 shows a window displaying the results of the PMF similarity search. Referring to FIG. 19, when the PMF data of a protein which the user wishes to search for are input, proteins having similar PMF characteristics to the input protein are searched for and displayed in order of decreasing similarity. Although not illustrated, if the user clicks on an arbitrary one of the searched proteins displayed on the screen as shown in FIG. 20, detailed proteome information on the selected protein, including a 2-D gel image, is searched for throughout the reference proteome DB 250 and displayed.

FIG. 21 shows a sequence similarity search window opened upon selection of the sequence similarity search 346 shown in FIG. 9, and FIG. 22 shows a window displaying the results of the sequence similarity search. Referring to FIG. 21, when a user designates a sequence similarity search program and a database to be searched for proteins similar to a protein of interest having an arbitrary sequence, proteins having a sequence similar to the input protein are displayed in order, as shown in FIG. 22. Although not illustrated, if the user clicks on an arbitrary one of the searched proteins displayed on the screen as shown in FIG. 22, detailed proteome information on the selected protein, including a 2-D gel image, is searched for throughout the reference proteome DB 250 and displayed.

FIG. 23 illustrates a protein expression profile analysis function 2410 performed in the proteome analysis unit 240 shown in FIG. 1. Referring to FIG. 23, a protein expression profile analysis function 2410 is further classified into the protein expression ratio variation analysis function 2411, the similar expression pattern search function 2412, and the clustering function 2413.

The protein expression ratio variation analysis function 2411 is a function of comparing quantitative information extracted from the experimental proteome DB 260 on different experimental conditions designated by the user to calculate protein expression ratio variations between the experimental conditions and searching for and outputting proteins having a protein expression ratio variation within an expression variation range designated by the user. The similar expression pattern search function 2412 is a function of searching for proteins having a similar protein expression pattern to a protein selected by the user by calculating the Euclidian distance between the protein selected by the user and proteins stored in the experimental proteome DB 260 using their quantitative protein expression profile information. The clustering function 2413 is a function of hierarchically clustering proteins of a particular experimental condition designated by the user by similarity in expression pattern using their quantitative protein profile information extracted from the experimental proteome DB 260.

For a protein expression profile analysis, quantitative protein information 241 is extracted from the experimental proteome DB 260. The extracted quantitative protein information 241 is used to analyze protein expression patterns between different experimental conditions (function 2411), to search for proteins having a similar protein expression pattern to a protein of interest (function 2412), or to hierarchically cluster proteins using expression profiles (function 2413).

In connection with the reference proteome DB 250, detailed information on proteins screened as a result 251 of the analyses performed in the proteome analysis unit 240 is searched for throughout the reference proteome DB 250 and displayed. The protein expression profile analysis processes performed in the proteome analysis unit 240 will be described below in detail.

FIG. 24 is a flowchart illustrating a protein expression ratio variation analysis function 2411 shown in FIGS. 2 and 23. FIG. 25 shows a window for inputting data for the protein expression ratio variation analysis illustrated in FIG. 24 and displaying the results of the protein expression ratio variation analysis. FIG. 26 shows an example of quantitative protein information 241 used for the protein expression ratio variation analysis as illustrated in FIG. 23. A method for analyzing protein expression ratio variations for different experimental conditions according to the present invention will be described with reference to FIGS. 24 through 26.

Referring to FIG. 24, experimental conditions for a protein expression variation comparison and a protein expression variation range to be searched for are input (step 24111), via the window as illustrated in FIG. 25. The experimental conditions may be input using a drop down menu so that a user can input data easily. The protein expression variation range may be manually input by the user. When the experimental conditions for an expression variation comparison and the protein expression variation range are input via the window, the quantitative protein information 241 on the input experimental conditions is extracted from the experimental proteome DB 260 (step 24112). The quantitative protein information 241 refers to the intensity of spots on a gel image. Although an example of expressing the quantitative protein information 241 is illustrated in FIG. 26, the quantitative protein information 241 may be expressed in various forms.

Next, a quantitative protein information variation ratio between two experimental conditions compared with each other, which hereinafter will be referred to as an experimental quantitative protein information variation ratio, is calculated (step 24113). The experimental quantitative protein information variation ratio is calculated using equation (3) below.

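$$\text{experimental quantitative protein information variation ratio} = \frac{\text{quantitative protein information for experimental condition 1}}{\text{quantitative protein information for experimental condition 2}} \quad \ldots (3)$$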

In other words, the experimental quantitative protein information variation ratio is a ratio of variations in protein expression between different experimental conditions.

After experimental quantitative protein information variation ratios are calculated using equation (3) above, experimental quantitative protein information variation ratios that are within the protein expression variation range defined by the user are extracted and displayed as shown in a lower portion of the window shown in FIG. 25 (step 24114). The results of the protein expression ratio variation analysis displayed in step 24114 as shown in FIG. 25 include protein spot information, such as protein ID No. and name, experimental quantitative protein information variation ratios between different groups, and whether the experimental quantitative protein information variation ratio changes or not. Such analyzed results may be tabled and color-coded so that an increase and a decrease in protein expression ratio are made more distinct. Accordingly, the user can easily distinguish between proteins having a similar tendency of expression ratio increasing or decreasing.
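Steps 24113 and 24114 can be sketched as follows. The example assumes that the quantitative protein information for each experimental condition is available as a dictionary mapping a spot identifier to its intensity. The function names, the inclusive range test, and the decision to skip spots that are missing or have zero intensity in the second condition are assumptions made for illustration.

```python
def expression_ratios(cond1, cond2):
    """Compute the variation ratio of equation (3) for each spot:
    intensity under condition 1 divided by intensity under condition 2.

    Assumption: spots absent from cond2, or with zero intensity there,
    are skipped because the ratio is undefined.
    """
    ratios = {}
    for spot_id, q1 in cond1.items():
        q2 = cond2.get(spot_id)
        if q2:
            ratios[spot_id] = q1 / q2
    return ratios


def filter_by_range(ratios, low, high):
    """Keep only spots whose ratio lies within the user-defined range
    (bounds treated as inclusive in this sketch)."""
    return {s: r for s, r in ratios.items() if low <= r <= high}


# Illustrative intensities only.
cond1 = {"s1": 200.0, "s2": 50.0, "s3": 90.0, "s4": 10.0}
cond2 = {"s1": 100.0, "s2": 100.0, "s3": 100.0, "s4": 0.0}
ratios = expression_ratios(cond1, cond2)
upregulated = filter_by_range(ratios, 1.5, 3.0)
```

A ratio above 1 indicates higher expression under condition 1 and a ratio below 1 indicates lower expression, which is the distinction the color-coded result table is meant to make visible.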

In a protein expression ratio variation analysis method according to the present invention, optionally the analyzed results may be linked to the reference proteome DB 250 (step 24115). Then, it is determined whether to search the reference proteome DB 250 for detailed information on the proteins extracted through the analysis (step 24116). Next, the reference proteome DB 250 is searched for detailed information on the proteins extracted through the analysis (step 24117).

Therefore, the user can analyze the protein expression patterns between different experimental conditions by calculating quantitative protein information variation ratios. In addition, the user can search for detailed information on the proteins screened through the protein expression pattern analysis, if necessary, by clicking on a desired protein of interest in the list of the screened proteins. Detailed information on reference proteins is provided by an owner of the reference proteome DB 250. The reference proteome DB 250 provides users with a huge amount of validated protein related information useful for protein analysis.

FIG. 27 is a flowchart illustrating a similar expression pattern search function 2412 illustrated in FIGS. 2 and 23. FIG. 28 is a window displaying the results of the similar expression pattern search performed according to the method illustrated as an embodiment of the present invention in FIG. 27.

Referring to FIG. 27, a protein of interest is selected (step 24121). Next, it is determined whether an expression profile of the protein has been input (step 24122). If it is determined that the expression profile of the protein has been input, quantitative profile information on other proteins is extracted from the experimental proteome DB 260 (step 24123). The Euclidian distance between the position of the protein which is of interest and the position of each of the other experimental proteins is calculated using the quantitative profile information (step 24124). The experimental proteins are sorted (step 24125) and displayed (step 24126) in order of increasing Euclidian distance, i.e., in order of decreasing similarity. The Euclidian distance indicates the shortest distance between two points in N-dimensional space. A smaller Euclidian distance means a higher similarity in expression pattern between two proteins.
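The similar expression pattern search (steps 24124 through 24126) can be sketched as below, treating each expression profile as a vector with one quantitative value per experimental condition. The dictionary layout, the function names, and the optional `top_n` cut-off are assumptions for the example.

```python
import math


def profile_distance(a, b):
    """Euclidian distance between two expression profiles of equal length."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same conditions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similar_profiles(query_id, profiles, top_n=None):
    """Rank all other proteins by increasing distance to the query protein.

    `profiles` maps a protein identifier to its profile vector.
    Returns a list of (distance, protein_id); `top_n=None` returns all.
    """
    query = profiles[query_id]
    ranked = sorted(
        (profile_distance(query, p), pid)
        for pid, p in profiles.items()
        if pid != query_id
    )
    return ranked[:top_n]


# Illustrative profiles over three conditions.
profiles = {"q": [1, 2, 3], "a": [1, 2, 4], "b": [5, 5, 5], "c": [1, 2, 3]}
hits = similar_profiles("q", profiles)
```

Because raw intensities are compared directly, proteins with the same trend but different absolute abundance will appear distant; whether to normalize profiles first is a design choice the text leaves open.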

In a similar expression pattern search method according to the present invention, the results of the similar expression pattern search may optionally be linked to the reference proteome DB 250 (step 24127). Then, it is determined whether to search the reference proteome DB 250 for detailed information on the proteins determined to have a similar expression pattern through the search (step 24128). If it is determined to search the reference proteome DB 250 for detailed information, the reference proteome DB 250 is searched for detailed information on the similar proteins, and the searched detailed information is output (step 24129). Therefore, the user can search for proteins having a similar protein expression profile using quantitative profile information and can refer to detailed information on the proteins by searching the reference proteome DB 250 if necessary.

FIG. 29 is a flowchart illustrating a hierarchical clustering method using protein expression profiles illustrated in FIGS. 2 and 3. Referring to FIG. 29, a hierarchical clustering method using protein expression profiles according to the present invention involves a user inputting an experimental condition for hierarchical clustering (step 24131), extracting quantitative protein profile information on the experimental condition (step 24132), calculating the Euclidian distance between all pairs of proteins using the extracted quantitative protein profile information (step 24133), hierarchically clustering the proteins by similarity in expression profile using the calculated Euclidian distances (step 24134), and displaying the clustered result (step 24135).

The clustered result is linked to the reference proteome database 250 (step 24136), and it is determined whether to search the reference proteome database 250 for individual proteins in clusters (step 24137). If it is determined to search the reference proteome database 250 for individual proteins in clusters, the reference proteome database 250 is searched for detailed information on each of the clustered proteins (step 24138).

Therefore, the user can hierarchically cluster proteins by similarity in protein expression profile using the quantitative information stored in the experimental proteome database 260. In addition, the user can refer to detailed information on each of the proteins in clusters by searching the reference proteome database 250 if necessary.
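The clustering of steps 24133 and 24134 can be sketched as a simple agglomerative procedure. The sketch below uses the average linkage rule named in the claims, but it is only an illustration under stated assumptions: the function names, the merge-history output, and the sample profiles are hypothetical, and the naive repeated recalculation of distances is chosen for readability rather than efficiency.

```python
import math
from itertools import combinations


def distance(p, q):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean distance over all cross-cluster pairs of proteins."""
    total = sum(distance(profiles[a], profiles[b])
                for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def hierarchical_clustering(profiles):
    """Repeatedly merge the two closest clusters; return the merge history.

    Each history entry is (members_of_a, members_of_b, linkage_distance).
    """
    clusters = [(name,) for name in sorted(profiles)]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], profiles))
        merges.append((a, b, average_linkage(a, b, profiles)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Made-up profiles, for illustration only.
profiles = {
    "P008": [1.0, 2.0],
    "P009": [1.0, 3.0],
    "P010": [5.0, 2.0],
    "P011": [5.0, 6.0],
}
merges = hierarchical_clustering(profiles)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

The merge history carries the information a tree view needs: each merge corresponds to a branch point, and its linkage distance could be drawn as the branch length.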

FIG. 30 shows a window displaying the results of the hierarchical clustering using protein expression profiles according to an embodiment of the present invention. "Hierarchical clustering" is a statistical analysis technique for grouping multivariate data by similarity in each parameter, wherein similar objects or variables are clustered in groups according to predetermined rules. Various cluster linkage rules can be used. Single linkage methods take the distance between two clusters to be the distance between their two closest objects; after the two closest objects are clustered in a group, the object closest to either member is added to the group next. Complete linkage methods take the distance between two clusters to be the greatest distance between any two objects in the different clusters, and merge the clusters for which this distance is smallest. Average linkage methods take the distance between two clusters to be the average distance between all pairs of objects in the two clusters.
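The difference between the three linkage rules can be illustrated by computing each one for the same pair of clusters. This is a minimal sketch; the function names and the sample coordinates are assumptions made for the example.

```python
import math


def distance(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pairwise(cluster_a, cluster_b):
    """All distances between an object in cluster_a and one in cluster_b."""
    return [distance(p, q) for p in cluster_a for q in cluster_b]


def single_linkage(cluster_a, cluster_b):
    return min(pairwise(cluster_a, cluster_b))  # closest pair


def complete_linkage(cluster_a, cluster_b):
    return max(pairwise(cluster_a, cluster_b))  # farthest pair


def average_linkage(cluster_a, cluster_b):
    d = pairwise(cluster_a, cluster_b)
    return sum(d) / len(d)  # mean over all pairs


# Made-up clusters of two objects each, for illustration only.
cluster_a = [[0.0, 0.0], [0.0, 1.0]]
cluster_b = [[3.0, 0.0], [4.0, 0.0]]
single = single_linkage(cluster_a, cluster_b)
complete = complete_linkage(cluster_a, cluster_b)
average = average_linkage(cluster_a, cluster_b)
print(single, complete, average)
```

For any pair of clusters the single linkage value is never larger than the average, and the average is never larger than the complete linkage value, which is why the choice of rule changes which clusters are merged first.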

As shown in FIG. 30, the result of the hierarchical clustering is visualized as an image so that the user can easily understand how proteins correlate in their expression profiles. In other words, the distance between adjacent proteins that is calculated from their expression patterns is expressed as a branch of a tree, as shown in the tree view of FIG. 30. The degree of similarity between two proteins in expression pattern is expressed by the x-axial length of a branch in the tree view. Therefore, the similarity between two proteins in expression pattern can be verified by comparing the lengths of branches on the x-axis. For example, because the x-axial length between proteins Nos. 008 and 009 is the shortest, these two proteins are considered to have the most similar expression pattern. Such hierarchical clustering allows a user to perceive at a glance the similarity in expression pattern of all the proteins stored in a database and to easily compare degrees of similarity from the x-axial lengths of the branches of the tree. Even when protein information in a proteome database is changed or new information is added, the hierarchical clustering can be performed on all the proteins in the modified database in real time. Therefore, the user can acquire, in real time, a clustering result that reflects changes in the information of the database.

As described above, protein expression profiles can be analyzed in various ways through an expression ratio variation analysis for different experimental conditions, a similar expression pattern search, and hierarchical clustering. In addition, detailed information on the proteins that have been found to have similar protein expression profiles can be searched for throughout the reference proteome database 250.

Although the above embodiments of the present invention describe a proteome analysis and data management system capable of integrally searching for and analyzing data using two databases, a reference proteome database and an experimental proteome database, in a client/server environment, wherein the experimental proteome database is built up with reference to the format of the reference proteome database, the present invention can also be applied in a local environment or in a web environment.

The invention may be embodied in a general purpose digital computer by running a program from a computer readable medium, including but not limited to storage media such as magnetic storage media (e.g., ROMs, floppy disks, hard disks, etc.), optically readable media (e.g., CD-ROMs, DVDs, etc.) and carrier waves (e.g., transmissions over the Internet). The present invention may be embodied as a computer readable medium having a computer readable program code unit embodied therein for causing a number of computer systems connected via a network to effect distributed processing.

Industrial Applicability

As described above, according to the present invention, an experimental proteome database and an established reference database, which are physically separated from one another, can be efficiently integrated on a project basis for data storage, search, and analysis. The establishment of a database of experimental data and a protein search and analysis can be easily implemented on a project basis in a client/server environment.

Claims

What is claimed is:

1. A system for proteome analysis and data management, the system comprising: a first database storing a large amount of validated reference proteome data; a second database storing experimental proteome data obtained through experiments; a proteome identification unit identifying experimental proteome data with reference to the reference proteome data stored in the first database; a data management unit controlling an input and output of data to and from the first and second databases; an interface receiving one of experimental proteome data and a search parameter input by a user; a proteome search unit searching for experimental proteome data throughout the second database corresponding to one of the experimental proteome data and the search parameter input by the user and extracting detailed information on the experimental proteome data identified through searching from the first proteome database; and a proteome analysis unit analyzing the searched results output from the proteome search unit to characterize the identified experimental proteome data.

2. The system of claim 1, wherein the interface receives one of the experimental proteome data and the search parameter from at least one of clients connected to a network to transmit the received one to one of the data management unit, the proteome search unit, and the proteome analysis unit and transmits the searched results from the proteome search unit and the analyzed results from the proteome analysis unit via the network to the client.

3. The system of claim 1, wherein the data management unit comprises: an image analysis result loading portion loading an image analysis result with respect to a 2-D gel image which is obtained through electrophoresis; a data management unit receiving the experimental proteome data through the interface to store the received experimental proteome data in the second database and controlling editing and deleting the experimental proteome data stored in the second database; a flat file conversion portion converting the flat file format of the reference proteome data stored in the first database into a format compatible with the format of the second database such that the reference proteome data can be input to the second database; and an XML (extensible markup language) formation portion formatting the proteome data stored in the first and second databases into an XML format to be exchanged and integrated with data stored in another system.

4. The system of claim 3, wherein the data management unit manages the image analysis result and the proteome data on a project basis.

5. The system of claim 3, wherein the image analysis result loading portion loads the image analysis result in a predetermined file format.

6. The system of claim 1, wherein the first database is a reference proteome database storing detailed information on a number of identified proteins.

7. The system of claim 1, wherein the second database is an experimental proteome database comprising: a protein information table for managing detailed information on identified proteins; a project information table for managing information on a plurality of projects on different research subjects; a project user information table for managing information on users involved in a project; a normal gel information table for managing detailed information on a normal gel image integrated from a plurality of gel images; a normal gel image information table for managing the normal gel image; a standard gel information table for managing detailed information on individual gel images constituting the normal gel image; a standard gel image information table for managing individual gel images constituting the normal gel image; a normal spot information table for managing detailed information on a normal spot image of a plurality of spots that is integrated from individual gel images; and a standard spot information table for managing detailed information on individual spots constituting the normal spot image.

8. The system of claim 7, wherein the protein information table includes the entry numbers of reference proteome data stored in the first database that correspond to the identified experimental proteome data.

9. The system of claim 1, wherein the proteome search unit performs: a keyword search to search for proteome data corresponding to a keyword input by the user; an image search to search for proteome data corresponding to a spot on a 2-D gel image that is designated by the user; and an advanced search to search for similar proteome data according to at least one of a ratio of isoelectric point and molecular weight, PMF (peptide mass fingerprinting), and protein sequence information which are input by the user.

10. The system of claim 9, wherein the proteome search unit performs an advanced search by obtaining the theoretical isoelectric points and molecular weights of identified proteins from protein sequence data stored in the first database and comparing the experimental isoelectric point and molecular weight of a protein of interest with the theoretical isoelectric point and molecular weight of each of the identified proteins to search for similar proteins to the experimental protein data.

11. The system of claim 10, wherein the experimental isoelectric point and molecular weight of the experimental protein data are directly input by the user or are acquired from the position of the spot on the 2-D gel image that is designated by the user.

12. The system of claim 11, wherein the experimental isoelectric point corresponds to the x-coordinate value of the spot, the logarithm of the experimental molecular weight corresponds to the y-coordinate value of the spot, and the ratio of isoelectric point and molecular weight corresponds to a value obtained by dividing the experimental isoelectric point by the logarithm of the experimental molecular weight.

13. The system of claim 12, wherein the proteome search unit adjusts the ratio of the x-axis and the y-axis of the 2-D gel image using the ratio of isoelectric point and molecular weight, calculates the Euclidian distance between the spot and each of the identified proteins using the experimental isoelectric point and molecular weight and the theoretical isoelectric points and molecular weights, and sorts the searched similar reference proteins in order of decreasing Euclidian distance.

14. The system of claim 13, wherein the searched similar reference proteins are sorted using a sort function provided for a relational database.

15. The system of claim 1, wherein the proteome analysis unit comprises: a protein expression profile analysis portion analyzing expression variations and similar expression patterns between different experimental conditions, according to the proteome data; a comparative image analysis portion comparing at least two 2-D gel images to analyze the difference between the 2-D gel images; an advanced search result analysis portion characterizing the experimental proteome data using the results of an advanced search; a post-translational modification analysis unit analyzing the difference between the experimental proteome data and the reference proteome data of similar proteins stored in the first database to determine whether a post-translational modification has occurred; and a protein interaction analysis unit analyzing the interaction of at least two proteins to characterize the experimental proteome data.

16. The system of claim 15, wherein the protein expression profile analysis portion comprises: a protein expression ratio variation analysis portion extracting quantitative protein profile information for at least two different experimental conditions from the second database, calculating protein expression ratio variations between the different experimental conditions using the extracted quantitative protein profile information, and screening proteins having a protein expression ratio variation within a predetermined protein expression variation range; a similar expression pattern search portion searching for proteins having a similar protein expression pattern to a particular protein by calculating the Euclidian distance between the particular protein and proteins stored in the second database using their quantitative protein profile information; and a clustering portion extracting the quantitative protein profile information for a predetermined experimental condition from the second database and hierarchically clustering proteins by similarity in expression profile using the extracted quantitative protein profile information.

17. The system of claim 10 or 16, wherein the proteome analysis unit compares the experimental isoelectric point and molecular weight with the theoretical isoelectric point and molecular weight, and determines that a post-translational modification has occurred if the difference between the compared values is greater than a predetermined value.

18. The system of claim 16, wherein the quantitative protein profile information corresponds to the intensities of protein spots on a 2-D gel image.

19. The system of claim 16, wherein the protein expression ratio variation between two different experimental conditions is calculated using the following equation:

$$\frac{\text{quantitative protein information for experimental condition 1}}{\text{quantitative protein information for experimental condition 2}}$$

20. The system of claim 16, wherein the similar expression pattern search portion outputs the searched proteins by sorting in order of increasing Euclidian distance.

21. The system of claim 16, wherein the clustering portion calculates the Euclidian distance between all pairs of proteins using the extracted quantitative protein profile information and hierarchically clusters the proteins using the calculated Euclidian distances, wherein a smaller Euclidian distance indicates a higher similarity in expression profile.

22. The system of claim 21, wherein the clustering portion performs hierarchical clustering using an average linkage method.

23. The system of claim 16, wherein the clustering portion outputs the clustered result as a tree view image in which the degree of similarity in expression profile between two proteins is expressed by the x-axial length of a branch.

24. A method for establishing an experimental proteome database, the method comprising:

(a) inputting experimental proteome data;

(b) searching throughout a first database storing a large number of validated reference proteome data for similar proteome data to the experimental proteome using PMF (peptide mass fingerprinting) data, a ratio of isoelectric point and molecular weight and protein sequence information among the input experimental proteome data;

(c) performing proteome identification based on the searched result; and

(d) storing the experimental proteome data and the identified result in a second database.

25. The method of claim 24, wherein the experimental proteome data are input and managed for each research subject based on a project.

26. The method of claim 24, wherein the first database is a reference proteome database storing detailed information on a number of identified proteins.

27. The method of claim 24, wherein the second database corresponding to the experimental proteome database comprises: a protein information table for managing detailed information on identified proteins; a project information table for managing information on a plurality of projects on different research subjects; a project user information table for managing information on users involved in a project; a normal gel information table for managing detailed information on a normal gel image integrated from a plurality of gel images; a normal gel image information table for managing the normal gel image; a standard gel information table for managing detailed information on individual gel images constituting the normal gel image; a standard gel image information table for managing individual gel images constituting the normal gel image; a normal spot information table for managing detailed information on a normal spot image of a plurality of spots that is integrated from individual gel images; and a standard spot information table for managing detailed information on individual spots constituting the normal spot image.

28. The method of claim 27, wherein the protein information table includes the entry numbers of reference proteome data stored in the first database that correspond to the experimental proteome data.

29. A method for storing and editing experimental proteome data, the method comprising:

(a) determining whether to create a new project file;

(b) if it is determined in step (a) to create a new project file, inputting project management information required to create the new project file;

(c) determining whether to retrieve data of a previous project file for the new project file;

(d) if it is determined in step (c) to retrieve data from the previous project file, loading the data of the previous project file that are common to the new project;

(e) if it is determined in step (a) not to create a new project file, determining whether to edit a previous project file stored in the experimental proteome database; and

(f) if it is determined in step (e) to edit a previous project file, selecting a project file to be edited among a number of previous project files stored in the experimental proteome database and editing the data of the selected project file.

30. The method of claim 29, wherein if an image file can be commonly used for both projects, the common image file is loaded from the previous project file for the new project.

31. The method of claim 29, wherein any kind of data, including numeric data, symbolic data, and image file data, can be edited in step (f).

32. The method of claim 30, wherein the image file is a 2-D gel image obtained through electrophoresis and is loaded in a predetermined file format.

33. A proteome analysis method comprising:

(a) selecting a search method;

(b) if a keyword search option is selected as the search method, inputting a keyword for searching;

(c) searching a first database storing experimental proteome data and a second database storing a large amount of validated reference proteome data for proteome data corresponding to the input keyword;

(d) if an image search option is selected as the search method, loading a 2-D gel image for searching;

(e) designating a spot of protein on the 2-D gel image;

(f) searching the first and second databases for proteome data corresponding to the position of the designated spot;

(g) if an advanced search option is selected as the search method, searching the first and second databases for similar proteome data by using at least one of peptide mass fingerprinting (PMF) data, a ratio of isoelectric point and molecular weight, and protein sequence information among the proteome data as a search parameter; and

(h) displaying the search result obtained in step (c), (f), or (g).

34. The method of claim 33, further comprising:

(i) analyzing protein expression ratio variations and similar expression patterns between different experimental conditions using the proteome data;

(j) comparing at least two 2-D gel images to analyze the difference between the 2-D gel images;

(k) analyzing the advanced search result obtained in step (g) to characterize a protein of interest;

(l) analyzing the difference between the experimental proteome data and the reference proteome data of similar identified proteins stored in the first database to determine whether a post-translational modification has occurred; and

(m) analyzing the interaction of at least two proteins to characterize the protein of interest.

35. The method of claim 33, wherein the keyword, the 2-D gel image, the PMF data, the ratio of isoelectric point and molecular weight, and the protein sequence information are input via a network from at least one client, and the searched result is provided to the client via the network.

36. A similar protein search method comprising:

(a) determining whether to use a 2-D gel image obtained through electrophoresis for a similar protein search;

(b) if it is determined to use a 2-D gel image for a similar protein search, designating a spot of protein of interest on the 2-D gel image;

(c) obtaining the experimental isoelectric point and molecular weight of the protein from the coordinate value of the spot;

(d) if it is determined not to use a 2-D gel image for a similar protein search, directly inputting the experimental isoelectric point and molecular weight of a protein of interest, a search range, name of species, and a ratio of isoelectric point and molecular weight;

(e) adjusting the ratio of the x-axis and y-axis of the 2-D gel image by the ratio of isoelectric point and molecular weight and calculating the Euclidian distance between the protein of interest and each identified protein stored in a reference proteome database using the experimental isoelectric point and molecular weight obtained in step (c) or (d) and the theoretical isoelectric points and molecular weights extracted from the reference proteome database storing a large amount of validated reference proteome data; and

(f) sorting and outputting the searched proteins in order of increasing Euclidian distance.

37. The method of claim 36, wherein the experimental isoelectric point corresponds to the x-coordinate value of the spot, the logarithm of the experimental molecular weight corresponds to the y-coordinate value of the spot, and the ratio of isoelectric point and molecular weight corresponds to a value obtained by dividing the experimental isoelectric point by the logarithm of the experimental molecular weight.

38. The method of claim 36, wherein in step (f), the searched proteins are sorted using a sort function provided for a relational database.

39. The method of claim 36, further comprising (g) comparing the experimental isoelectric point and molecular weight of the identified protein with the theoretical isoelectric point and molecular weight thereof and determining that a post-translational modification has occurred if the difference between the compared values is greater than a predetermined value.

40. A protein expression ratio variation analysis method comprising:

(a) defining at least two different experimental conditions and a protein expression variation range;

(b) extracting quantitative protein information for the defined experimental conditions from a first database storing experimental proteome data obtained through experiments;

(c) calculating protein expression ratio variations of the extracted quantitative protein information for the defined experimental conditions; and

(d) screening proteins having a protein expression ratio variation within the defined protein expression variation range.

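41. The method of claim 40, wherein the protein expression ratio variation between two different experimental conditions is calculated using the following equation:

$$\frac{\text{quantitative protein information for experimental condition 1}}{\text{quantitative protein information for experimental condition 2}}$$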

42. The method of claim 40, further comprising:

(e) linking the results of step (d) to a second database storing a large number of validated reference proteome data; and

(f) searching throughout the second database for detailed information on each of the proteins screened in step (d).

43. A similar expression pattern search method comprising:

(a) selecting a protein of interest for a similar expression pattern search;

(b) extracting quantitative protein profile information on a plurality of proteins from a first database storing experimental proteome data obtained through experiments;

(c) calculating the Euclidian distance between the protein of interest and each of the proteins stored in the first database using the quantitative protein profile information of the protein of interest and the extracted quantitative protein profile information; and

(d) sorting and outputting the proteins stored in the first database in order of increasing Euclidian distance.

44. The method of claim 43, further comprising:

(f) searching throughout the second database for detailed information on each of the proteins output in step (d).

45. A method for hierarchically clustering proteins by similarity in protein expression profile, the method comprising:

(a) defining an experimental condition for clustering;

(b) extracting the quantitative protein information of proteins for the experimental condition from a first database storing experimental proteome data obtained through experiments;

(c) calculating the Euclidian distance between all pairs of proteins using the extracted quantitative protein information;

(d) hierarchically clustering the proteins using the calculated Euclidian distances, wherein a smaller Euclidian distance indicates a higher similarity in expression profile; and

(e) displaying the clustered result.

46. The method of claim 45, wherein step (d) is performed using an average linkage method.

47. The method of claim 45, wherein the clustered result is output as a tree view image in which the degree of similarity in expression profile between two proteins is expressed by the x-axial length of a branch.

48. The method of claim 45, further comprising:

(f) linking the clustered result to a second database storing a large number of validated reference proteome data; and

(g) searching throughout the second database for detailed information on each of the clustered proteins.

49. A computer readable medium having embodied thereon a computer program for the method according to any one of claims 24 through 48.