SYSTEM FOR MOLECULE IDENTIFICATION
Field of the Invention
The present invention relates to a method and tools for the identification of unknown molecules, and, particularly, a method and tools for molecule identification that provide a solution to the problem of random mass matching.
Background of the Invention
Identification of a molecule or several molecules in a sample is a technical problem in various fields of research and technology. Molecule identification problems can concern e.g. the tracing of unwanted substances in the environment and the studies of metabolic pathways and disease-state markers in drug development projects. Molecule identification problems can sometimes be solved by the appropriate application of instruments and methods for the acquisition and processing of data from a sample containing the molecules to be identified. One example of data from a sample is mass data. Molecular or molecular constituent mass data can be obtained by a variety of techniques including techniques such as ultra-centrifugation, electrophoresis, and mass spectrometry. Experimental mass data from the sample analyzed is often compared with database-information about known or hypothetical molecules.
In particular, mass spectrometry (MS) combined with database searching has proven to be a useful approach for molecule identification. For example, MS of protein digests combined with searching in protein and DNA sequence databases is a method of choice for the identification of proteins in proteomics projects. The field of proteomics, which includes the elucidation of protein function under various cell conditions, is believed to form a future basis for drug design. MS-protein identification involves cleavage of proteins with an enzyme having high digestion specificity (usually trypsin), whereupon the resulting proteolytic products are subjected to mass analysis by either matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) or electrospray ionization mass spectrometry (ESI-MS). The experimentally determined masses are then compared with masses of peptides that individual proteins in a database would yield if they were cleaved by the same enzyme as was used in the experiment. In some experiments, individual proteolytic peptide ions are isolated and subjected to fragmentation and fragment mass analysis in the mass spectrometer. The resulting fragment masses are then compared with hypothetical proteolytic peptide fragment masses of the proteins in a database. The protein is identified based on an evaluation of either or both of these comparisons.
Mass spectrometry determines a peptide mass mi to an accuracy ±Δmi, with Δmi/mi typically >30 ppm. Within the mass range mi ± Δmi, proteolytic peptide masses of several proteins in a genome database can match. Hence, an unmodified peptide will match randomly with several proteins in the database, in addition to the true match with the actual protein present in the sample, and a modified peptide will yield only random matches. Consequently, a database search using mass spectrometry information will not always identify a protein unambiguously. Therefore, in order to perform accurate and reliable molecule identification, instruments for obtaining mass data must be appropriately linked with the use of other technical resources for the comparison of mass data and mass information obtained from a database. The link can be a system that makes use of a method including means for comparison of data and database information, preferably operated via a computer.
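The matching criterion described above can be illustrated with a short sketch. It counts how many measured masses mi have at least one database-derived mass within mi ± Δmi, with Δmi expressed as a relative accuracy in ppm. The function name `count_matches` and the example values are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left, bisect_right


def count_matches(measured, theoretical, ppm=30.0):
    """Count measured masses with at least one theoretical mass within +/- ppm."""
    theo = sorted(theoretical)
    k = 0
    for m in measured:
        dm = m * ppm * 1e-6                 # absolute tolerance for this mass
        lo = bisect_left(theo, m - dm)      # first theoretical mass >= m - dm
        hi = bisect_right(theo, m + dm)     # first theoretical mass >  m + dm
        if hi > lo:                         # at least one mass inside the window
            k += 1
    return k


if __name__ == "__main__":
    measured = [1000.0, 1500.0, 2000.0]
    theoretical = [1000.02, 1500.1, 2500.0]
    print(count_matches(measured, theoretical, ppm=30.0))
    print(count_matches(measured, theoretical, ppm=100.0))
```

A wider tolerance admits more matches, which is why poorer mass accuracy increases the number of random matches.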
Despite the rapidly increasing impact of mass spectrometric protein identification on proteomic research, the problem of accurately taking the phenomenon of random mass matching into account in a database search system has been overlooked. As increasingly complex processes are explored by MS-based protein identification, the use of optimized procedures will become critical. An optimized protein identification system cannot be designed if the random mass-matching process is ignored or accounted for inappropriately.
State of the Art
Identification of proteins by the above-described approach requires a scheme for determining the best match between the experimental data and a sequence in the database. Existing schemes for determining the best match include ranking by number of matches (W.J. Henzel et al., Proc. Natl. Acad. Sci. USA 90, 5011, 1993), a scoring system based on the observed frequency of peptides from all proteins in a database in a given molecular weight range (the so-called "MOWSE score"; D.J.C. Pappin et al., Current Biology 6, 327, 1993), and a scheme based on Bayesian probabilities (W. Zhang et al., Anal. Chem. 72, 2482, 2000).
None of these schemes takes the problem of random mass matching appropriately into account. The lack of an appropriate account for the random mass matching hinders optimum performance of molecule identification procedures, since the random mass matching can cause false identification results - especially when the quality of the mass spectrometry data is poor.
Summary of the Invention
The object of the present invention is to overcome the shortcomings of the above-mentioned schemes, i.e., to provide a method that solves the problem of random mass matching. This and other objects have been met by providing a system including methods of determining the probability of a particular score arising from random mass matching of a molecule, and of utilizing the computed probability to rank molecules. The method comprises: a) determining the number of matches between a database molecule and mass data; b) computing the probability that a database molecule would yield a particular number of matches by chance; c) computing a score based on one or several probabilities computed in step (b); d) comparing the scores of molecules in a molecule database; and e) identifying the molecule or molecules that yield(s) the best score(s).
The invention further provides a method of generating a frequency function of the number of matches for random (false) molecule identification for any experimental condition. The method comprises: a) defining a sub-population of the molecules contained in a database; b) computing the probability that a molecule in this sub-population would yield a particular number of matches by chance; c) computing a probability that all molecules in the sub-population would yield at most a particular number of matches by chance; d) computing the probability that at least one molecule in the sub-population would yield at least a particular number of matches by chance; and e) determining the relative frequency of each number of matches by using the probability computed in step (d) for each number of matches and generating therefrom a frequency function of the number of matches for random protein identification.
Brief Description of the Drawings
Fig. 1 shows frequencies (i.e., number of matching proteins) of various tryptic peptide masses in a database.
Fig. 2 shows mass distribution peaks for tryptic peptides.
Fig. 3 shows the performance of an implementation of one embodiment of the invention in comparison with state of the art systems for protein identification. The graph displays results from simulations employing the invention (denoted Probity), a Bayesian method, and a method based on the number of matches.
Fig. 4 shows score frequency functions generated by the invention in comparison with score frequency functions generated by simulation.
Detailed Description of the Invention
Many applications of molecule identification are inherently large-scale. Examples of large-scale molecule identification can be found in proteomics projects, where thousands of proteins from cells are to be identified, or cells are screened for molecular markers of states of disease. The ultimate goal of molecule identification procedures is to rely on simple, rapid and automated procedures and instrumentation. The technical solutions of the system that links and compares mass data with database information are of key importance to the design of instruments for automated molecule identification, since the system used will strongly influence the capability of obtaining a high relative frequency of true identification results, which is particularly critical when the quality of the data is poor. Furthermore, automated identification instrumentation demands that the quality of identification results be assessed automatically by the use of a significance test (J. Eriksson et al., Anal. Chem. 72, 999, 2000). However, a reliable automated protein identification system cannot be designed if the random mass-matching process is ignored or accounted for inappropriately. One object of the present invention is to provide a system that utilizes methods that allow more accurate molecule identification and more accurate and rapid significance testing of identification results. The method according to the invention appropriately takes into account the phenomenon of random matching, and is therefore well suited for implementation in an automated molecule identification system.
A particular concern regarding large-scale molecular identification is the time required to obtain the identification result together with a quality assessment of this result. A quality assessment can be accomplished by a significance test, which requires knowledge of frequency functions describing scores for false results. Such frequency functions are currently obtained by simulation of random molecular identification. However, since the time needed to derive a frequency function by simulation is about 1000 times longer than with the use of the invention, there is a need to derive such a frequency function from an analytical expression. In one embodiment of the invention, such an analytical expression for the derivation of a frequency function is provided.
The methods according to the invention are well suited for, but not limited to, applications in which the molecules are biological molecules that can exist in cells of organisms.
Biological molecules include any biological polymer that can be degraded into constituent parts. The degradation is preferably into constituent parts at predictable positions to form predictable masses. Examples of biological molecules include proteins, nucleic acid molecules, polysaccharides and carbohydrates.
An experimental biological molecule is a biological molecule that is to be identified; the experimental biological molecule can also be referred to as an unknown biological molecule. A theoretical biological molecule is a known biological molecule described in a database.
Proteins are polymers of amino acids. Constituent parts of proteins comprise amino acids. A protein typically contains approximately at least ten amino acids, preferably at least 50 amino acids and more preferably at least 100 amino acids.
Nucleic acids are polymers of nucleotides. Constituent parts of nucleic acids comprise nucleotides. Typically, a nucleic acid contains at least 100 nucleotides, preferably at least 500 nucleotides.
Polysaccharides are polymers of monosaccharides. Constituent parts of polysaccharides comprise one or more monosaccharides. Typically, a polysaccharide contains at least five monosaccharides, preferably at least ten monosaccharides.
Mass data of biological molecules are quantifiable information about the masses of the constituent parts of the biological molecule. Mass data include individual mass spectra and groups of mass spectra. The mass spectra can be in the form of peptide maps, oligonucleotide maps or oligosaccharide maps.
The method of the present invention includes generating experimental mass data for the experimental molecule within a certain mass range. Mass data include the measured masses. The method also includes generating theoretical mass data in the same mass range. In one embodiment, the experimental mass data is a subset of the experimental mass data.
For example, mass data for molecules can be generated in any manner that provides mass data within a certain accuracy. Examples include matrix-assisted laser desorption/ionization mass spectrometry, electrospray ionization mass spectrometry, chromatography and electrophoresis. Mass data can also be generated by a general-purpose computer configured by software or otherwise.
For the purposes of the present invention the mass data, for example a peptide mass, mi, is determined to an accuracy ±Δmi, with Δmi/mi preferably <10,000 ppm, more preferably <100 ppm, and most preferably <30 ppm.
A step in generating mass data of a molecule may include first cleaving the molecule into constituent parts. Biological molecules may be cleaved by methods known in the art. Preferably, the biological molecules are cleaved into constituent parts at predictable positions to form predictable masses. Methods of cleaving include chemical degradation of the biological molecules. Biological molecules may be degraded by contacting the biological molecule with any chemical substance.
For example, proteins may be predictably degraded into peptides by means of cyanogen bromide and enzymes, such as trypsin, endoproteinase Asp-N, V8 protease, endoproteinase Arg-C, etc. Nucleic acids may be predictably degraded into constituent parts by means of restriction endonucleases, such as Eco RI, Sma I, BamH I, Hinc II, etc. Polysaccharides may be degraded into constituent parts by means of enzymes, such as maltase, amylase, alpha-mannosidase, etc.
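The generation of theoretical mass data by predictable cleavage can be sketched as follows for trypsin. The sketch assumes the commonly used rule that trypsin cleaves C-terminal to lysine (K) or arginine (R) except when the next residue is proline (P), and it uses standard monoisotopic residue masses. The function names and the `missed_cleavages` parameter are hypothetical; the latter stands in for different assumptions about the completeness of the digestion.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the sum of residue masses


def tryptic_peptides(sequence, missed_cleavages=0):
    """Cleave after K or R (not before P); allow up to `missed_cleavages` skipped sites."""
    if not sequence:
        return []
    sites = [0]
    for i, aa in enumerate(sequence[:-1]):
        if aa in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(sequence))
    peptides = []
    for a in range(len(sites) - 1):
        for b in range(a + 1, min(a + 2 + missed_cleavages, len(sites))):
            peptides.append(sequence[sites[a]:sites[b]])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


if __name__ == "__main__":
    for pep in tryptic_peptides("AKRPGKM", missed_cleavages=1):
        print(pep, round(peptide_mass(pep), 4))
```

The number of peptides returned for a database protein under a given completeness assumption corresponds to the count of molecular pieces that enters the probability computations described later.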
In the present invention a mass range (mmin, mmax) is determined for the experimental mass data. The mass range can be any mass range of the mass data. In one embodiment, the mass range is the minimum and maximum measured masses of the experimental mass data for a molecule.
A molecule database is any compilation of information about characteristics of molecules. A molecule database can be a biological molecule database. Databases are the preferred method for storing both polypeptide amino acid sequences and the nucleic acid sequences that code for these polypeptides. The databases come in a variety of different types that have advantages and disadvantages when viewed as the hypothesis for a polypeptide identification experiment.
While the "database entry" for an amino acid sequence may appear to be a simple text file for a user browsing for a particular polypeptide, many databases are organized into very flexible, complicated structures. The detailed implementation of the database on a particular system may be based on a collection of simple text files (a "flat-file" database), a collection of tables (a "relational" database), or it may be organized around concepts that stem from the idea of a protein, gene, or organism (an "object-oriented" database). Protein mass data may be predicted from nucleic acid sequence databases.
Alternatively, protein mass data may be obtained directly from protein sequence databases that contain a collection of amino acid sequences represented by a string of single-letter or three-letter codes for the residues in a polypeptide, starting at the N-terminus of the sequence. These codes may contain nonstandard characters to indicate ambiguity at a particular site (such as "B" indicating that the residue may be "D" (aspartic acid) or "N" (asparagine)). The sequences typically have a unique number-letter combination associated with them that is used internally by the database to identify the sequence, usually referred to as the accession number for the sequence.
Databases may contain a combination of amino acid sequences, comments, literature references, and notes on known posttranslational modifications to the sequence. A database that contains these elements is referred to as "annotated". Annotated databases are used if some functional or structural information is known about the mature protein, as opposed to a sequence that is known only from the translation of a stretch of nucleic acid sequence. Non-annotated databases only contain the sequence, an accession number, and a descriptive title.
The background information known about an experimental molecule by which the database search can be constrained can include any information. Some examples of background information include information about the species of an experimental biological molecule, knowledge or an assumption about the mass of the experimental biological molecule, and the isoelectric point of the experimental biological molecule.
For example, the observed molecular mass or the observed isoelectric point of a protein can be used in combination with the measured masses of peptides generated by proteolysis to constrain the search for a polypeptide. In particular, the comparison between the theoretical mass data of the database proteins and the mass data of the unknown protein may be constrained to only those proteins of the database which are within a chosen mass range. The chosen mass range is preferably within 50% of the mass of the unknown protein, more preferably within 35%, most preferably within 25%. Similarly, the comparison between the theoretical mass data of the database proteins and the mass data of the unknown protein may be constrained to only those proteins of the database which are within a chosen isoelectric point range. The isoelectric point (pI) of a protein is the pH at which its net charge is zero. The chosen isoelectric point range is preferably within 50% of the isoelectric point of the unknown protein, more preferably within 35%, most preferably within 25%.
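A minimal sketch of such a constrained search is given below. It assumes that "within 25%" means that the database value deviates from the observed value by at most 25% of the observed value; the `Entry` record, its field names, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    accession: str   # database accession number
    mass: float      # theoretical molecular mass in Da
    pI: float        # theoretical isoelectric point


def within_fraction(value, reference, fraction):
    """True if value deviates from reference by at most fraction * reference."""
    return abs(value - reference) <= fraction * reference


def constrain(entries, observed_mass=None, observed_pI=None,
              mass_fraction=0.25, pI_fraction=0.25):
    """Keep only entries compatible with the available background information."""
    kept = []
    for e in entries:
        if observed_mass is not None and not within_fraction(e.mass, observed_mass, mass_fraction):
            continue
        if observed_pI is not None and not within_fraction(e.pI, observed_pI, pI_fraction):
            continue
        kept.append(e)
    return kept


if __name__ == "__main__":
    db = [Entry("P1", 50000.0, 6.0), Entry("P2", 80000.0, 6.5), Entry("P3", 55000.0, 9.5)]
    print([e.accession for e in constrain(db, observed_mass=60000.0, observed_pI=6.2)])
```

Constraints that are not supplied are simply skipped, so the same function covers searches constrained by mass only, by isoelectric point only, or by both.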
Optionally, further information of the experimental biological molecule, such as a protein's sequence, is obtained by generating fragment mass data of the experimental and theoretical biological molecules. Fragment mass data for a peptide can be generated in any manner which provides fragment mass data within a certain accuracy. Experimental conditions include the type of energy used to generate the fragment mass data. Vibrational excitation energy can be used. The vibrational excitation may be generated by collisions of the peptide with electrons, photons, gas molecules or a surface. Electronic excitation can be used. The electronic excitation may be generated by collisions of the peptide with electrons, photons, gas molecules (e.g. argon) or a surface.
In another example, the experimental fragment mass spectrum of a peptide from an enzymatically digested unknown protein is compared with the theoretical masses calculated by applying the rules for the specificity of the enzyme, and the rules for the fragmentation as known to those of ordinary skill in the art, to the amino acid sequence of a database protein.
Fragment mass data for the purposes of this invention can be generated by using multidimensional mass spectrometry (MS/MS), also known as tandem mass spectrometry. A number of types of mass spectrometers can be used, including a triple-quadrupole mass spectrometer, a Fourier-transform ion cyclotron resonance mass spectrometer, a tandem time-of-flight mass spectrometer, and a quadrupole ion trap mass spectrometer. A single peptide from a protein digest is subjected to MS/MS measurement and the observed pattern of fragment ions is compared to the patterns of fragment ions predicted from database sequences.
In one embodiment, the invention provides a method to determine the probabilities for the scores that a particular molecule in a database can yield by chance when compared with mass data. The method can operate under a variety of experimental and database search constraints. The score can be the number of matches between masses derived from known or hypothetical molecules or molecular constituents in a database and masses in mass data from one or several known or unknown molecules, or molecular constituents. The score can also result from a computation that utilizes the number of matches.
In one embodiment, the invention provides a method to extract information about the molecules in a database. Examples of information that can be extracted from a database are total molecular mass, charge, isoelectric point, hydrophobicity and known or hypothetical chemical modification, and mass, charge, isoelectric point, hydrophobicity and known or hypothetical chemical modification of molecular constituents.
In one embodiment, the invention provides a method to perform actions on molecules in the database that are supposed to mimic actions occurring in an experiment. Examples of actions are degradation of molecules into molecular constituents by hydrolysis, where hydrolysis can result from the activity of chemicals or enzymes. The method can also perform actions that mimic experimental actions on molecular constituents, for example the fragmentation of an excited molecular constituent into smaller pieces.
In one embodiment, the invention provides a method to derive a number of molecular pieces, ku, resulting from an action assumed to mimic an experimental situation. The pieces can be molecular constituents, such as proteolytic peptides resulting from enzymatic digestion of a protein, where different assumptions can be made concerning the degree of completeness of the enzymatic digestion. The pieces can be molecular constituents in the form of fragments of molecular constituents, e.g. fragments of proteolytic peptides.
In one embodiment, the invention provides a method to organize the masses of molecules or molecular constituents or fragments thereof. Examples of such organization are given in Figs. 1 and 2, where Fig. 1 displays the number of proteins in a database that match a given proteolytic peptide mass and Fig. 2 displays the clustered distribution of proteolytic peptide masses. Masses clustering in this or similar fashions will be referred to as a mass distribution peak. Mass distribution peaks can be found for all molecules that contain a limited number of different atoms (e.g. C, H, N, O, S).
In one embodiment, the invention provides a method for defining mass regions wherein the frequency of various masses can be determined. The method defines fi as the fraction of masses of molecular constituents or fragments that falls into a mass region i.
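As an illustration, the fractions fi can be obtained by binning the theoretical masses. The sketch below assumes q equal-width mass regions between mmin and mmax, which is only one possible way of defining the regions; the function name is hypothetical.

```python
def mass_region_fractions(masses, m_min, m_max, q):
    """Return [f_1, ..., f_q]: fraction of in-range masses falling in each region."""
    width = (m_max - m_min) / q
    counts = [0] * q
    for m in masses:
        if m_min <= m <= m_max:
            # The upper boundary m_max is assigned to the last region.
            i = min(int((m - m_min) / width), q - 1)
            counts[i] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * q
    return [c / total for c in counts]


if __name__ == "__main__":
    masses = [800.0, 900.0, 1500.0, 2900.0, 3500.0]
    print(mass_region_fractions(masses, m_min=800.0, m_max=3000.0, q=2))
```

Masses outside (mmin, mmax) are ignored, so the returned fractions sum to one over the chosen mass range.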
In one embodiment, the invention provides a method that determines a probability pi that a particular molecule in a database will be found in a randomly chosen mass distribution peak in the mass region i:
$$p_i = F(k_u, m_i, c),$$
where F is a function, mi is a mass region, and c denotes experimental and database search constraints. In one embodiment pi is given by:
which describes the probability that a molecular constituent from a particular molecule characterized by ku will be found in a single randomly chosen mass distribution peak. The denominator of the expression above describing pi represents the number of mass distribution peaks within the mass region i.
In one embodiment the invention provides a method of determining the probability, pi', of finding a molecular constituent originating from a particular molecule characterized by ku within a region ±Δm around a randomly chosen molecular constituent mass m:
$$p_i' = p_i \cdot \delta(m_i, \Delta m),$$
where δ(mi, Δm) denotes a function that depends on the shape of the mass distribution peak and mi refers to a mass region. δ(mi, Δm) can be interpreted as a statistical measure of the number of molecular constituent masses that can be found within ±Δm from a randomly chosen molecular constituent mass. The mass accuracy Δm can be different for different mass regions, in which case it is denoted by Δmi.
In one embodiment, the invention provides a method to determine δ(mi, Δm) by simulation of the relative frequency of masses around a randomly chosen mass in a mass distribution. In one embodiment, δ(mi, Δm) is determined by integration of a function describing molecular constituent mass distributions and normalization to the total number of molecular constituent masses in a mass distribution peak. In one embodiment, δ(mi, Δm) is determined by direct counting followed by normalization.
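The direct-counting variant can be sketched as follows. This is one possible reading, not a definitive implementation: for every mass in a mass distribution peak, the masses of the same peak lying within ±Δm are counted (the chosen mass itself included, which is an assumption), the counts are averaged over all choices, and the average is normalized to the total number of masses in the peak. The function name is hypothetical.

```python
from bisect import bisect_left, bisect_right


def delta_by_counting(peak_masses, dm):
    """Average fraction of a peak's masses found within +/- dm of a member mass."""
    masses = sorted(peak_masses)
    n = len(masses)
    if n == 0:
        return 0.0
    total = 0
    for m in masses:
        # Number of peak masses inside the window [m - dm, m + dm].
        total += bisect_right(masses, m + dm) - bisect_left(masses, m - dm)
    # Average over the n choices of m, then normalize by the n masses in the peak.
    return total / (n * n)


if __name__ == "__main__":
    peak = [1000.40, 1000.45, 1000.50, 1000.55, 1000.60]
    print(delta_by_counting(peak, dm=0.06))
    print(delta_by_counting(peak, dm=1.0))
```

Under this reading the value approaches 1 when Δm is wide enough to cover the whole peak and decreases as the mass accuracy improves.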
In one embodiment of the invention, a finite number of mass regions between mmin and mmax is employed, each having an individually defined pi'.
In one embodiment the probabilities pi' are employed to compute a total probability, p(k), for an individual molecule in the database to match randomly k out of n masses, where n is the number of masses in the mass data:
$$p(k) = G(p_i', k, n, c'),$$
where G is a function and c' denotes experimental and database search constraints.
In one embodiment of the invention p(k) is given by:
where q denotes the number of mass regions, n1 denotes the number of masses in the mass data that are in mass region 1, n2 denotes the number of masses in the mass data that are in mass region 2, etc., and ki, where i = 1, 2, ..., q, denotes the number of matches in mass region i. The values of ki are all combinations of values that satisfy the constraint $\sum_{i=1}^{q} k_i = k$.
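To make the role of these quantities concrete, the sketch below computes p(k) under an explicit assumption that is not taken from the text: the masses in region i are taken to match independently, each with probability pi', so that the number of matches in each region is binomially distributed. Under that assumption,

$$p(k) = \sum_{k_1 + \dots + k_q = k} \; \prod_{i=1}^{q} \binom{n_i}{k_i} (p_i')^{k_i} (1 - p_i')^{n_i - k_i},$$

which is evaluated here by convolving the per-region distributions instead of enumerating all combinations of ki. The function names are hypothetical.

```python
from math import comb


def binomial_pmf(n, p):
    """Probabilities of 0..n successes in n independent trials with probability p."""
    return [comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(n + 1)]


def match_probabilities(n_per_region, p_per_region):
    """Return [p(0), ..., p(n)] for n = sum(n_per_region), assuming independence."""
    dist = [1.0]                                  # zero masses: zero matches with certainty
    for n_i, p_i in zip(n_per_region, p_per_region):
        region = binomial_pmf(n_i, p_i)
        new = [0.0] * (len(dist) + n_i)
        for a, pa in enumerate(dist):             # a matches in the regions processed so far
            for b, pb in enumerate(region):       # b matches in the current region
                new[a + b] += pa * pb
        dist = new
    return dist


if __name__ == "__main__":
    p = match_probabilities(n_per_region=[5, 10, 5], p_per_region=[0.02, 0.05, 0.01])
    print([round(x, 6) for x in p[:4]])
```

The convolution visits each combination of ki implicitly, so the cost grows with the product of n and the region sizes and not with the number of combinations.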
In one embodiment of the invention, a score related to random matching is employed in the process of ranking molecules in a database.
In one embodiment of the invention, the probability p(k) is employed in the process of ranking molecules in a database. A whole database or a fraction of a database is processed and organized to allow the computation of p(k) for molecules in the database. Here k denotes the number of matches between the masses of molecular constituents of each database molecule investigated and masses in the mass data. The molecules in the database can be known or hypothetical. The molecule or molecules producing the mass data can be known or unknown.
In one embodiment of the invention, the ranking of the molecules in a database is based on the score S(p(k)), where S is a function.
In one embodiment of the invention
where c is a constant or a mathematical function. When c=1, S(p(k)) can be interpreted as the probability that a molecule in the database would yield at least k random matches with the mass data.
In one embodiment of the invention, the molecule in the database that yields the lowest S(p(k)) for k matches with the mass data is given the highest rank. The molecule in the database yielding the second lowest S(p(k)) for k matches is given the second highest rank and so on. The identification of a molecule or molecules is among the molecules having the highest ranks. The highest ranks can be the top ranked molecule only, but it can also be more molecules than the top ranked, e.g. the top two, top three, top four, top five, top ten, or top 100. The number of ranked molecules that are considered as identification results can also be determined by the use of a significance test.
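The ranking step can be sketched as follows. The score is assumed here to be c times the sum of p(j) over j from k to n, which is chosen only because it agrees with the stated interpretation for c=1 (the probability of at least k random matches); the function names and the candidate tuples are hypothetical.

```python
def tail_score(p, k, c=1.0):
    """Assumed score: c * sum of p(j) for j >= k, where p = [p(0), ..., p(n)]."""
    return c * sum(p[k:])


def rank_molecules(candidates):
    """candidates: list of (name, p, k). Lowest score receives the highest rank."""
    scored = [(tail_score(p, k), name) for name, p, k in candidates]
    return [name for _, name in sorted(scored)]


if __name__ == "__main__":
    candidates = [
        ("B", [0.90, 0.09, 0.01], 1),        # 1 match, fairly likely by chance
        ("A", [0.50, 0.30, 0.15, 0.05], 3),  # 3 matches, unlikely by chance
    ]
    print(rank_molecules(candidates))
```

Because each molecule has its own p(k), two molecules with the same number of matches can receive different ranks, which is the point of accounting for random matching.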
In one embodiment, the invention provides a method of generating a frequency distribution of scores for a particular experimental condition, wherein the scores relate to random identifications of proteins.
A frequency distribution is any compilation of the observed values of the variable being studied and how many times each value is observed. Frequency distributions can be in the form of a table of listings, a bar graph, a histogram, a frequency polygon, or a continuous curve. Functions derived from frequency distributions can be continuous (probability density function) or discrete (probability mass functions). Cumulative distribution functions of each type of function can also be derived.
In one embodiment, the frequency function is generated for a sub-population with H members from a database.
In one embodiment, the sub-population is selected based upon values of ku. In one embodiment, the frequency function is generated for molecules ranked upon their number of matches.
In one embodiment, the frequency function is f(S), where S is a score. In one embodiment, S is the number of random matches. In one embodiment S = k and
where p(k) has the meaning stated above.
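The steps summarized earlier for generating the frequency function (probability that all molecules of the sub-population yield at most k matches, probability that at least one yields at least k matches, and the relative frequency derived from the latter) can be sketched as below. The sketch assumes that the H molecules match independently, that each molecule's distribution [p(0), ..., p(n)] is already available, and that the relative frequency of k is the difference between the step (d) probabilities for k and k+1; these are assumptions for illustration, and the function name is hypothetical.

```python
def frequency_function(distributions):
    """distributions: H lists [p(0), ..., p(n)] of equal length, one per molecule.

    Returns [f(0), ..., f(n)], the assumed relative frequency of the highest
    number of random matches found in the sub-population.
    """
    n = len(distributions[0]) - 1
    # Probability that every molecule yields at most k matches.
    all_at_most = []
    for k in range(n + 1):
        prod = 1.0
        for p in distributions:
            prod *= sum(p[: k + 1])
        all_at_most.append(prod)
    # Probability that at least one molecule yields at least k matches.
    at_least_one = [1.0] + [1.0 - all_at_most[k - 1] for k in range(1, n + 1)]
    # Relative frequency of k as the best random result.
    return [
        at_least_one[k] - (at_least_one[k + 1] if k < n else 0.0)
        for k in range(n + 1)
    ]


if __name__ == "__main__":
    # Two hypothetical molecules, each matching a single mass with probability 0.5.
    print(frequency_function([[0.5, 0.5], [0.5, 0.5]]))
```

Because every quantity is obtained from closed-form sums and products of p(k), no repeated random database searches are required to obtain the function.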
Those of ordinary skill in the art will recognize that the present invention has wide applicability for identification of molecules. Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the present invention.