CN107818329B - Mass spectrum data analysis method - Google Patents
- Publication number
- CN107818329B (application number CN201710674793.9A)
- Authority
- CN
- China
- Prior art keywords
- mass
- mass spectrum
- data
- sample
- spectrum data
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
Abstract
The invention provides a mass spectrum data analysis method which comprises a sample data acquisition step, a sample data preprocessing step, a data model construction and cross validation step, a data model optimization step and a sample group judgment step.
Description
Technical Field
The invention relates to the field of machine learning application, in particular to a mass spectrum data analysis method.
Background
Machine learning (ML) is an interdisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how a computer can simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures, thereby continuously improving its own performance. Machine learning is the core of artificial intelligence and can be applied in many fields, such as data mining, computer vision, natural language processing, biometric recognition, search engines, medical diagnosis, credit card fraud detection, stock market analysis and DNA sequencing. A machine learning algorithm automatically derives rules from known data and uses those rules to make predictions about unknown data.
Mass spectrum data are obtained by ionizing a sample with a dedicated instrument to generate charged ions with different mass-to-charge ratios, and then separating those ions in space or time by means of an applied electric field. The ions separated by the mass analyzer are detected and recorded, and the recorded signals are processed by a computer to generate a mass spectrogram.
In biology, chemistry and medicine, body fluid samples often need to be classified according to their composition. Most technicians analyze and compare the samples one by one. This approach has the advantage that the composition of each sample is clearly established and the classification is accurate; its disadvantage is that, when many types of body fluid samples must be classified, it consumes a great deal of time and resources and incurs high labor costs. How to infer the category of a new body fluid sample from body fluid samples of known category is therefore an important research topic.
For example, in the medical field, the body fluids of patients with certain diseases contain the same specific components, which may be either the cause or a manifestation of the disease. Clinically, if a certain type of component is present in a patient's body fluid, the patient can be associated with a certain disease or type of disease, which provides data support for clinical diagnosis. Because the human body is a very complex organism, both the diagnosis of a disease and the choice of treatment require professional medical staff to make judgments from a large amount of data on each individual, so diagnostic efficiency is low and labor costs are high. When many patients need to be examined, they must queue for a long time, doctors become fatigued from working continuously, the time available for each patient is short, and misdiagnosis becomes more likely. Clinical medicine therefore needs a medical device that can analyze a large number of body fluid samples simultaneously and that, based on a large number of body fluid samples from known healthy people and patients, can quickly detect whether a large number of unknown samples contain certain specific components, thereby helping medical staff to diagnose more conveniently and accurately.
Disclosure of Invention
The object of the invention is to provide a mass spectrum data analysis method that solves the technical problems of the prior art, namely that classifying a large number of body fluid samples consumes a large amount of time and resources and incurs high labor costs.
In order to solve the technical problem, the invention provides a mass spectrum data analysis method, which comprises the following steps: a sample data acquisition step, which is used for acquiring mass spectrum data of more than two body fluid samples and generating a mass spectrogram according to the mass spectrum data; the body fluid sample comprises more than two training samples and at least one testing sample; the training samples are divided into more than two groups, and the training samples in the same group are marked with the same group label; a sample data preprocessing step, namely preprocessing at least one group of mass spectrum data, and performing coordinate transformation processing on the mass spectrum to obtain standardized mass spectrum data of the training sample and the test sample; a data model building and cross validation step, which is used for building a primary data model by using the standardized mass spectrum data of the training sample and the group label of the training sample, and performing at least one time of cross validation processing on the primary data model according to the standardized mass spectrum data of the training sample; a data model optimization step, which is used for constructing an optimized data model according to the result of the cross validation; and a sample group judgment step for acquiring a group label of the test sample by using the standardized mass spectrum data of the test sample and the optimized data model.
Further, the sample data acquisition step specifically includes the following steps: obtaining more than two body fluid samples; arranging all the body fluid samples in a matrix on a plate; acquiring mass spectrum data of the body fluid sample by using mass spectrometry and generating a mass spectrogram; at least one set of mass spectral data is acquired for each body fluid sample.
Further, the test sample is located in the middle of the flat plate, and the training sample surrounds the test sample; the flat plate includes, but is not limited to, a base metal plate; the group labels of any two adjacent training samples are different; the distance between any two adjacent body fluid samples is greater than or equal to 2mm and less than or equal to 5mm.
Furthermore, each group of mass spectrum data comprises a mass-to-charge ratio value of an ion in the body fluid sample and a signal measured intensity value corresponding to the ion; each group of mass spectrum data corresponds to a sampling point in the mass spectrogram; the abscissa of each sample point represents the mass-to-charge ratio of an ion, and the ordinate represents the measured signal intensity value corresponding to the ion.
Further, the sample data preprocessing step specifically includes the following steps: a baseline correction step, which is used for performing baseline correction processing on the mass spectrum data in the mass spectrum; a resampling step, which is used for resampling the ion mass-to-charge ratio value in the mass spectrum data after the baseline correction by using a resampling algorithm, carrying out abscissa transformation on the mass spectrum, unifying the mass-to-charge ratios of all mass spectrum data, and obtaining the resampled mass spectrum data; and a standardization step, namely standardizing the ion signal intensity value in the re-sampling mass spectrum data, and carrying out ordinate transformation on the mass spectrum to obtain standardized mass spectrum data.
Further, the baseline correction step specifically includes the steps of: a signal calculation step, which is used for calculating the baseline signal intensity corresponding to at least one mass-to-charge ratio value in a group of mass spectrum data by using a window function; a signal correction step for correcting the measured signal intensity corresponding to the mass-to-charge ratio based on the baseline signal intensity; and repeating the signal calculation step and the signal correction step to finish the correction of each group of mass spectrum data of each body fluid sample in turn.
Further, the resampling step specifically includes the following steps: an effective mass-to-charge ratio selection step for selecting an effective mass-to-charge ratio interval and an effective mass-to-charge ratio quantity; an effective mass-to-charge ratio calculating step, which is used for calculating the mass-to-charge ratio of the re-sampled mass spectrum data by utilizing a re-sampling algorithm; and an interpolation processing step, which is used for carrying out interpolation processing on the mass spectrogram after baseline correction by utilizing the mass-to-charge ratio and the mass-to-charge ratio number after resampling, and converting the abscissa of the mass spectrogram after baseline correction into the mass-to-charge ratio number from the mass-to-charge ratio numerical value.
Further, the resampling algorithm is as follows: the mass-to-charge ratio interval of the effective mass spectrum data after resampling is set to $[y_1, y_2]$ and the number of mass-to-charge ratio coordinates after resampling is $N$; the resampled mass-to-charge ratio coordinates are calculated by the following formula: [formula not reproduced in the source text]
Further, the step of normalizing specifically comprises the steps of: a signal intensity absolute value sum calculating step, which is used for calculating the absolute value sum S of the ion signal intensity values in all the re-sampling mass spectrum data; a normalized signal intensity value sum setting step for setting the sum of absolute values of ion signal intensity values in all the resampled mass spectrum data after normalization processing to be a constant T; a signal intensity value change multiple calculation step for calculating a change multiple T/S of each signal intensity value; and a signal intensity value changing step, which is used for synchronously amplifying or synchronously reducing each ion signal intensity value in the resampled mass spectrum data.
Further, the data model construction and cross validation step specifically comprises the following steps: selecting any one training sample as a standard training sample, the group label of which is known; setting a circular area on the flat plate, with the position of the standard training sample as the center and a specific length r as the radius; constructing a matrix D from the standardized mass spectrum data of the training samples in the circular area other than the standard training sample, wherein each row of the matrix D corresponds to one group of standardized mass spectrum data of one training sample; obtaining a label vector from the group labels of the training samples in the circular area other than the standard training sample, wherein the group label of each training sample is recorded in the label vector; establishing a primary data model by using a sparse learning optimization algorithm; multiplying more than two groups of mass spectrum data of the standard training sample by the data model, sorting the products by value, and rounding the median to obtain a presumed group label of the standard training sample; comparing the presumed group label of the standard training sample with its known group label and, if the two are the same, judging that the presumption is correct and incrementing an accuracy counter by one; taking each training sample in turn as the standard training sample and repeating the above steps, so that cross validation is performed over all the training samples, and calculating the group label judgment accuracy of the training samples for the radius r, the accuracy being the ratio of the value of the accuracy counter to the total number of training samples; adjusting the radius r, repeating the above steps, and calculating the group label judgment accuracy for each different radius r; and selecting the maximum accuracy from the group label judgment accuracies so obtained, and taking the value of the radius r corresponding to the maximum accuracy as the optimal radius value R.
Further, the data model optimization step specifically comprises the following steps: setting a circular area on the flat plate, with the position of a test sample as the center and the optimal radius value R as the radius; constructing a matrix $D_W$ from the standardized mass spectrum data of all the training samples in the circular area, wherein each column of the matrix $D_W$ corresponds to one group of standardized mass spectrum data of one training sample; obtaining a label vector from the group labels of all the training samples in the circular area, wherein the group label of each training sample is recorded in the label vector as a natural number; and establishing an optimized data model by using a sparse learning optimization algorithm.
Further, the sample group judgment step specifically comprises the following steps: multiplying one group of mass spectrum data of a test sample by the data model and rounding the product to obtain the group label of the test sample; or multiplying each of more than two groups of mass spectrum data of a test sample by the data model, sorting the products by value, and rounding the median to obtain the group label of the test sample.
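The judgment rule described above can be sketched as follows. This is only an illustrative sketch: the function name `predict_group_label`, the representation of the data model as a plain coefficient list, and the round-half-up convention are assumptions, not details given in the text.

```python
import math
from statistics import median


def predict_group_label(spectra, model):
    """Multiply each standardized spectrum by the model and round the median product.

    spectra: list of spectra, each a list of standardized intensity values.
    model:   list of coefficients, one per mass-to-charge ratio number (assumed form).
    """
    # One product (dot product) per group of mass spectrum data.
    products = [sum(x * w for x, w in zip(spectrum, model)) for spectrum in spectra]
    # statistics.median sorts the products and takes the middle value.
    # Rounding half up is an assumption; the text only says "rounding".
    return math.floor(median(products) + 0.5)
```

With a single group of data the median is simply that one product, so the same function covers both alternatives of the step.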
The invention has the following advantages: a classifier model can be constructed from body fluid samples whose groups are known; the data model with the highest accuracy is obtained through repeated cross validation over a plurality of training samples; and the mass spectrum data of a large number of body fluid samples can be processed simultaneously, so that the body fluid samples are grouped according to their composition.
Drawings
FIG. 1 is a flow chart of a method of mass spectrometry data analysis according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method of sample data collection steps according to an embodiment of the present invention;
FIG. 3 is a mass spectrum generated from mass spectrometry data prior to preprocessing according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method of sample data preprocessing step according to an embodiment of the present invention;
FIG. 5 is a flowchart of a method of the sample data baseline correction step according to an embodiment of the present invention;
FIG. 6 is a mass spectrum generated after baseline correction of mass spectral data according to an embodiment of the present invention;
FIG. 7 is a flowchart of a method of resampling processing steps on mass spectrometry data, in accordance with an embodiment of the present invention;
FIG. 8 is a schematic diagram of effective mass-to-charge ratios in resampled mass spectra data according to an embodiment of the invention;
FIG. 9 is a mass spectrum generated by resampling mass spectral data according to an embodiment of the invention;
FIG. 10 is a flowchart of a method of normalizing mass spectral data according to one embodiment of the invention;
FIG. 11 is a mass spectrum generated by normalizing mass spectral data according to the present embodiment;
FIG. 12 is a flowchart of a method for the data model building and cross validation steps according to an embodiment of the present invention;
FIG. 13 is a flowchart of a method for optimizing a data model according to an embodiment of the present invention.
Detailed Description
An embodiment of the present invention is provided below, with reference to the accompanying drawings of the specification, to demonstrate that the invention can be practiced.
As shown in fig. 1, the present embodiment provides a method for analyzing mass spectrum data, which includes the following steps S1) to S5).
Step S1), a sample data acquisition step, for acquiring at least one group of mass spectrum data from each of more than two body fluid samples and generating a mass spectrogram from the mass spectrum data. The body fluid samples comprise more than two training samples and at least one test sample; the training samples are divided into more than two groups (also referred to as classes), and training samples in the same group are marked with the same group label. The body fluid sample may be a body fluid from a human body or another organism. In this embodiment, human blood samples are preferred and the group labels are 0 and 1: group 0 samples come from patients with a given disease (such as diabetes or hemophilia), and group 1 samples come from healthy people without that disease. A training sample is a blood sample whose group label is known, and each such blood sample is labeled 0 or 1. In other embodiments, the group labels may be other natural numbers.
As shown in fig. 2, step S1) specifically includes the following steps. Step S101) obtaining more than two body fluid samples; typically tens or hundreds of samples may be selected. Step S102) arranging all the body fluid samples, in the form of droplets, in a matrix on a flat plate (preferably a matrix metal plate), with the test sample in the middle of the flat plate and the training samples surrounding the test sample; the group labels of any two adjacent training samples are different; the distance between any two adjacent body fluid samples is greater than or equal to 2 mm and less than or equal to 5 mm; the flat plate includes, but is not limited to, a base metal plate. Step S103) acquiring mass spectrum data of the body fluid samples by mass spectrometry and generating a mass spectrogram, as shown in fig. 3. At least one group of mass spectrum data is acquired from each body fluid sample, and preferably more than three groups are acquired: performing pattern classification on several groups of data from the same sample effectively reduces the interference caused by errors in any single group of data and so improves accuracy. Each group of mass spectrum data comprises the mass-to-charge ratio value of an ion in the body fluid sample and the measured signal intensity value corresponding to that ion; for each sampling point in the mass spectrogram, the abscissa represents the mass-to-charge ratio of an ion and the ordinate represents the measured signal intensity corresponding to the ion, as shown in fig. 3.
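The plate arrangement of step S102) can be illustrated with a small sketch. The text does not say how the layout is produced, so everything here is an assumption for illustration: the function name `plate_layout`, the checkerboard pattern used to make adjacent training labels differ (which works only for the two labels 0 and 1 of this embodiment), and the default 3 mm pitch.

```python
def plate_layout(rows, cols, pitch_mm=3.0):
    """Return a dict mapping (row, col) to position, role and group label.

    The test sample sits at the central grid position and training samples
    surround it. A checkerboard of labels 0/1 makes horizontally and
    vertically adjacent training samples carry different labels.
    """
    if not 2.0 <= pitch_mm <= 5.0:
        raise ValueError("spacing between adjacent samples must be 2 mm to 5 mm")
    centre = (rows // 2, cols // 2)
    layout = {}
    for r in range(rows):
        for c in range(cols):
            is_test = (r, c) == centre
            layout[(r, c)] = {
                "xy_mm": (c * pitch_mm, r * pitch_mm),  # droplet position on the plate
                "role": "test" if is_test else "train",
                "label": None if is_test else (r + c) % 2,  # test label is unknown
            }
    return layout
```

In practice the labels are properties of the samples themselves, so a real workflow would place known samples onto positions that satisfy the adjacency rule rather than assign labels to positions.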
Step S2) sample data preprocessing step, which is used for preprocessing at least one group of mass spectrum data and performing coordinate transformation processing on the mass spectrum to obtain the standardized mass spectrum data of the training sample. Due to factors such as sample handling, instrument performance, external contamination, etc., mass spectral data directly obtained by the mass spectrometer need to be properly preprocessed to improve the accuracy of the grouping.
As shown in fig. 4, step S2) specifically comprises steps S201) to S203). The three processing steps applied to the mass spectrum data in the mass spectrogram, namely baseline correction, resampling and standardization, limit the influence of external factors on the grouping accuracy.
Step S201) a baseline correction step, for performing baseline correction on the mass spectrum data in the mass spectrogram. The baseline is the underlying intensity level in the mass spectrum data; the baseline correction step identifies and removes a baseline with a large deviation in the mass spectrogram, and removes data with large deviations from the mass spectrum data. As shown in fig. 5, the baseline correction step S201) specifically comprises: step S2011) a signal calculation step, for calculating, by using a window function, the baseline signal intensity corresponding to at least one mass-to-charge ratio value in a group of mass spectrum data; step S2012) a signal correction step, for correcting the measured signal intensity corresponding to the mass-to-charge ratio according to the baseline signal intensity, and screening out invalid data with large deviations. Steps S2011) to S2012) are repeated until every group of mass spectrum data of every body fluid sample has been corrected in turn. When signal processing is implemented on a computer, an infinitely long signal cannot be measured or operated on; instead, a finite time segment is cut out of the signal and analyzed. The truncated segment is then extended periodically to obtain a virtual infinitely long signal, on which mathematical operations such as Fourier transform and correlation analysis can be performed. In a specific application the signal can be truncated with different truncation functions, which are called window functions. In this embodiment, the window function parameters STEP and WINDOW are both set to 50. After the baseline correction is completed, the baseline-corrected mass spectrogram is obtained, as shown in fig. 6; its abscissa represents the mass-to-charge ratio of an ion, and its ordinate represents the measured signal intensity value corresponding to the ion.
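A windowed baseline correction of the kind described in steps S2011) and S2012) might look like the sketch below. The text names only the STEP and WINDOW parameters, so several choices here are assumptions: both parameters are interpreted as counts of sampling points, the baseline inside each window is taken as the minimum intensity, the baseline between window centres is linearly interpolated, and corrected intensities are clipped at zero. The function names are hypothetical.

```python
from bisect import bisect_right


def _interp(x, xs, ys):
    """Piecewise-linear interpolation with constant extension at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    k = bisect_right(xs, x) - 1
    t = (x - xs[k]) / (xs[k + 1] - xs[k])
    return ys[k] + t * (ys[k + 1] - ys[k])


def estimate_baseline(intensity, window=50, step=50):
    """Signal calculation step: estimate a baseline value for every sampling point."""
    centres, minima = [], []
    for start in range(0, len(intensity), step):
        chunk = intensity[start:start + window]
        centres.append(start + (len(chunk) - 1) / 2)  # centre index of this window
        minima.append(min(chunk))                      # assumed baseline estimator
    return [_interp(i, centres, minima) for i in range(len(intensity))]


def correct_baseline(intensity, window=50, step=50):
    """Signal correction step: subtract the baseline, clipping negatives to zero."""
    baseline = estimate_baseline(intensity, window, step)
    return [max(y - b, 0.0) for y, b in zip(intensity, baseline)]
```

Calling `correct_baseline(spectrum_intensities)` with the defaults corresponds to the embodiment's setting of 50 for both parameters under the interpretation assumed above.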
Step S202), a resampling step, namely resampling the ion mass-to-charge ratio value in the mass spectrum data after baseline correction by using a resampling algorithm, carrying out abscissa transformation on the mass spectrum, unifying the mass-to-charge ratios of all mass spectrum data, removing mass spectrum data with large deviation, and obtaining resampled mass spectrum data.
As shown in fig. 7, the resampling step S202) specifically comprises the following steps S2021) to S2023). S2021) an effective mass-to-charge ratio selection step, for selecting an effective mass-to-charge ratio interval and an effective mass-to-charge ratio number, and constructing a schematic diagram of the effective mass-to-charge ratios in the resampled data, in which the abscissa represents the effective mass-to-charge ratio number retained after resampling and the ordinate represents the mass-to-charge ratio value corresponding to that number. S2022) an effective mass-to-charge ratio calculation step, for calculating the mass-to-charge ratios of the resampled mass spectrum data by using a resampling algorithm. The resampling algorithm is as follows: the mass-to-charge ratio interval of the effective mass spectrum data after resampling is set to $[y_1, y_2]$ and the number of mass-to-charge ratio coordinates after resampling is $N$; the resampled mass-to-charge ratio coordinates are calculated by the following formula: [formula not reproduced in the source text]
Here $N$ is greater than $10^4$ and less than $10^5$, which balances the accuracy of the algorithm against the calculation speed. S2023) an interpolation processing step, for performing interpolation processing on the baseline-corrected mass spectrogram by using the resampled mass-to-charge ratios and mass-to-charge ratio numbers, and converting the abscissa of the baseline-corrected mass spectrogram from the mass-to-charge ratio value to the mass-to-charge ratio number. In this embodiment, the mass-to-charge ratios of the resampled mass spectrum data are all distributed in the mass-to-charge ratio interval from 98.9 to 1003.1, 10000 groups of effective mass spectrum data are retained, and the resampled mass-to-charge ratio coordinates are calculated by the following formula: [formula not reproduced in the source text] Corresponding to the effective mass spectrum data, there are 10000 mass-to-charge ratios. Fig. 8 is a schematic diagram of the effective mass-to-charge ratios in the resampled data of this embodiment; its abscissa represents the effective mass-to-charge ratio number retained after resampling, and its ordinate represents the mass-to-charge ratio value corresponding to that number.
During the interpolation processing, redundant mass spectrum data in the baseline-corrected mass spectrogram (shown in fig. 6) are removed and only the effective mass spectrum data are retained for resampling. The abscissa of the baseline-corrected mass spectrogram is converted from the mass-to-charge ratio value to the mass-to-charge ratio number, and the ordinate is unchanged; this completes the resampling of each group of original mass spectrum data. Fig. 9 shows the mass spectrogram of the resampled mass spectrum data of this embodiment, in which the abscissa represents the effective mass-to-charge ratio number after resampling and the ordinate represents the measured ion signal intensity value corresponding to that number. After the resampling step, intervals of the mass spectrogram with relatively small mass-to-charge ratios contain more sampling values and intervals with relatively large mass-to-charge ratios contain fewer, which corresponds to the assumption that intervals with small mass-to-charge ratios carry more information than intervals with large mass-to-charge ratios.
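The resampling and interpolation steps can be sketched as follows. The formula for the resampled coordinates does not appear in the text, so the grid used here is a stand-in chosen only because it shares the stated property of placing more points at small mass-to-charge ratios: a geometric spacing between $y_1$ and $y_2$. That choice, the linear interpolation and all function names are assumptions.

```python
from bisect import bisect_right


def resample_grid(y1, y2, n):
    """Hypothetical grid of n coordinates on [y1, y2], denser at low m/z.

    Geometric spacing is an assumption standing in for the formula
    that is not available in the text.
    """
    ratio = y2 / y1
    return [y1 * ratio ** (i / (n - 1)) for i in range(n)]


def interpolate(x, xs, ys):
    """Piecewise-linear interpolation with constant extension at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    k = bisect_right(xs, x) - 1
    t = (x - xs[k]) / (xs[k + 1] - xs[k])
    return ys[k] + t * (ys[k + 1] - ys[k])


def resample(mz, intensity, y1, y2, n):
    """Interpolate a baseline-corrected spectrum onto the common grid.

    mz must be sorted in increasing order. The returned intensities are
    indexed by mass-to-charge ratio number (0 .. n-1), so every spectrum
    shares the same abscissa afterwards.
    """
    grid = resample_grid(y1, y2, n)
    return grid, [interpolate(g, mz, intensity) for g in grid]
```

For the embodiment's settings one would call `resample(mz, intensity, 98.9, 1003.1, 10000)`; because every spectrum is mapped onto the same grid, the resulting intensity lists can be stacked directly into a data matrix.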
Step S203), a standardization step, which is used for carrying out standardization processing on the ion signal intensity value in the resampling mass spectrum data and carrying out longitudinal coordinate transformation on the mass spectrum to obtain standardized mass spectrum data. As shown in fig. 10, step S203) the normalization step specifically includes the following steps S2031) to S2034). Step S2031) a signal intensity absolute value sum calculating step for calculating the sum S of the absolute values of the ion signal intensity values in all the resampled mass spectrum data; step S2032) a normalized signal intensity value sum setting step of setting a sum of absolute values of ion signal intensity values in all the resampled mass spectrum data after the normalization processing to a constant T, which is set to 10000 in this embodiment; step S2033) a signal intensity value change multiple calculation step for calculating a change multiple T/S of each signal intensity value; step S2034) a signal intensity value changing step, which is used for synchronously amplifying or reducing each ion signal intensity value in the resampled mass spectrum data, and carrying out ordinate transformation on the mass spectrum, wherein the change multiple of the signal intensity value is T/S in the step S2033). Fig. 11 shows a mass spectrum of the normalized mass spectrum data of the present embodiment, in which the abscissa represents the effective mass-to-charge ratio number after resampling, and the ordinate represents the ion signal normalized intensity value corresponding to the mass-to-charge ratio number. The technical effect of the normalization step is that the intensities of the mass spectrum data are mapped to a uniform range, so that the distribution range of the intensities of each group of mass spectrum data is basically consistent, and the comparability of the mass spectrum data of different samples is enhanced.
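Steps S2031) to S2034) amount to a single rescaling, sketched below. The function name is hypothetical, and the sketch assumes that the sum S is taken over one resampled spectrum at a time, which the text does not state explicitly.

```python
def normalize_intensities(intensities, total=10000.0):
    """Scale a resampled spectrum so the sum of absolute intensities equals `total`.

    total corresponds to the constant T (10000 in the embodiment).
    """
    s = sum(abs(v) for v in intensities)       # S2031: sum of absolute values, S
    if s == 0:
        raise ValueError("spectrum has no signal; T/S is undefined")
    factor = total / s                         # S2033: change multiple T/S
    return [v * factor for v in intensities]   # S2034: scale every value together
```

For example, a spectrum with intensities 1, 3 and 4 has S = 8; with T = 16 every value is doubled, giving 2, 6 and 8.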
S3) A data model building and cross validation step: a primary data model is built from the mass spectrum data and the group labels of the training samples, and the primary data model is cross-validated n times (n being the number of training samples) against the mass spectrum data of the training samples; that is, machine learning and model building are carried out using the mass spectrum data and group labels of the known training samples. As shown in fig. 12, step S3) specifically includes the following steps:
step S301) selecting a training sample as a standard training sample, wherein the group label of the standard training sample is known;
step S302) setting a circular area on the flat plate by taking the position of the standard training sample as the circle center and a specific length r as the radius;
step S303) constructing a matrix D according to the standardized mass spectrum data of the training samples, other than the standard training sample, in the circular area, wherein each row of data in the matrix D corresponds to a group of standardized mass spectrum data of one training sample;
step S304) obtaining a label vector according to the group labels of the training samples, other than the standard training sample, in the circular area, the group label of each training sample being recorded in the label vector;
step S305) establishing a primary data model using a sparse learning optimization algorithm;
step S306) multiplying more than two groups of mass spectrum data of the standard training sample by the data model, sorting the products by value, and rounding the median of the products to obtain a presumed group label of the standard training sample; since the group labels in the invention are all integers (0 or 1), the rounding processing rounds off the digits after the decimal point to obtain an integer;
step S307) comparing the presumed group label of the standard training sample with its known group label; if the two are the same, the group label presumption for the standard training sample is judged to be correct and an accuracy counter is incremented by one;
step S308) taking each training sample in turn as the standard training sample and repeating steps S301) to S307), so that all the training samples are cross-validated, and calculating the group label judgment accuracy of the training samples for the radius r, the group label judgment accuracy being the ratio of the value of the accuracy counter to the total number of training samples;
step S309) adjusting the radius r and repeating steps S301) to S308) to calculate a plurality of group label judgment accuracy rates for different values of the radius r;
step S310) selecting the maximum accuracy rate from the plurality of group label judgment accuracy rates and obtaining the value of the radius r corresponding to the maximum accuracy rate, namely the radius optimal value R.
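Steps S301) to S308) amount to a leave-one-out cross validation restricted to a circular neighbourhood on the plate. The following is a minimal sketch under stated assumptions, not the patented implementation: the function names, the data layout (a list of spectra per sample) and the round-half-up rule are illustrative choices, and `per_feature_fit` is a hypothetical stand-in for the sparse learning solver, which the text does not spell out.

```python
import math
from statistics import median

def neighbours(positions, center_idx, r):
    """Indices of samples within distance r of the standard sample, excluding it."""
    cx, cy = positions[center_idx]
    return [i for i, (x, y) in enumerate(positions)
            if i != center_idx and math.hypot(x - cx, y - cy) <= r]

def per_feature_fit(D, labels):
    """Hypothetical stand-in for the sparse solver: one univariate
    least-squares coefficient per feature (0.0 if the feature is all zero)."""
    weights = []
    for k in range(len(D[0])):
        num = sum(row[k] * lab for row, lab in zip(D, labels))
        den = sum(row[k] * row[k] for row in D)
        weights.append(num / den if den else 0.0)
    return weights

def loo_accuracy(positions, spectra, labels, r, fit):
    """Group label judgment accuracy for radius r (steps S301 to S308).
    spectra[i] holds one or more normalized spectra of sample i."""
    correct = 0
    for i in range(len(positions)):
        idx = neighbours(positions, i, r)
        if not idx:
            continue  # no neighbours in the circle: counted as a miss
        D = [s for j in idx for s in spectra[j]]            # matrix D (S303)
        lab = [labels[j] for j in idx for _ in spectra[j]]  # label vector (S304)
        model = fit(D, lab)                                 # data model (S305)
        products = [sum(a * b for a, b in zip(s, model)) for s in spectra[i]]
        guess = math.floor(median(products) + 0.5)          # assumed round-half-up (S306)
        if guess == labels[i]:                              # accuracy counter (S307)
            correct += 1
    return correct / len(positions)
```

Calling `loo_accuracy` once per candidate radius and keeping the radius with the highest return value corresponds to steps S309) and S310).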
A machine learning algorithm automatically derives rules from data and uses those rules to make predictions on unknown data. The method uses the Lasso regression algorithm from machine learning to analyze the mass spectrum data and to learn and construct a model, in two phases: training and testing. The basic idea of the Lasso algorithm is to minimize the residual sum of squares subject to the constraint that the sum of the absolute values of the regression coefficients is less than a constant; this constraint drives some regression coefficients to exactly 0 and thereby yields an interpretable model.
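To make the "coefficients exactly equal to 0" property concrete, here is a small coordinate-descent sketch of Lasso in plain Python. It uses the penalized form, minimizing (1/2)·‖y − Xw‖² + λ·‖w‖₁, which is the standard Lagrangian counterpart of the constrained form described above. The function names, the penalty value and the iteration count are illustrative assumptions; the patent does not state which solver or parameters were used.

```python
def soft_threshold(rho, lam):
    """Shrink rho towards zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for 0.5 * ||y - Xw||^2 + lam * ||w||_1."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    resid = list(y)  # residual y - Xw, with w starting at zero
    for _ in range(n_iter):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            z = sum(c * c for c in col)
            if z == 0.0:
                continue  # an all-zero feature keeps a zero coefficient
            # Correlation of feature j with the residual that ignores feature j
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, resid))
            new_wj = soft_threshold(rho, lam) / z
            delta = new_wj - w[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
                w[j] = new_wj
    return w
```

In a toy problem where one feature barely correlates with the labels, its coefficient is set to exactly zero rather than merely becoming small, which is what makes the resulting model sparse and easier to interpret.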
In this embodiment, cross validation is performed n times (n being the number of training samples) using the Lasso algorithm, and the model obtained in each round of cross validation is combined with 11 intensity thresholds (0, 0.1, …, 1) to obtain 11 group label judgment accuracy rates; repeating this n times yields n × 11 data models (grouping devices), each corresponding to one group label judgment accuracy rate. The radius r is then adjusted over the values 2.0 mm, 2.2 mm, 2.4 mm, …, 4.8 mm and 5.0 mm (16 values in total), giving n × 11 × 16 group label judgment accuracy rates; these are sorted by value, the maximum accuracy rate is found, and the radius corresponding to the maximum accuracy rate is the radius optimal value R.
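The size of the search grid in this embodiment can be checked with a short sketch. The helper names are hypothetical, and the accuracy table is assumed to be filled in by the cross validation itself; only the radius values, the threshold values and the "pick the maximum" rule come from the text.

```python
def radius_grid(start=2.0, stop=5.0, step=0.2):
    """Candidate radii in mm: 2.0, 2.2, ..., 5.0."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + k * step, 1) for k in range(count)]

def threshold_grid():
    """The 11 intensity thresholds 0, 0.1, ..., 1."""
    return [round(k / 10, 1) for k in range(11)]

def best_setting(accuracy):
    """accuracy maps (radius, threshold) to a group label judgment accuracy.
    Returns the setting with the highest accuracy together with that accuracy."""
    key = max(accuracy, key=accuracy.get)
    return key, accuracy[key]
```

With 16 radii and 11 thresholds there are 176 settings per cross validation round, which matches the n × 11 × 16 count given above.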
S4) A data model optimization step, in which an optimized data model is constructed according to the cross validation result. As shown in fig. 13, step S4) specifically includes the following steps: step S401) setting a circular area on the flat plate by taking the position of a test sample as the circle center and the radius optimal value R obtained in step S310) as the radius; step S402) constructing a matrix D_W according to the standardized mass spectrum data of all the training samples in the circular area, wherein each column of data in the matrix D_W corresponds to a group of standardized mass spectrum data of one training sample; step S403) obtaining a label vector according to the group labels of all the training samples in the circular area, the group label of each training sample being recorded in integer form in the label vector at the position corresponding to that training sample; step S404) establishing an optimized data model, a sparse learning optimization algorithm being used in establishing the optimized data model.
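Steps S401) to S403) can be sketched as a simple selection around the test sample. This is an illustration under assumptions: the function name and the data layout are hypothetical, and the spectra are stored as a list (one entry per spectrum), so whether they form rows or columns of D_W is only a storage convention here.

```python
import math

def build_local_training_set(test_pos, train_positions, train_spectra, train_labels, R):
    """Collect the spectra and integer labels of all training samples lying
    within distance R of the test sample (steps S401 to S403)."""
    D_w, label_vector = [], []
    for pos, spectra, label in zip(train_positions, train_spectra, train_labels):
        distance = math.hypot(pos[0] - test_pos[0], pos[1] - test_pos[1])
        if distance <= R:
            for spectrum in spectra:
                D_w.append(list(spectrum))
                label_vector.append(int(label))  # one label entry per spectrum
    return D_w, label_vector
```

The pair returned by this function is what a sparse learning solver would take as input in step S404).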
S5) A sample group judgment step, in which a group label of the test sample is acquired using the mass spectrum data of the test sample and the optimized data model. In step S5), one group of mass spectrum data of a test sample is multiplied by the data model, and the product is rounded to obtain the group label of the test sample; alternatively, more than two groups of mass spectrum data of a test sample are multiplied by the data model, the products are sorted by value, and the median is rounded to obtain the group label of the test sample. In this embodiment, if the result of rounding is 0, the person corresponding to the test sample can be considered to have a mass spectrum data pattern associated with a certain disease; if the result of rounding is 1, the person corresponding to the test sample can be considered not to have a mass spectrum data pattern associated with the disease. In either case the result assists the physician in making a diagnosis.
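The judgment rule of step S5) is short enough to state as code. The sketch below assumes that the data model is a weight vector, that "multiplying" means a dot product, and that rounding is round-half-up; the function name is hypothetical.

```python
import math
from statistics import median

def predict_group(spectra, model):
    """Group label of a test sample from one or more of its spectra.
    Each spectrum is multiplied by the data model (dot product); the
    median of the products is rounded to the nearest integer."""
    products = sorted(sum(x * w for x, w in zip(s, model)) for s in spectra)
    return math.floor(median(products) + 0.5)

# In the embodiment, 0 means the disease-associated pattern is present
# and 1 means it is absent.
```

Taking the median rather than the mean makes the label robust to a single outlying spectrum among the repeated measurements of one sample.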
The invention provides a mass spectrum data analysis method that constructs a grouping device model from body fluid samples of known group, obtains the data model with the highest accuracy through repeated cross validation over a plurality of training samples, and can process the mass spectrum data of a large number of body fluid samples simultaneously and group the body fluid samples according to their composition. In clinical medicine, the technical scheme of the invention can be applied to intelligent computer-aided disease diagnosis: multiple groups of blood samples from multiple persons can be tested simultaneously by computer, so that whether those persons have a mass spectrum data pattern associated with a certain disease can be judged in a short time, assisting the doctor in reaching a rapid diagnosis.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A method of mass spectrometry data analysis, comprising the steps of:
a sample data acquisition step, which is used for acquiring mass spectrum data of more than two body fluid samples and generating a mass spectrogram according to the mass spectrum data; the body fluid sample comprises more than two training samples and at least one testing sample; the training samples are divided into more than two groups, and the training samples in the same group are marked with the same group labels;
a sample data preprocessing step, namely preprocessing at least one group of mass spectrum data, and performing coordinate transformation processing on the mass spectrum to obtain standardized mass spectrum data of the training sample and the test sample;
a data model building and cross validation step, which is used for building a primary data model by using the standardized mass spectrum data of the training sample and the group label of the training sample, and performing at least one time of cross validation processing on the primary data model according to the standardized mass spectrum data of the training sample;
a data model optimization step, which is used for constructing an optimized data model according to the result of the cross validation; and
a sample group judgment step, which is used for acquiring a group label of the test sample by using the standardized mass spectrum data of the test sample and the optimized data model;
the sample data acquisition step specifically comprises the following steps:
obtaining more than two body fluid samples;
arranging all the body fluid samples in a matrix on a plate; and
collecting mass spectrum data of the body fluid sample by using a mass spectrometry method and generating a mass spectrogram; acquiring at least one set of mass spectral data for each body fluid sample;
the data model building and cross validation steps specifically comprise the following steps:
selecting any one training sample as a standard training sample, wherein the group label of the standard training sample is known;
setting a circular area on the flat plate by taking the position of the standard training sample as a circle center and a specific length r as a radius;
constructing a matrix D according to the standardized mass spectrum data of other training samples except the standard training sample in the circular area, wherein each row of data in the matrix D corresponds to a group of standardized mass spectrum data of one training sample;
obtaining a vector according to the group labels of other training samples except the standard training sample in the circular area, the group label of each training sample being recorded in the vector;
establishing a primary data model using a sparse learning optimization algorithm;
multiplying more than two groups of standardized mass spectrum data of the standard training sample by the data model, arranging the products into a number array according to the numerical value, and carrying out rounding processing on the median value to obtain a presumed group label of the standard training sample;
comparing the guess group label of the standard training sample with the group label thereof, if the guess group label of the standard training sample is the same as the group label of the standard training sample, judging that the group label guess of the standard training sample is correct, and adding one to the accuracy counter;
sequentially taking each training sample as a standard training sample, repeating the steps, performing cross validation processing on all the training samples, and calculating the group label judgment accuracy of the training samples under the condition that the radius is r, wherein the group label judgment accuracy is the ratio of the numerical value of the accuracy counter to the total number of the training samples;
adjusting the radius r, repeating the steps, and calculating the judgment accuracy of the group label under the condition that the radius r is different values; and
selecting a maximum accuracy value from the more than two group label judgment accuracy rates, and acquiring the optimal value R of the radius r corresponding to the maximum accuracy value.
2. The method of mass spectrometry data analysis of claim 1,
the test sample is positioned in the middle of the flat plate, and the training sample surrounds the test sample;
the flat plate includes, but is not limited to, a base metal plate;
the group labels of any two adjacent training samples are different;
the distance between any two adjacent body fluid samples is greater than or equal to 2mm and less than or equal to 5mm.
3. The method of mass spectrometry data analysis of claim 1,
each group of mass spectrum data comprises a mass-to-charge ratio value of an ion in the sample and a signal actual measurement intensity value corresponding to the ion;
each group of mass spectrum data corresponds to a sampling point in the mass spectrogram;
the abscissa of each sample point represents the mass-to-charge ratio of an ion, and the ordinate represents the measured signal intensity value corresponding to the ion.
4. The method of mass spectrometry data analysis of claim 1,
the sample data preprocessing step specifically comprises the following steps:
a baseline correction step, which is used for performing baseline correction processing on the mass spectrum data in the mass spectrogram;
a resampling step, which is used for resampling the ion mass-to-charge ratio value in the mass spectrum data after the baseline correction by using a resampling algorithm, carrying out abscissa transformation on the mass spectrum, unifying the mass-to-charge ratios of all mass spectrum data, and obtaining the resampled mass spectrum data; and
a standardization step, which is used for standardizing the ion signal intensity values in the resampled mass spectrum data and carrying out ordinate transformation on the mass spectrum to obtain standardized mass spectrum data.
5. The method of mass spectrometry data analysis of claim 4,
the baseline correction step specifically comprises the following steps:
a signal calculation step, which is used for calculating the baseline signal intensity corresponding to at least one mass-to-charge ratio value in a group of mass spectrum data by using a window function;
a signal correction step for correcting the measured signal intensity corresponding to the mass-to-charge ratio based on the baseline signal intensity; and
repeating the signal calculation step and the signal correction step to complete the correction of each group of mass spectrum data of each body fluid sample in turn.
6. The method of mass spectrometry data analysis of claim 4,
the resampling step specifically comprises the following steps:
an effective mass-to-charge ratio selecting step for selecting an effective mass-to-charge ratio interval and an effective mass-to-charge ratio number;
an effective mass-to-charge ratio calculating step, which is used for calculating the mass-to-charge ratio of the re-sampled mass spectrum data by utilizing a re-sampling algorithm;
and an interpolation processing step, which is used for carrying out interpolation processing on the mass spectrogram after baseline correction by utilizing the mass-to-charge ratio and the mass-to-charge ratio number after resampling, and converting the abscissa of the mass spectrogram after baseline correction into the mass-to-charge ratio number from the mass-to-charge ratio numerical value.
7. The method of mass spectrometry data analysis of claim 6,
the resampling algorithm is as follows:
the mass-to-charge ratio interval of the effective mass spectrum data after resampling is set as $[y_1, y_2]$, and the mass-to-charge ratio coordinate number after resampling is N;
wherein N is greater than $10^4$ and less than $10^5$.
8. The method of mass spectrometry data analysis of claim 4,
the step of normalizing specifically comprises the steps of:
a signal intensity absolute value sum calculating step, which is used for calculating the absolute value sum S of the ion signal intensity values in all the resampled mass spectrum data;
a normalized signal intensity value sum setting step for setting the sum of absolute values of ion signal intensity values in all the resampled mass spectrum data after normalization processing to be a constant T;
a signal intensity value change multiple calculation step for calculating a change multiple T/S of each signal intensity value;
and a signal intensity value changing step, which is used for synchronously amplifying or synchronously reducing each ion signal intensity value in the resampled mass spectrum data.
9. The method of mass spectrometry data analysis of claim 1,
the data model optimization step specifically comprises the following steps:
setting a circular area on the flat plate by taking the position of a test sample as the center of a circle and the length of the optimal radius value R as the radius;
constructing a matrix D_w according to the standardized mass spectrum data of all the training samples in the circular region, each column of data in the matrix D_w corresponding to a group of standardized mass spectrum data of one training sample;
obtaining a vector according to the group labels of all training samples in the circular area, the group label of each training sample being recorded in the form of a natural number in the vector at the position corresponding to that training sample; and
establishing an optimized data model using a sparse learning optimization algorithm.
10. The method of mass spectrometry data analysis of claim 1,
the step of judging the sample group specifically comprises the following steps:
multiplying a group of mass spectrum data of a test sample by the data model, and rounding the product to obtain a group label of the test sample; or
multiplying more than two groups of mass spectrum data of a test sample by the data model, arranging the products into a number array according to the numerical value, and rounding the median value to obtain the group label of the test sample.
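For orientation, the preprocessing chain recited in claims 4 to 8 (baseline correction, resampling, standardization) can be sketched as follows. This is an illustration under assumptions, not the claimed implementation: the rolling-minimum window for the baseline, the uniform grid with linear interpolation for resampling, and all function names are choices made here, since the text does not give the window function or the resampling formula. Only the T/S scaling follows claim 8 directly.

```python
from bisect import bisect_left

def baseline_correct(intensity, half_window):
    """Subtract a baseline estimated as the minimum inside a sliding window
    (assumed window function)."""
    n = len(intensity)
    corrected = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        corrected.append(intensity[i] - min(intensity[lo:hi]))
    return corrected

def resample(mz, intensity, y1, y2, n_points):
    """Linear interpolation onto n_points evenly spaced m/z values in [y1, y2]
    (assumed resampling rule). mz must be strictly increasing. After this step
    the abscissa is the point index rather than the m/z value."""
    step = (y2 - y1) / (n_points - 1)
    out = []
    for k in range(n_points):
        x = y1 + k * step
        j = bisect_left(mz, x)
        if j == 0:
            out.append(intensity[0])
        elif j == len(mz):
            out.append(intensity[-1])
        else:
            t = (x - mz[j - 1]) / (mz[j] - mz[j - 1])
            out.append(intensity[j - 1] + t * (intensity[j] - intensity[j - 1]))
    return out

def normalize(intensity, total=1.0):
    """Scale every intensity by T/S, where S is the sum of absolute
    intensities and T (total) is the target sum."""
    s = sum(abs(v) for v in intensity)
    if s == 0:
        return list(intensity)  # nothing to scale in an all-zero spectrum
    factor = total / s
    return [v * factor for v in intensity]
```

In the claimed method N lies between 10^4 and 10^5; a much smaller number of points is convenient when trying the functions out by hand.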
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710674793.9A CN107818329B (en) | 2017-08-09 | 2017-08-09 | Mass spectrum data analysis method |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN107818329A (en) | 2018-03-20 |
| CN107818329B (en) | 2023-04-18 |
Family
ID=61601540
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201710674793.9A Active CN107818329B (en) | 2017-08-09 | 2017-08-09 | Mass spectrum data analysis method |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN107818329B (en) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109633094B (en) * | 2018-12-28 | 2021-08-03 | 浙江省环境监测中心 | A kind of online monitoring method of odor concentration |
| CN112380758B (en) * | 2020-11-02 | 2021-06-08 | 中煤科工集团重庆研究院有限公司 | A method for constructing a mathematical model of electric field charge of dust particle group |
| CN112418072B (en) * | 2020-11-20 | 2024-08-02 | 上海交通大学 | Data processing method, device, computer equipment and storage medium |
| CN118861619B (en) * | 2024-06-25 | 2025-08-15 | 复旦大学附属华山医院 | Method, system, equipment and storage medium for processing mass spectrum data |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| | PB01 | Publication | |
| | TA01 | Transfer of patent application right | Effective date of registration: 20180829. Address after: 310053, 16, 1601, 5, Binan Road, Changhe street, Binjiang District, Hangzhou, Zhejiang, China, 688. Applicant after: YINAPU (ZHEJIANG) BIOTECHNOLOGY CO.,LTD. Address before: 200030 Dongchuan Road, Minhang District, Shanghai 800. Applicant before: Shanghai Jiao Tong University |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |