
CN114171130A - Core fucose identification method, system, equipment, medium and terminal - Google Patents

Core fucose identification method, system, equipment, medium and terminal

Publication number
CN114171130A
Authority
CN
China
Prior art keywords
core fucose
fucose
data
core
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111235011.4A
Other languages
Chinese (zh)
Other versions
CN114171130B (en)
Inventor
张军英
苏远杰
刘继源
孙士生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202111235011.4A
Publication of CN114171130A
Application granted
Publication of CN114171130B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/20: Identification of molecular entities, parts thereof or of chemical compositions
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01N: INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N30/00: Investigating or analysing materials by separation into components using adsorption, absorption or similar phenomena or using ion-exchange, e.g. chromatography or field flow fractionation
    • G01N30/02: Column chromatography
    • G01N30/62: Detectors specially adapted therefor
    • G01N30/72: Mass spectrometers
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70: Machine learning, data mining or chemometrics

Landscapes

  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Analytical Chemistry (AREA)
  • Immunology (AREA)
  • General Physics & Mathematics (AREA)
  • Biochemistry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The invention belongs to the technical field of core fucose identification and discloses a core fucose identification method, system, equipment, medium and terminal. The method comprises the following steps: introducing characteristic ions; preprocessing data; training a model; calculating a threshold; and identifying core fucose. Because core fucose is absent from mouse tissues from which FUT8 has been removed, the invention learns a representation of non-core fucose data with an autoencoder and treats core fucose identification as an anomaly detection problem. This avoids training data that contain non-core fucose spectra affected by fucose migration or core fucose data with wrong labels. The method is simple to operate, and the training data contain only non-core fucose data. Technically, 10 characteristic ions are introduced, and their abundances are fed to the autoencoder to decide whether a mass spectrum to be identified is a core fucose mass spectrum. The invention can also distinguish core fucose from non-core fucose with fucose migration, and identification is fast.

Description

Core fucose identification method, system, equipment, medium and terminal
Technical Field
The invention belongs to the technical field of core fucose identification, and particularly relates to a core fucose identification method, a system, equipment, a medium and a terminal.
Background
Core fucosylation alters the secondary and tertiary conformations of glycoproteins, thereby playing important roles in tumor progression, immune regulation, and stem cell differentiation. It has been reported that the core fucose level of tumor tissues is increased compared with normal tissues. Immunoglobulins bind to receptors on the surfaces of natural killer cells and macrophages to induce an immune response, and the affinity of antibodies for these receptors is reduced by 98% to 99% after core fucosylation of the N-glycans in the antibodies. Furthermore, many core-fucosylated glycoproteins may serve as important tumor biomarkers; for example, alpha-fetoprotein is an important biomarker for hepatocellular carcinoma.
An autoencoder is a neural network that learns to reconstruct its input data in an unsupervised manner. The algorithm is based mainly on the following concepts:
1. Encoder: maps the input data to a code that characterizes the input data.
2. Decoder: maps the code to a reconstruction of the input data.
3. Reconstruction error: the Euclidean distance between the output data and the input data.
4. Threshold: calculated from the mean and standard deviation of the reconstruction errors of the training data.
The autoencoder algorithm learns a representation of the training data such that the reconstruction error of the training data is minimized.
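The reconstruction error defined above can be sketched in a few lines of Python. This is only an illustration: the function name `reconstruction_error` and the three-component vectors are hypothetical and are not taken from the patent.

```python
import math


def reconstruction_error(x, x_hat):
    """Euclidean distance between an input vector and its reconstruction."""
    if len(x) != len(x_hat):
        raise ValueError("input and reconstruction must have the same length")
    return math.dist(x, x_hat)


# Hypothetical input and decoder output, for illustration only.
x = [0.5, 0.3, 0.2]
x_hat = [0.5, 0.0, 0.6]
err = reconstruction_error(x, x_hat)  # sqrt(0.3**2 + 0.4**2), about 0.5
```

A well-reconstructed input gives an error near zero, while an input unlike the training data tends to give a larger error, which is what the threshold is later compared against.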
Core fucose: a modification in which fucose is attached through an α1,6 linkage to the innermost N-acetylglucosamine of an N-glycan.
Fucose migration phenomenon: the intramolecular migration of a terminal fucose unit to an adjacent or distant monosaccharide after activation (see FIG. 7). Fucose migration often produces misleading fragment ions, because the fucose residue is re-linked to a sterically adjacent or distant monosaccharide, which leads to erroneous mass spectral data.
At present, core fucose identification algorithms based on mass spectrometry mainly comprise the following:
1. Manual spectrum interpretation. The intrinsic characteristic peaks of core fucose are matched manually against the mass spectrum data generated by glycopeptides at the MS2 stage of the mass spectrometer, and a glycopeptide is identified as core-fucosylated when the number of matched peaks exceeds a threshold;
2. Machine learning. Core fucose identification is treated as a binary classification problem: suitable characteristic ions are selected and a binary classifier is trained to solve it. For example, Heeyoun applied a Support Vector Machine (SVM) and a Deep Neural Network (DNN) to core fucose identification.
However, existing core fucose identification methods cannot distinguish core fucose from non-core fucose in which fucose migration has occurred. Because fucose migration is not considered, the training data may contain wrongly labeled data; that is, some of the mass spectra labeled as core fucose in the training set are mislabeled, so the trained model is unreliable for core fucose identification. Reliable identification of core fucose on proteins helps to explain the function of protein core fucosylation and is also of great importance for the discovery of new biomarkers that can be used for the prognosis and diagnosis of cancer.
Through the above analysis, the problems and defects of the prior art are as follows:
(1) Existing core fucose identification algorithms based on mass spectrometry cannot distinguish a core fucose mass spectrum from a non-core fucose mass spectrum with fucose migration.
(2) Because fucose migration is not considered, the training data may contain wrongly labeled data; that is, some of the mass spectra labeled as core fucose in the training set are mislabeled, so the trained model is unreliable for core fucose identification.
The difficulty in solving the above problems and defects is: it is known only that fucose migration occurs, while the conditions under which it occurs, the migration positions, and its statistical properties are unknown, which makes core fucose identification very challenging.
The significance of solving the above problems and defects is: core fucosylation changes the secondary and tertiary conformation of glycoproteins, thereby playing an important biological role in the development, progression and metastasis of tumors. Because of fucose migration, non-core fucose is easily identified as core fucose, which misleads the understanding of tumor development. High-quality identification of core fucose is therefore important for understanding the biological mechanisms of tumors.
Disclosure of Invention
In view of the problems in the prior art, the invention provides a core fucose identification method, system, equipment, medium and terminal, and in particular an autoencoder-based core fucose identification method, system, equipment, medium and terminal. The aim is to solve the problem that existing core fucose identification algorithms based on mass spectrometry cannot distinguish a core fucose mass spectrum from a non-core fucose mass spectrum with fucose migration.
The technical scheme of the invention is summarized as follows: mass spectrum data are obtained from mouse tissues from which FUT8 has been removed and which therefore contain no core fucose; characteristic ions are introduced, and the relative abundances of the characteristic ions are used to train an autoencoder model of non-core fucose; core fucose identification is then treated as an anomaly detection problem for this model.
The invention is realized as a core fucose identification method comprising the following steps:
step one, introducing characteristic ions;
step two, data preprocessing;
step three, model training;
step four, threshold calculation;
step five, core fucose identification.
Further, in step one, introducing characteristic ions comprises:
the pentasaccharide core is an intrinsic structure of N-glycans; theoretically, the Y ions generated by fragmentation of core fucose therefore include 10 ions, called the characteristic ions for fucose identification, whereas the Y ions generated when fucose migrates to the pentasaccharide core have masses different from those of these 10 characteristic ions.
Further, in step two, the data preprocessing comprises:
core fucose is absent from mouse tissues from which FUT8 has been removed. The mass spectrum data of these tissues are normalized by dividing the abundance of each Y ion in a mass spectrum by the sum of the abundances of all Y ions in that spectrum, which gives the normalized mass spectrum data:
$$\tilde{a}_j = \frac{a_j}{\sum_{l} a_l}$$
where $a_j$ is the abundance of the $j$-th Y ion and the sum runs over all Y ions in the spectrum;
the relative abundances of the 10 characteristic ions in the normalized mass spectrum data are used as training data.
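The preprocessing step can be sketched as follows. This is a minimal illustration under stated assumptions: the ion labels, the abundances and the helper names `normalize_y_ions` and `feature_vector` are hypothetical, and treating a characteristic ion that is missing from a spectrum as having relative abundance 0 is an assumption that the patent does not state.

```python
def normalize_y_ions(abundances):
    """Divide each Y-ion abundance by the sum over all Y ions in the spectrum."""
    total = sum(abundances.values())
    if total <= 0:
        raise ValueError("spectrum has no positive Y-ion abundance")
    return {ion: value / total for ion, value in abundances.items()}


def feature_vector(normalized, characteristic_ions):
    """Relative abundances of the characteristic ions.

    Ions absent from the spectrum are given 0.0 (an assumption).
    """
    return [normalized.get(ion, 0.0) for ion in characteristic_ions]


# Hypothetical spectrum with made-up ion labels and abundances.
spectrum = {"Y1": 200.0, "Y2": 100.0, "Y3": 50.0, "other_Y": 50.0}
normalized = normalize_y_ions(spectrum)
features = feature_vector(normalized, ["Y1", "Y2", "Y3", "Y4"])
```

In the method itself the feature vector would hold the relative abundances of the 10 characteristic ions; four labels are used here only to keep the example short.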
Further, in step three, the model training comprises:
training a non-core fucose autoencoder; the non-core fucose autoencoder is a 7-layer artificial neural network in which the input layer and the output layer each contain 10 artificial neurons, the five hidden layers contain, from left to right, 9, 8, 7, 8 and 9 artificial neurons, and the artificial neurons of adjacent layers are fully connected; the activation function of the artificial neurons is the tanh function, the gradient descent training method of the neural network is the Adam algorithm, the learning rate is set to 0.0001, the maximum number of iterations is 1000, and the error tolerance is 0.0000001.
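The layer layout described above can be made concrete with a small, untrained forward pass. This sketch only illustrates the 10-9-8-7-8-9-10 fully connected structure with tanh activations; it does not implement Adam training. The random initialization, the zero biases and the linear (non-tanh) output layer are assumptions, and all function names are hypothetical.

```python
import math
import random

LAYER_SIZES = [10, 9, 8, 7, 8, 9, 10]  # input, five hidden layers, output


def init_layers(sizes, seed=0):
    """Random weights and zero biases for each fully connected layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, x):
    """Propagate x through the stack: tanh on hidden layers, linear output (assumed)."""
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        is_output = index == len(layers) - 1
        x = z if is_output else [math.tanh(value) for value in z]
    return x


def parameter_count(sizes):
    """Number of weights plus biases in the fully connected stack."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


layers = init_layers(LAYER_SIZES)
reconstruction = forward(layers, [0.1] * 10)
n_params = parameter_count(LAYER_SIZES)
```

With these layer sizes the network has 487 trainable parameters, and the 7-neuron middle layer is the bottleneck that forces a compressed representation of the 10 relative abundances.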
Further, in step four, the threshold calculation comprises:
denote the normalized data set of non-core fucose mass spectra as $X = \{x^{(1)}, x^{(2)}, x^{(3)}, \dots, x^{(N)}\}$ and the autoencoder as $f_\phi \cdot g_\theta$; calculate the reconstruction error of each $x^{(i)}$ in the data set $X$, denoted $\varepsilon(x^{(i)})$:
$$\varepsilon(x^{(i)}) = \left\| x^{(i)} - (f_\phi \cdot g_\theta)(x^{(i)}) \right\|_2$$
The threshold $\alpha$ is calculated as:
$$\alpha = \mu + k \cdot \sigma$$
where $\mu$ is the mean of $\varepsilon(x^{(i)})$, $\sigma$ is the standard deviation of $\varepsilon(x^{(i)})$, and $k$ is a user parameter.
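The threshold formula can be sketched as below. The error values are made up for illustration, the function name `anomaly_threshold` is hypothetical, and the use of the population standard deviation (rather than the sample form) is an assumption, since the text does not say which is meant.

```python
import statistics


def anomaly_threshold(training_errors, k):
    """Return alpha = mu + k * sigma over the training reconstruction errors."""
    mu = statistics.fmean(training_errors)
    sigma = statistics.pstdev(training_errors)  # population form; an assumption
    return mu + k * sigma


# Made-up reconstruction errors for four training spectra.
errors = [0.25, 0.75, 0.25, 0.75]
alpha = anomaly_threshold(errors, k=0.5)  # mu = 0.5, sigma = 0.25
```

A larger k raises the threshold, so fewer spectra are flagged as anomalous.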
Further, in step five, the core fucose identification comprises:
denote the normalized data set of the mass spectrum data to be identified as $Y = \{y^{(1)}, y^{(2)}, y^{(3)}, \dots, y^{(M)}\}$; calculate the reconstruction error of each $y^{(i)}$ in the data set $Y$, denoted $\varepsilon(y^{(i)})$:
$$\varepsilon(y^{(i)}) = \left\| y^{(i)} - (f_\phi \cdot g_\theta)(y^{(i)}) \right\|_2$$
If $\varepsilon(y^{(i)}) > \alpha$, the $i$-th mass spectrum to be identified is identified as core fucose; if $\varepsilon(y^{(i)}) \le \alpha$, the $i$-th mass spectrum to be identified is identified as non-core fucose.
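The decision rule of step five amounts to a single comparison per spectrum. The sketch below is an illustration only: the function name `identify` and the error values are hypothetical, and assigning an error exactly equal to the threshold to the non-core class is an assumption.

```python
def identify(errors, alpha):
    """Label each spectrum by comparing its reconstruction error with alpha."""
    return ["core fucose" if e > alpha else "non-core fucose" for e in errors]


# Made-up reconstruction errors for three spectra to be identified.
labels = identify([0.25, 0.5, 0.75], alpha=0.5)
```

Spectra that the non-core fucose model reconstructs poorly are the anomalies, and these are the ones reported as core fucose.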
Another object of the present invention is to provide a core fucose identification system that applies the core fucose identification method, the core fucose identification system comprising:
a characteristic ion introduction module for introducing the characteristic ions for fucose identification;
a data preprocessing module for normalizing the mass spectrum data of mouse tissues from which FUT8 has been removed and in which core fucose is absent;
a model training module for training the non-core fucose autoencoder;
a threshold calculation module for calculating the reconstruction error of each $x^{(i)}$ in the data set $X$;
a core fucose identification module for identifying core fucose by calculating the reconstruction error of each $y^{(i)}$ in the data set $Y$.
It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
introducing the characteristic ions for fucose identification; normalizing the mass spectrum data of mouse tissues from which FUT8 has been removed and in which core fucose is absent; training a non-core fucose autoencoder using the relative abundances of the 10 characteristic ions of the normalized mass spectrum data as training data; calculating the reconstruction error of each $x^{(i)}$ in the data set $X$; and identifying core fucose by calculating the reconstruction error of each $y^{(i)}$ in the data set $Y$.
It is another object of the present invention to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
introducing the characteristic ions for fucose identification; normalizing the mass spectrum data of mouse tissues from which FUT8 has been removed and in which core fucose is absent; training a non-core fucose autoencoder using the relative abundances of the 10 characteristic ions of the normalized mass spectrum data as training data; calculating the reconstruction error of each $x^{(i)}$ in the data set $X$; and identifying core fucose by calculating the reconstruction error of each $y^{(i)}$ in the data set $Y$.
Another object of the present invention is to provide an information data processing terminal for implementing the core fucose identification system.
Taken together, the above technical schemes overcome a technical prejudice in the field and fill a gap in the field. The advantages and positive effects of the invention are as follows: the core fucose identification method provided by the invention solves the problem that a core fucose mass spectrum cannot be distinguished from a non-core fucose mass spectrum with fucose migration. Because core fucose is absent from mouse tissues from which FUT8 has been removed, the invention learns a representation of non-core fucose data with an autoencoder and treats core fucose identification as an anomaly detection problem, which avoids training data that contain non-core fucose affected by fucose migration or wrongly labeled core fucose data.
The invention first uses mouse tissues from which FUT8 has been removed to obtain data that contain only non-core fucose for training the non-core fucose model, so that the training data contain neither core fucose nor non-core fucose with fucose migration. Technically, the invention uses an autoencoder to learn the characteristics of non-core fucose data and treats core fucose identification as an anomaly detection problem; it introduces 10 characteristic ions and feeds their relative abundances to the autoencoder to decide whether a mass spectrum to be identified is a core fucose mass spectrum. The method is simple to operate, and the training data contain only non-core fucose data; the invention can distinguish core fucose from non-core fucose with fucose migration, and identification is fast.
The invention aims to identify as many as possible of the mass spectra that are actually core fucose; that is, the identification result should contain as many core fucose mass spectra as possible while tolerating a small number of spectra that are actually non-core fucose. The parameter k is therefore set to 0.4, and the resulting identification accuracy is shown in Table 1. The model achieves high accuracy for both non-core fucose and core fucose identification, which indicates that the technique and system of the invention are reliable and effective for core fucose identification.
Table 1. Core fucose identification results based on the autoencoder (k = 0.4)
                   Number of spectra   Accuracy
Non-core fucose    1199                89.86%
Core fucose        426                 98.83%
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method for identifying core fucose according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a core fucose identification method provided in an embodiment of the present invention.
FIG. 3 is a block diagram of a core fucose identification system according to an embodiment of the present invention;
in the figure: 1. a characteristic ion introduction module; 2. a data preprocessing module; 3. a model training module; 4. a threshold calculation module; 5. a core fucose identification module.
FIG. 4 is a schematic diagram of characteristic ions for core fucose identification provided by embodiments of the present invention.
FIG. 5 is a schematic diagram of the parameters of the autoencoder model according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating the influence of the parameter k on the identification accuracy according to an embodiment of the present invention.
FIG. 7 is a schematic diagram of an example of fucose migration phenomenon provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In view of the problems in the prior art, the present invention provides a method, system, device, medium and terminal for identifying core fucose, and the present invention is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the method for identifying core fucose provided by the embodiment of the present invention includes the following steps:
S101, introducing characteristic ions;
S102, preprocessing data;
S103, training a model;
S104, calculating a threshold value;
S105, core fucose identification.
The schematic diagram of the core fucose identification method provided by the embodiment of the invention is shown in figure 2.
The core fucose identification system provided by the embodiment of the invention is shown in figure 3, and comprises:
a characteristic ion introduction module 1 for introducing the characteristic ions for fucose identification;
a data preprocessing module 2 for normalizing the mass spectrum data of mouse tissues from which FUT8 has been removed and in which core fucose is absent;
a model training module 3 for training the non-core fucose autoencoder;
a threshold calculation module 4 for calculating the reconstruction error of each $x^{(i)}$ in the data set $X$;
a core fucose identification module 5 for identifying core fucose by calculating the reconstruction error of each $y^{(i)}$ in the data set $Y$.
The technical solution of the present invention will be further described below with reference to the term explanation.
An auto-encoder: a neural network that learns to reconstruct input data in an unsupervised manner;
core fucose: an N-glycosylation modification.
The technical solution of the present invention is further described below with reference to specific examples.
Core fucose is not present in the tissues of mice from which FUT8 was removed. In this method, the autoencoder learns a representation of the non-core fucose mass spectrum data and core fucose identification is treated as an anomaly detection problem, which avoids training data that contain non-core fucose affected by fucose migration or wrongly labeled core fucose data.
The technical route of the invention is shown in figure 2.
The technical scheme of the invention is as follows:
(1) introduction of characteristic ions
The pentasaccharide core is an intrinsic structure of N-glycans; theoretically, the Y ions generated by fragmentation of core fucose therefore include the 10 ions shown in FIG. 4, called the characteristic ions for fucose identification, whereas the Y ions generated when fucose migrates to the pentasaccharide core usually have masses different from those of the 10 characteristic ions.
(2) Data pre-processing
Core fucose is not present in the tissues of mice from which FUT8 was removed. The mass spectrum data of these tissues are normalized by dividing the abundance of each Y ion in a mass spectrum by the sum of the abundances of all Y ions in that spectrum, which gives the normalized mass spectrum data:
$$\tilde{a}_j = \frac{a_j}{\sum_{l} a_l} \tag{1}$$
where $a_j$ is the abundance of the $j$-th Y ion and the sum runs over all Y ions in the spectrum.
The relative abundances of the 10 characteristic ions in the normalized mass spectrum data are taken as training data.
(3) Model training
The invention trains a non-core fucose autoencoder, which is a 7-layer artificial neural network. As shown in FIG. 5, the input layer and the output layer each contain 10 artificial neurons, the five hidden layers contain, from left to right, 9, 8, 7, 8 and 9 artificial neurons, and the artificial neurons of adjacent layers are fully connected. The activation function of the artificial neurons is the tanh function, the gradient descent training method of the neural network is the Adam algorithm, the learning rate is set to 0.0001, the maximum number of iterations is 1000, and the error tolerance is 0.0000001.
(4) Threshold calculation
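Denote the normalized data set of non-core fucose mass spectra as $X = \{x^{(1)}, x^{(2)}, x^{(3)}, \dots, x^{(N)}\}$ and the autoencoder as $f_\phi \cdot g_\theta$. The reconstruction error of each $x^{(i)}$ in the data set $X$, denoted $\varepsilon(x^{(i)})$, is calculated as:
$$\varepsilon(x^{(i)}) = \left\| x^{(i)} - (f_\phi \cdot g_\theta)(x^{(i)}) \right\|_2 \tag{2}$$
The threshold $\alpha$ is calculated as:
$$\alpha = \mu + k \cdot \sigma \tag{3}$$
where $\mu$ is the mean of $\varepsilon(x^{(i)})$, $\sigma$ is the standard deviation of $\varepsilon(x^{(i)})$, and $k$ is a user parameter.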
(5) Core fucose identification
Denote the normalized data set of the mass spectrum data to be identified as $Y = \{y^{(1)}, y^{(2)}, y^{(3)}, \dots, y^{(M)}\}$. The reconstruction error of each $y^{(i)}$ in the data set $Y$, denoted $\varepsilon(y^{(i)})$, is calculated as:
$$\varepsilon(y^{(i)}) = \left\| y^{(i)} - (f_\phi \cdot g_\theta)(y^{(i)}) \right\|_2 \tag{4}$$
If $\varepsilon(y^{(i)}) > \alpha$, the $i$-th mass spectrum to be identified is identified as core fucose; if $\varepsilon(y^{(i)}) \le \alpha$, the $i$-th mass spectrum to be identified is identified as non-core fucose.
The invention first uses mouse tissues from which FUT8 has been removed to obtain data that contain only non-core fucose, so that the training data contain neither core fucose nor non-core fucose with fucose migration. Technically, the invention uses an autoencoder to learn the characteristics of non-core fucose data and treats core fucose identification as an anomaly detection problem; it introduces 10 characteristic ions and feeds their abundances to the autoencoder to decide whether a mass spectrum to be identified is a core fucose mass spectrum. The method is simple to operate, and the training data contain only non-core fucose data; the invention can distinguish core fucose from non-core fucose with fucose migration, and identification is fast.
The technical solution of the present invention is further described below with reference to simulation experiments.
Experimental examples: the following examples are for illustrative purposes and are not intended to limit the scope of the present invention.
In the experiment, an autoencoder was trained on a training set of 18700 non-core fucose mass spectra; the mass spectra to be identified comprised 1199 non-core fucose mass spectra and 426 high-mannose-type core fucose mass spectra. FIG. 6 shows how the identification accuracy on the mass spectra to be identified, obtained with the autoencoder trained on the training set, varies with the value of the parameter k (dark: core fucose identification accuracy; light: non-core fucose identification accuracy).
Table 1. Core fucose identification results based on the autoencoder (k = 0.4)
                   Number of spectra   Accuracy
Non-core fucose    1199                89.86%
Core fucose        426                 98.83%
The invention aims to identify as many as possible of the mass spectra that are actually core fucose; that is, the identification result should contain as many core fucose mass spectra as possible while tolerating a small number of spectra that are actually non-core fucose. The parameter k is therefore set to 0.4, and the resulting identification accuracy is shown in Table 1. The model achieves high accuracy for both non-core fucose and core fucose identification, which indicates that the technique and system of the invention are reliable and effective for core fucose identification.
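The trade-off governed by k can be sketched with a small sweep. The reconstruction errors below are made up and are not the patent's data; the function name `accuracy_by_class`, the population standard deviation and the handling of ties are assumptions made for illustration.

```python
import statistics


def accuracy_by_class(train_errors, noncore_errors, core_errors, k):
    """Return (non-core accuracy, core accuracy) for one value of k."""
    alpha = statistics.fmean(train_errors) + k * statistics.pstdev(train_errors)
    noncore_ok = sum(e <= alpha for e in noncore_errors) / len(noncore_errors)
    core_ok = sum(e > alpha for e in core_errors) / len(core_errors)
    return noncore_ok, core_ok


# Made-up reconstruction errors, not the patent's data.
train = [0.25, 0.75, 0.25, 0.75]  # mean 0.5, population std 0.25
noncore = [0.25, 0.5, 0.625, 1.0]
core = [0.5625, 0.875, 1.0, 1.5]

sweep = {k: accuracy_by_class(train, noncore, core, k) for k in (0.0, 1.0, 2.0)}
```

In this toy data, raising k from 0 to 2 increases the non-core accuracy from 0.5 to 1.0 while the core accuracy falls from 1.0 to 0.25, which is the kind of trade-off that motivates choosing a small k when missing core fucose spectra is the costlier error.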
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented wholly or partially in software, it may take the form of a computer program product that includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The above description covers only specific embodiments of the present invention, and the scope of protection of the invention is not limited thereto; all modifications, equivalents and improvements made within the spirit and principle of the invention are intended to fall within the scope defined by the appended claims.

Claims (10)

1. A core fucose identification method, characterized in that the core fucose identification method comprises the following steps: step 1, introducing characteristic ions; step 2, data preprocessing; step 3, model training; step 4, threshold calculation; step 5, core fucose identification.

2. The core fucose identification method according to claim 1, characterized in that, in step 1, introducing the characteristic ions comprises: the pentasaccharide core is an inherent structure of every N-glycan, so that in theory the Y ions produced by fragmentation of core fucose always include 10 ions, which are called the characteristic ions for fucose identification, whereas the Y ions produced when fucose migrates onto the pentasaccharide core have masses different from those of these 10 characteristic ions.

3. The core fucose identification method according to claim 1, characterized in that, in step 2, the data preprocessing comprises: core fucose is absent from the tissues of mice in which FUT8 has been removed; the mass spectrometry data of these tissues are normalized by dividing the abundance of each Y ion in the mass spectrometry data by the sum of the abundances of all ions, giving the normalized mass spectrometry data:
[formula image FDA0003317191740000011]
the relative abundances of the 10 characteristic ions in the normalized mass spectrometry data are used as training data.
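The normalization in claim 3 can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the ion labels `Y1` to `Y10`, the dictionary layout of a spectrum, and the choice to treat an unobserved characteristic ion as abundance 0 are all assumptions, since the claims fix only the number of characteristic ions and the division by the total abundance.

```python
def normalize_spectrum(abundances):
    """Divide each ion abundance by the sum of all ion abundances in the spectrum."""
    total = sum(abundances.values())
    if total <= 0:
        raise ValueError("spectrum has no positive total abundance")
    return {ion: value / total for ion, value in abundances.items()}


def feature_vector(normalized, characteristic_ions):
    """Relative abundances of the characteristic ions (assumption: missing ions count as 0)."""
    return [normalized.get(ion, 0.0) for ion in characteristic_ions]


# Hypothetical labels: the claims state that there are 10 characteristic ions,
# but the names used here are placeholders.
CHARACTERISTIC_IONS = [f"Y{i}" for i in range(1, 11)]

# Toy spectrum with two characteristic ions and one unrelated ion.
spectrum = {"Y1": 50.0, "Y2": 30.0, "other": 20.0}
features = feature_vector(normalize_spectrum(spectrum), CHARACTERISTIC_IONS)
```

Each spectrum then contributes one 10-element vector, which matches the 10-neuron input layer described in claim 4.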
4. The core fucose identification method according to claim 1, characterized in that, in step 3, the model training comprises: training a non-core-fucose autoencoder; the non-core-fucose autoencoder is a 7-layer artificial neural network whose input layer and output layer each contain 10 artificial neurons and whose hidden layers contain, from left to right, 9, 8, 7, 8 and 9 artificial neurons, with the artificial neurons of adjacent layers fully connected; the activation function of the artificial neurons is the tanh function, the gradient descent training method of the neural network is the Adam algorithm, the learning rate is set to 0.0001, the maximum number of iterations is 1000, and the error tolerance is 0.0000001.

5. The core fucose identification method according to claim 1, characterized in that, in step 4, the threshold calculation comprises: denoting the normalized data set of non-core-fucose mass spectra as X = {x^(1), x^(2), x^(3), …, x^(N)} and the autoencoder as f_φ·g_θ, calculating the reconstruction error of each x^(i) in the data set X, denoted
[formula image FDA0003317191740000021]
[formula image FDA0003317191740000022]
the threshold α is calculated as: α = μ + k·σ; where μ is the mean of
[formula image FDA0003317191740000023]
σ is the standard deviation of
[formula image FDA0003317191740000024]
and k is a user parameter.
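The network shape in claim 4 and the threshold rule in claim 5 can be sketched together. Several points below are assumptions because the corresponding formula images are not reproduced here: the reconstruction error is taken to be the mean squared difference between input and output, σ is taken as the population standard deviation, tanh is applied on the output layer as well, and the weights are random placeholders. No training is performed in this sketch; the claimed Adam settings (learning rate 0.0001, at most 1000 iterations, tolerance 0.0000001) would apply to a real training loop.

```python
import math
import random
import statistics

# Layer widths from claim 4: input 10, hidden 9-8-7-8-9, output 10.
LAYER_SIZES = [10, 9, 8, 7, 8, 9, 10]


def init_layers(sizes, seed=0):
    """Random placeholder weights for each fully connected layer (untrained)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def reconstruct(x, layers):
    """Forward pass through all layers with tanh activations."""
    activation = list(x)
    for weights, biases in layers:
        activation = [
            math.tanh(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation


def reconstruction_error(x, layers):
    """Assumed error measure: mean squared difference between input and reconstruction."""
    x_hat = reconstruct(x, layers)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def threshold(errors, k):
    """alpha = mu + k * sigma over the training reconstruction errors."""
    mu = statistics.mean(errors)
    sigma = statistics.pstdev(errors)  # assumption: population standard deviation
    return mu + k * sigma


layers = init_layers(LAYER_SIZES)
training_set = [[0.1] * 10, [0.05] * 10, [0.2] * 10]  # toy stand-in for data set X
errors = [reconstruction_error(x, layers) for x in training_set]
alpha = threshold(errors, k=3)  # k is the user parameter from claim 5
```

A larger `k` raises the threshold, so fewer spectra exceed it. Libraries such as scikit-learn expose settings of the same kind (hidden layer sizes, tanh activation, Adam solver, initial learning rate, iteration cap, tolerance), but the patent does not say which software was used.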
6. The core fucose identification method according to claim 1, characterized in that, in step 5, the core fucose identification comprises: denoting the normalized data set of the mass spectrometry data to be identified as Y = {y^(1), y^(2), y^(3), …, y^(M)}, calculating the reconstruction error of each y^(i) in the data set Y, denoted
[formula image FDA0003317191740000025]
[formula image FDA0003317191740000026]
if
[formula image FDA0003317191740000027]
the i-th mass spectrum to be identified is identified as core fucose; if
[formula image FDA0003317191740000028]
the i-th mass spectrum to be identified is identified as non-core fucose.
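The decision rule in claim 6 compares each reconstruction error with the threshold α. The two inequality conditions are formula images that are not reproduced here, so the direction used below is an assumption: because the autoencoder is trained only on non-core-fucose spectra, a spectrum it reconstructs poorly (error above α) is labeled core fucose, and an error equal to α is labeled non-core fucose. The function name and label strings are illustrative.

```python
def identify(errors, alpha):
    """Label each spectrum by comparing its reconstruction error with the threshold.

    Assumption: error > alpha means core fucose; error <= alpha means non-core fucose.
    """
    return ["core fucose" if e > alpha else "non-core fucose" for e in errors]


# Toy reconstruction errors for three spectra to be identified.
labels = identify([0.002, 0.150, 0.010], alpha=0.010)
```

In this toy run only the second spectrum exceeds the threshold and is labeled core fucose.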
7. A core fucose identification system applying the core fucose identification method according to any one of claims 1 to 6, characterized in that the core fucose identification system comprises: a characteristic ion introduction module for introducing the characteristic ions for fucose identification; a data preprocessing module for normalizing the mass spectrometry data of tissues of mice in which FUT8 has been removed, in which core fucose is absent; a model training module for training the non-core-fucose autoencoder; a threshold calculation module for calculating the mean and variance of the reconstruction errors of x^(i) in the data set X and thereby determining the threshold; and a core fucose identification module for identifying core fucose by calculating the reconstruction error of y^(i) in the data set Y.

8. A computer device, characterized in that the computer device comprises a memory and a processor, the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to perform the following steps: introducing the characteristic ions for fucose identification; core fucose being absent from the tissues of mice in which FUT8 has been removed, normalizing the mass spectrometry data of these tissues and using the relative abundances of the 10 characteristic ions in the normalized mass spectrometry data as training data; training the non-core-fucose autoencoder; calculating the reconstruction error of x^(i) in the data set X; and identifying core fucose by calculating the reconstruction error of y^(i) in the data set Y.

9. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps: introducing the characteristic ions for fucose identification; core fucose being absent from the tissues of mice in which FUT8 has been removed, normalizing the mass spectrometry data of these tissues and using the relative abundances of the 10 characteristic ions in the normalized mass spectrometry data as training data; training the non-core-fucose autoencoder; calculating the reconstruction error of x^(i) in the data set X; and identifying core fucose by calculating the reconstruction error of y^(i) in the data set Y.

10. An information data processing terminal, characterized in that the information data processing terminal is used to implement the core fucose identification system according to claim 7.
CN202111235011.4A 2021-10-22 2021-10-22 Method, system, equipment, medium and terminal for identifying core fucose Active CN114171130B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111235011.4A CN114171130B (en) 2021-10-22 2021-10-22 Method, system, equipment, medium and terminal for identifying core fucose


Publications (2)

Publication Number Publication Date
CN114171130A true CN114171130A (en) 2022-03-11
CN114171130B CN114171130B (en) 2024-07-19

Family

ID=80477172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111235011.4A Active CN114171130B (en) 2021-10-22 2021-10-22 Method, system, equipment, medium and terminal for identifying core fucose

Country Status (1)

Country Link
CN (1) CN114171130B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150160233A1 (en) * 2012-05-21 2015-06-11 Indiana University Research And Technology Corporation Identification and Quantification of Intact Glycopeptides in Complex Samples
US20180101643A1 (en) * 2015-05-18 2018-04-12 The Regents Of The University Of California Systems and Methods for Predicting Glycosylation on Proteins
WO2018223025A1 (en) * 2017-06-01 2018-12-06 Brandeis University System and method for determining glycan topology using tandem mass spectra
CN110009706A (en) * 2019-03-06 2019-07-12 上海电力学院 A kind of digital cores reconstructing method based on deep-neural-network and transfer learning
US20200273545A1 (en) * 2019-02-22 2020-08-27 Board Of Regents Of The Nevada System Of Higher Education, On Behalf Of The University Of Nevada Computer-implemented methods and systems for identifying a species from mass spectra
CN113383236A (en) * 2018-11-23 2021-09-10 新加坡科技研究局 Method for multi-attribute identification of unknown biological samples
CN113484400A (en) * 2021-07-01 2021-10-08 上海交通大学 Mass spectrogram molecular formula calculation method based on machine learning
CN113495094A (en) * 2020-04-01 2021-10-12 中国电信股份有限公司 Molecular mass spectrum model training method, molecular mass spectrum simulation method and computer


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YANG, Y et al.: "Prediction of glycopeptide fragment mass spectra by deep learning", NATURE COMMUNICATIONS, vol. 15, no. 1, 10 April 2024 (2024-04-10), pages 1 - 12 *
乔彦涛; 缪佳铮; 孙世伟; 刘金刚; 卜东波: "A survey of protein sequence identification techniques based on tandem mass spectrometry", 计算机科学与探索, no. 02, 15 February 2010 (2010-02-15), pages 5 - 15 *
苏远杰: "Research on core fucose identification methods and algorithms based on mass spectrometry data", 中国优秀硕士学位论文全文数据库 (electronic journal), no. 04, 31 December 2022 (2022-12-31), pages 006 - 232 *

Also Published As

Publication number Publication date
CN114171130B (en) 2024-07-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant