
WO2024230826A1 - Method for predicting protein solubility - Google Patents

Method for predicting protein solubility (original title: Procédé de prédiction de solubilité d'une protéine)

Info

Publication number
WO2024230826A1
WO2024230826A1 (PCT application no. PCT/CN2024/092583; CN2024092583W)
Authority
WO
WIPO (PCT)
Prior art keywords
protein
sample
features
model
solubility
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2024/092583
Other languages
English (en)
Chinese (zh)
Inventor
樊隆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Genscript Shanghai Biotech Co Ltd
Original Assignee
Genscript Shanghai Biotech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Genscript Shanghai Biotech Co Ltd filed Critical Genscript Shanghai Biotech Co Ltd
Priority to CN202480001588.4A priority Critical patent/CN118696380A/zh
Publication of WO2024230826A1 publication Critical patent/WO2024230826A1/fr
Anticipated expiration legal-status Critical
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the present disclosure relates to protein research, and more particularly to methods for predicting protein solubility.
  • Improving protein solubility plays a very important role in increasing the yield of recombinant proteins, conducting protein function research (including structural proteomics and functional proteomics) and screening precursors for enzyme engineering, and is directly related to the practical function and production cost of recombinant proteins.
  • this important evaluation index can be used to improve the screening efficiency, find high-potential proteins, and effectively reduce the scale of downstream experimental verification.
  • the primary sequence of a protein determines its higher-order structure after folding, and the higher-order structure of a protein determines its solubility. Therefore, both sequence features based on the primary structure and features based on the higher-order structure have been used to construct protein solubility prediction models, including protein sequence length, amino acid residue composition (such as the number of charged amino acid residues, the number of hydrophobic/hydrophilic amino acid residues, k-mer features), protein secondary structure (such as the ratio of alpha helix and beta fold), protein tertiary structure (such as the turn between the alpha carbon atoms of two amino acids), etc. (For detailed feature descriptions, see the papers in Table 1).
  • the statistical, machine learning and deep learning methods currently used to predict protein solubility include: Gaussian discriminant analysis, linear regression, support vector machine, logistic regression, linear programming, random forest, gradient boosting machine, naive Bayes, neural network, convolutional neural network, dual-channel convolutional neural network, bidirectional gated recurrent unit neural network, graph neural network, etc. (For detailed model usage, see the text of each paper in Table 1).
  • the present disclosure proposes a protein solubility prediction method, which combines the features extracted based on the protein pre-training model with the protein primary structure features, and uses an automatic machine learning framework to build and train a model that is compatible with classification problems (predicting whether a protein is soluble) and regression problems (predicting the probability of different proteins being soluble, predicting the effect of amino acid mutations on solubility, and helping to screen high-potential variants in advance) for predicting protein solubility. Furthermore, the hyperparameter combination that optimizes model performance can be obtained through automatic hyperparameter screening in the automatic machine learning framework.
  • a method for predicting protein solubility may include: extracting features based on a protein sequence using a protein pre-training model; extracting primary structural features of the protein based on the protein sequence; concatenating the features extracted by the protein pre-training model with the primary structural features of the protein to obtain a feature vector of the protein; inputting the feature vector of the protein into a protein solubility prediction model to obtain a prediction result of the protein solubility.
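  • As an illustration only, the following minimal Python sketch shows these four steps end to end; the callables plm, ifeature_extractor and predictor are hypothetical placeholders for a protein pre-training model, a primary-structure feature extractor and a trained solubility prediction model, and are not part of the disclosure.

```python
import numpy as np

def predict_solubility(sequence: str, plm, ifeature_extractor, predictor):
    """Illustrative pipeline only; plm, ifeature_extractor and predictor are
    placeholders for a protein pre-training model, a primary-structure feature
    extractor and a trained solubility prediction model, respectively."""
    plm_vec = plm(sequence)                                # features from the pre-training model
    primary_vec = ifeature_extractor(sequence)             # primary structural features
    feature_vec = np.concatenate([plm_vec, primary_vec])   # concatenated feature vector
    return predictor(feature_vec)                          # classification result or probability
```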
  • the protein pre-training model can be implemented by a combination of one or more of the following models: ESM-1b, UniRep, ProteinBert, TAPE, ProtGPT2, ProtTXL, ProtBert, ProtXLNet, ProtAlbert, ProtElectra, ProtT5-XL-BFD, ProtT5-XL-UniRef50 and ProtT5-XXL; more preferably, the protein pre-training model is ProtT5-XL-BFD or ProtT5-XL-UniRef50; more preferably, the protein pre-training model is ProtT5-XL-UniRef50.
  • the protein sequence can be used as input, and the ProtT5-XL-UniRef50 pre-trained model can be used to extract the embedding layer vector output by the encoder, wherein each amino acid corresponds to a 1024-dimensional vector, and each sequence corresponds to an L ⁇ 1024-dimensional feature matrix, wherein L is the length of the amino acid sequence of the protein; the above-mentioned feature matrix is averaged by column to obtain a 1024-dimensional feature vector.
  • the Sentence-T5 Encoder-only mean method can be used to average the values of each dimension of the embedding layer feature values of all amino acids output by the encoder by column to obtain a 1024-dimensional feature vector.
  • the protein sequence can be used as input to extract the primary structural features of the protein using the iFeature tool.
  • the primary structural features of the protein can be selected from AAC (Amino Acid Composition), DPC (Di-Peptide Composition), DDE (Dipeptide Deviation from Expected mean), TPC (Tri-Peptide Composition), CKSAAP (Composition of k-Spaced Amino Acid Pairs), EAAC (Enhanced Amino Acid Composition), GAAC (Grouped Amino Acid Composition), CKSAAGP (Composition of k-Spaced Amino Acid Group Pairs), GDPC (Grouped Di-Peptide Composition), GTPC (Grouped Tri-Peptide Composition), Moran (Moran autocorrelation), Geary (Geary autocorrelation), NMBroto (Normalized Moreau-Broto Autocorrelation), CTDC (Composition, Transition and Distribution: Composition), CTDT (Composition, Transition and Distribution: Transition), CTDD (Composition, Transition and Distribution: Distribution), CTriad (Conjoint Triad), PAAC (Pseudo-Amino Acid Composition) and APAAC (Amphiphilic Pseudo-Amino Acid Composition).
  • the primary structural features of the protein extracted using the iFeature tool include 11 categories, namely AAC, GAAC, Moran, Geary, NMBroto, CTDC, CTDT, CTDD, CTriad, PAAC and APAAC.
  • the protein solubility prediction model can be constructed by the following steps: obtaining a data set, the data set comprising a plurality of training samples, each of the training samples comprising a sample protein sequence and its solubility data; extracting features based on the sample protein sequence using a protein pre-training model; extracting primary structural features of the protein based on the sample protein sequence; concatenating the features extracted by the protein pre-training model with the primary structural features of the protein to obtain a feature vector of the sample protein; and using the feature vector of the sample protein as input data and the solubility data of the sample protein as output data to train a machine learning model, thereby obtaining the final protein solubility prediction model.
  • the protein solubility prediction model can be constructed and trained based on an automatic machine learning framework, using one or more machine learning models as candidate models.
  • the machine learning model can be a deep learning model.
  • the automatic machine learning framework can be selected from AutoGluon (https://auto.gluon.ai/stable/index.html), Auto-Sklearn (https://automl.github.io/auto-sklearn/master/), TPOT (http://epistasislab.github.io/tpot/), Hyperopt Sklearn (https://github.com/hyperopt/hyperopt-sklearn), Auto-Keras (https://autokeras.com/), H2O AutoML (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html), TransmogrifAI (https://transmogrif.ai/).
  • the automatic learning framework can be AutoGluon.
  • the machine learning model is a deep learning model, which can be selected from NN_Torch, FASTAI, LGBModel, RFModel, KNNModel, etc.
  • the deep learning model can be one or more.
  • the deep learning model can be NN_Torch and/or FASTAI.
  • the deep learning model can be NN_Torch and FASTAI.
  • the machine learning model can be automatically selected by an automatic learning framework, or a list and screening range of machine learning models can be manually set and then selected by the automatic learning framework.
  • AutoGluon can be used as an automatic machine learning framework
  • deep learning models NN_Torch and FASTAI can be selected as candidate models to construct and train the protein solubility prediction model.
  • other features of the protein can also be extracted based on the protein sequence; the features extracted by the protein pre-training model are spliced with the primary structure features of the protein and other features of the protein to obtain a feature vector of the spliced protein.
  • the other characteristics of the protein are selected from the group consisting of molecular weight, aromaticity, instability index, flexibility, isoelectric point, molar extinction coefficient, grand average of hydropathy (GRAVY), the proportion of helix/turn/sheet in the secondary structure of the protein, SSEC (Secondary Structure Elements Content), SSEB (Secondary Structure Elements Binary), Disorder, DisorderC (Disorder Content), DisorderB (Disorder Binary), ASA (solvent Accessible Surface Area), TA (Torsion Angle), sequence length, ZSCALE (Z-scale) and 48 PseKRAAC (48 pseudo K-tuple reduced amino acid composition) features; more preferably, the other characteristics of the protein include molecular weight, aromaticity, instability index, flexibility, isoelectric point, molar extinction coefficient, grand average of hydropathy and the proportion of helix/turn/sheet in the secondary structure of the protein.
  • a method for constructing a protein solubility prediction model may include: obtaining a data set, the data set includes multiple training samples, each of the training samples includes a sample protein sequence and its solubility data; based on the sample protein sequence, extracting features using a protein pre-training model; based on the sample protein sequence, extracting the primary structural features of the protein; splicing the features extracted by the protein pre-training model with the primary structural features of the protein to obtain a feature vector of the sample protein; using the feature vector of the sample protein as input data, using the solubility data of the sample protein as output data, and training the machine learning model to obtain a final protein solubility prediction model.
  • the description of the construction of the protein solubility prediction model given for the first aspect of the present disclosure is also applicable to the second aspect of the present disclosure.
  • a system for predicting protein solubility may include: an acquisition module for acquiring a sample to be predicted, wherein the sample to be predicted includes a protein sequence of the sample to be predicted; a processing module for obtaining a prediction result of the protein solubility of the sample to be predicted by inputting the protein sequence of the sample to be predicted, wherein the processing module further includes the following submodules: a first feature extraction submodule for extracting features based on the protein sequence of the sample to be predicted using a protein pre-training model; a second feature extraction submodule for extracting the primary structural features of the protein based on the protein sequence of the sample to be predicted; a feature splicing submodule for splicing the features extracted by the protein pre-training model with the primary structural features of the protein to obtain a feature vector of the protein of the sample to be predicted; and a model prediction submodule for inputting the feature vector of the protein of the sample to be predicted into a protein solubility prediction model
  • the system for predicting protein solubility according to the third aspect of the present disclosure can be implemented by a computer.
  • the system can be implemented by executing a computer program to perform the following operations: obtaining a sample to be predicted, wherein the sample to be predicted includes the protein sequence of the sample to be predicted; and obtaining a prediction result of the protein solubility of the sample to be predicted by inputting the protein sequence of the sample to be predicted, wherein this operation further includes the following sub-operations: extracting features based on the protein sequence of the sample to be predicted using a protein pre-training model; extracting the primary structural features of the protein based on the protein sequence of the sample to be predicted; splicing the features extracted by the protein pre-training model with the primary structural features of the protein to obtain a feature vector of the protein of the sample to be predicted; and inputting the feature vector of the protein of the sample to be predicted into the protein solubility prediction model to obtain a prediction result of the protein solubility of the sample to be predicted.
  • a non-transitory computer-readable storage medium for storing a computer program.
  • the computer program includes instructions.
  • the electronic device implements the protein solubility prediction method according to the first aspect of the present disclosure or the method for constructing a protein solubility prediction model according to the second aspect of the present disclosure.
  • a computer system comprising: a processor, a memory, and a computer program.
  • the computer program is stored in the memory and is configured to be executed by the processor.
  • the computer program comprises instructions for implementing the protein solubility prediction method according to the first aspect of the present disclosure or the method for constructing a protein solubility prediction model according to the second aspect of the present disclosure.
  • a protein solubility prediction model constructed by the method of the second aspect of the present disclosure is provided. Further provided is the application of the protein solubility prediction model in predicting protein solubility.
  • the protein solubility prediction model can be applied to predict the solubility of any protein sample of known sequence (such as a mutant of a known protein), and the predicted result can be the probability of solubility or a yes/no solubility classification.
  • the comprehensive performance of the method disclosed in the present invention in predicting protein solubility exceeds that of the NetSolP method, which is currently ranked first in classification performance.
  • the method disclosed in the present invention has a high accuracy rate in predicting the solubility of proteins, and is applicable to both the classification prediction problem of predicting whether a protein is soluble and the regression problem of predicting the effect of amino acid mutations on solubility. It can be used to predict the solubility of recombinant protein expression and to screen enzyme engineering mutants based on solubility potential.
  • FIG. 1 is a flow chart of a method for predicting protein solubility according to an embodiment of the present disclosure.
  • FIG. 2 is a schematic diagram showing the association between a protein solubility prediction method and a prediction model construction method according to an embodiment of the present disclosure.
  • FIG. 3 is a more detailed flowchart of a protein sequence-based feature extraction method according to an embodiment of the present disclosure.
  • FIG. 4 is a more detailed flowchart of the method for constructing a protein solubility prediction model used in the present disclosure.
  • FIG. 5 shows a schematic block diagram of a system for predicting protein solubility according to the present disclosure.
  • Predicting protein solubility can be divided into two categories. One is a binary classification problem or a multi-classification problem of predicting whether a protein is soluble or insoluble or whether it is easy or difficult to dissolve; the other is a regression problem of predicting the probability of protein solubility. Regardless of the type of problem, the prediction cannot be separated from the collection of solubility data based on expression experiments.
  • the commonly used protein solubility experimental data sets are as follows (see Table 2).
  • the existing protein solubility classification prediction method has low compatibility when used for regression prediction.
  • the NetSolP method has an accuracy of 72.8% on the independent test set NESG dataset for classification problems, but only 66.1% on the CamSol independent test set for regression problems.
  • the present disclosure provides a method for predicting protein solubility. By constructing a prediction model that is compatible with classification problems and regression problems, it is possible to predict whether a protein is soluble and the probability of different proteins being soluble, which is helpful for screening high-potential variants in advance.
  • FIG. 1 shows a flow chart of a protein solubility prediction method according to an embodiment of the present disclosure.
  • the protein solubility prediction method 100 starts at step S110 , in which features are extracted based on the protein sequence using a protein pre-training model.
  • step S120 still based on the protein sequence, the primary structural features of the protein are extracted.
  • step S130 the features extracted by the protein pre-training model in step S110 are concatenated with the primary structure features of the protein extracted in step S120 to obtain a feature vector of the protein.
  • step S140 the feature vector of the protein obtained in step S130 is input into a protein solubility prediction model to obtain a prediction result of protein solubility.
  • the prediction result of protein solubility is: the protein is classified as soluble or insoluble; in another preferred embodiment according to the present disclosure, the prediction result of protein solubility is the probability that the protein is soluble.
  • the method disclosed in the present invention combines the features extracted based on the protein pre-training model with the primary structure features of the protein, and uses an automatic learning framework to construct a prediction model that is compatible with classification problems (predicting whether a protein is soluble) and regression problems (predicting the probability that different proteins are soluble).
  • Figure 2 shows a schematic diagram 200 of the association between the protein solubility prediction method and the prediction model construction method according to an embodiment of the present disclosure.
  • the left side of the dotted line shows the model construction method
  • the right side of the dotted line shows the method of using the model for prediction.
  • the process on the right side of the dotted line in FIG. 2 is actually the protein solubility prediction method according to the embodiment of the present disclosure shown in FIG. 1.
  • the process on the left side of the dotted line in FIG. 2 will be described in detail below (e.g., FIG. 4 and its corresponding text description).
  • both the model construction and the actual prediction process need to use the protein feature extraction method, which will be described in detail below (e.g., FIG. 3 and its corresponding text description).
  • as for the training sample, it includes both the protein sequence of the training sample and the protein solubility data of the training sample.
  • the protein solubility data may indicate whether the protein is soluble or insoluble. Therefore, in the process of model construction, the protein sequence in the training sample is used to extract features.
  • the model based on the automatic machine learning framework (e.g., using the AutoGluon framework) is fully trained to obtain the final protein solubility prediction model based on the automatic machine learning framework.
  • once constructed and trained, the prediction model can be put into the actual work of predicting the protein solubility of samples to be predicted, that is, the process on the right side of the dotted line in FIG. 2.
  • FIG. 3 is a more detailed flowchart of a protein feature extraction method 300 according to an embodiment of the present disclosure.
  • in step S110 of the prediction method of FIG. 1, i.e., the protein feature extraction step, features are first extracted based on the protein sequence using a protein pre-training model.
  • the protein pre-training model described here can be implemented by a combination of one or more of the following models: ESM-1b, UniRep, ProteinBert, TAPE, ProtGPT2, ProtTXL, ProtBert, ProtXLNet, ProtAlbert, ProtElectra, ProtT5-XL-BFD, ProtT5-XL-UniRef50 and ProtT5-XXL.
  • ProtT5-XL-BFD or ProtT5-XL-UniRef50 can be used as the protein pre-training model.
  • the protein sequence pre-training model can be a ProtT5-XL-UniRef50 pre-training model.
  • ProtT5-XL-UniRef50 is a protein sequence pre-training model using a masked language modeling (MLM) objective.
  • ProtT5-XL-UniRef50 is based on the t5-3b model and is pre-trained on a large number of protein sequences in a self-supervised manner.
  • the model can be used for protein feature extraction, and it is best to use features extracted from the encoder.
  • the protein feature extraction method 300 can start at step S310. That is, as described above, according to a preferred embodiment of the present disclosure, in step S310, the protein sequence of the sample to be predicted is used as input, and the ProtT5-XL-UniRef50 pre-trained model is used to extract the embedding layer vector output by the encoder, wherein each amino acid corresponds to a 1024-dimensional vector, and each sequence corresponds to an L ⁇ 1024-dimensional feature matrix, wherein L is the amino acid sequence length of the protein.
  • although the ProtT5-XL-UniRef50 pre-training model is used to extract the feature matrix of the protein sequence in the preferred embodiment of the present disclosure, those skilled in the art are fully motivated to replace it with another or a newly developed protein pre-training model, provided that the input and output remain essentially unchanged, the function is similar and the effect is similar or better.
  • those skilled in the art should understand that even after such a replacement of the pre-training model, the replaced scheme still falls within the scope of protection claimed by the present disclosure.
  • the obtained feature matrix can be compressed. Specifically, the feature matrix can be averaged by column to obtain a 1024-dimensional feature vector. More specifically, as shown in FIG3 , in step S320, the feature matrix obtained in step 310 is compressed, that is, the Sentence-T5 Encoder-only mean processing method is used to average the values of each dimension of the embedding layer feature values of all amino acids output by the encoder by column to obtain a 1024-dimensional feature vector.
  • the embedding layer feature values of the padding code are not calculated, so that the embedding layer feature values of all amino acids constituting a single protein are converted into 1024-dimensional feature values of the protein.
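  • For illustration, a hedged sketch of this embedding extraction using the publicly available Rostlab/prot_t5_xl_uniref50 checkpoint with the Hugging Face transformers library; the preprocessing (space-separated residues, rare amino acids mapped to X) and the exclusion of the terminal special token follow the model card's common usage and are assumptions rather than the exact implementation of the embodiment.

```python
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

# Load the ProtT5-XL-UniRef50 encoder from Hugging Face (a large model; a GPU and
# half precision are advisable in practice).
model_name = "Rostlab/prot_t5_xl_uniref50"
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_name).eval()

def prott5_mean_embedding(sequence: str) -> torch.Tensor:
    # ProtT5 expects space-separated residues; rare amino acids are mapped to X.
    spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))
    inputs = tokenizer(spaced, return_tensors="pt")
    with torch.no_grad():
        out = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
    # last_hidden_state has shape (1, L+1, 1024); the final position is the </s>
    # special token, so only the first L positions (the residues) are kept.
    residue_matrix = out.last_hidden_state[0, : len(sequence)]  # L x 1024 feature matrix
    return residue_matrix.mean(dim=0)                           # column mean -> 1024-dim vector

vec = prott5_mean_embedding("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(vec.shape)  # torch.Size([1024])
```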
  • Sentence-T5 Encoder-only mean is one of the processing methods for the features output by the encoder in Sentence-T5.
  • for Sentence-T5 Encoder-only mean, please refer to Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith Hall, Daniel Cer, and Yinfei Yang, “Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models”, in Findings of the Association for Computational Linguistics: ACL 2022, pages 1864–1874, Dublin, Ireland; Association for Computational Linguistics; doi: 10.18653/v1/2022.findings-acl.146.
  • the entire contents of the above-mentioned documents are hereby incorporated into the present disclosure by reference, making them a part of the contents of the present disclosure.
  • Sentence-T5 Encoder-only mean is a disclosed technology
  • its application in the method of the present disclosure is novel, because the application field involved in the present disclosure is completely different from the application field involved in the original disclosure of Sentence-T5 Encoder-only mean.
  • the present disclosure uses the Sentence-T5 Encoder-only mean method to compress the extracted feature matrix, so that the extracted feature vectors can more deeply integrate the characteristics of the protein, which is helpful for the subsequent construction of a prediction model with better comprehensive performance based on this feature, and the final protein solubility prediction result is more accurate.
  • step S120 of FIG. 1 next, the primary structural features of the protein are extracted based on the protein sequence.
  • the protein sequence is used as input and the iFeature tool is used to extract the primary structural features of the protein.
  • although the iFeature tool is used to extract the primary structural features of proteins in the preferred embodiment of the present disclosure, those skilled in the art are fully motivated to replace it with another or a newly developed extraction tool, provided that the input and output data remain essentially unchanged, the function is similar and the effect is similar or better. Those skilled in the art should understand that even after such a replacement of the extraction tool, the replaced solution still falls within the scope of protection claimed by the present disclosure.
  • the primary structural features of proteins extracted using the iFeature tool may include AAC (Amino Acid Composition), DPC (Di-Peptide Composition), DDE (Dipeptide Deviation from Expected mean), TPC (Tri-Peptide Composition), CKSAAP (Composition of k-Spaced Amino Acid Pairs), EAAC (Enhanced Amino Acid Composition), GAAC (Grouped Amino Acid Composition), CKSAAGP (Composition of k-Spaced Amino Acid Group Pairs), GDPC (Grouped Di-Peptide Composition), GTPC (Grouped Tri-Peptide Composition), Moran (Moran autocorrelation), Geary (Geary autocorrelation), NMBroto (Normalized Moreau-Broto Autocorrelation), CTDC, CTDT and CTDD (Composition, Transition and Distribution descriptors), CTriad (Conjoint Triad), PAAC (Pseudo-Amino Acid Composition) and APAAC (Amphiphilic Pseudo-Amino Acid Composition).
  • the primary structural features of proteins extracted using the iFeature tool include 11 categories, namely AAC, GAAC, Moran, Geary, NMBroto, CTDC, CTDT, CTDD, CTriad, PAAC and APAAC.
  • these features can be standardized, and the standardization parameters are saved and used to process the same features of input sequences during prediction, that is, the standardization parameters of the training set are saved for processing the same features of the independent test set.
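  • A minimal sketch of this standardization step, assuming scikit-learn's StandardScaler (the disclosure does not name a specific standardization library); the fitted parameters are persisted so that exactly the same transform is applied to independent test or prediction samples, and the feature dimensions below are placeholders.

```python
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder matrices with an arbitrary illustrative feature dimension; in practice
# the rows are the iFeature descriptors of the training proteins and of the proteins
# to be predicted, respectively.
train_primary_features = np.random.rand(100, 500)
test_primary_features = np.random.rand(10, 500)

# Fit the standardization parameters on the training set and persist them.
scaler = StandardScaler().fit(train_primary_features)
joblib.dump(scaler, "primary_feature_scaler.joblib")

# During prediction, reload the saved parameters and apply the identical transform.
scaler = joblib.load("primary_feature_scaler.joblib")
test_primary_scaled = scaler.transform(test_primary_features)
```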
  • step S130 of FIG. 1 the features extracted by the protein pre-training model are then concatenated with the primary structure features of the protein to obtain a feature vector of the protein.
  • step S340 the feature vector obtained in step S320 is concatenated with the primary structural feature of the protein obtained in step S330 to obtain a feature vector of the protein.
  • other features of the protein can be extracted based on the protein sequence, and the other features include molecular features, secondary structure and/or higher-order structural features of the protein sequence, thereby further optimizing the model.
  • the iFeature tool can be used to extract other features of the protein.
  • the molecular features of the sequence can be selected from one or more of the following: sequence length, ZSCALE (Z-scale) and 48 PseKRAAC (48 pseudo K-tuple reduced amino acid composition);
  • the secondary structure features can be selected from one or more of the following: SSEC (Secondary Structure Elements Content), SSEB (Secondary Structure Elements Binary);
  • the higher-order structure features can be selected from one or more of the following: Disorder, DisorderC (Disorder Content), DisorderB (Disorder Binary), ASA (solvent Accessible Surface Area) and TA (Torsion Angle).
  • other features can also be extracted using the Bio.SeqUtils.ProtParam module of Biopython.
  • the other molecular features of the protein sequence are selected from one or more of the following: molecular weight, aromaticity, instability index, flexibility, isoelectric point, molar extinction coefficient and grand average of hydropathy (GRAVY); the secondary structure features include the proportion of helix/turn/sheet in the secondary structure of the protein.
  • the other features of the protein include molecular weight, aromaticity, instability index, flexibility, isoelectric point, molar extinction coefficient, grand average of hydropathy and the proportion of helix/turn/sheet in the secondary structure of the protein, extracted using the Bio.SeqUtils.ProtParam module of Biopython.
  • although the Bio.SeqUtils.ProtParam module of Biopython is used to extract other features of proteins in the preferred embodiment of the present disclosure, those skilled in the art are fully motivated to replace the Biopython tool used in the present disclosure with another or a newly developed extraction method, provided that the input and output data remain essentially unchanged, the function is similar and the effect is similar or better. Those skilled in the art should understand that even after such a replacement of the extraction tool, the replaced scheme still falls within the scope of protection claimed in the present disclosure.
  • in a preferred embodiment, the iFeature tool is used to extract the primary structural features of the protein, including AAC, GAAC, Moran, Geary, NMBroto, CTDC, CTDT, CTDD, CTriad, PAAC and APAAC, and the Bio.SeqUtils.ProtParam module of Biopython is used to extract other features of the protein, including molecular weight, aromaticity, instability index, flexibility, isoelectric point, molar extinction coefficient, grand average of hydropathy and the proportion of helix/turn/sheet in the secondary structure of the protein, for a total of 1502 features.
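  • A hedged sketch of the Biopython part of this extraction using the Bio.SeqUtils.ProtParam module; summarizing the per-residue flexibility profile by its mean is an assumption, as the embodiment does not specify how that profile is pooled.

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

def protparam_features(sequence: str) -> list:
    """Global sequence descriptors; the pooling of the flexibility profile is an assumption."""
    pa = ProteinAnalysis(sequence)
    helix, turn, sheet = pa.secondary_structure_fraction()
    eps_reduced, eps_cystines = pa.molar_extinction_coefficient()
    flex = pa.flexibility()                      # per-window flexibility profile (window size 9)
    return [
        pa.molecular_weight(),                   # molecular weight
        pa.aromaticity(),                        # aromaticity
        pa.instability_index(),                  # instability index
        sum(flex) / len(flex),                   # flexibility, summarized here by its mean
        pa.isoelectric_point(),                  # isoelectric point
        eps_reduced, eps_cystines,               # molar extinction coefficients
        pa.gravy(),                              # grand average of hydropathy (GRAVY)
        helix, turn, sheet,                      # helix/turn/sheet fractions
    ]

print(protparam_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```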
  • the features extracted by the protein pre-training model are spliced with the primary structural features of the protein and other features of the protein to obtain the feature vector of the spliced protein.
  • the 1024-dimensional feature vector extracted by the protein pre-training model is spliced with the 1502 extracted primary structural features and other features of the protein to obtain a 2526-dimensional feature vector for each protein.
  • the present embodiment does not limit the order of splicing the three types of features, namely, features extracted through the protein pre-training model, primary structural features of the protein, and other features of the protein. It is only necessary to ensure that the order of splicing the three types of features is consistent during model building and actual prediction.
  • the features extracted through the protein pre-training model, primary structural features of the protein, and other features of the protein are spliced in order to obtain the feature vector of the protein; then, during the actual prediction process, the features extracted through the protein pre-training model, primary structural features of the protein, and other features of the protein are spliced in the same order to obtain the feature vector of the protein of the sample to be predicted.
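  • A minimal sketch of this splicing step; the dimensions follow the embodiment above (1024 pre-training-model features plus 1502 handcrafted features), and the point being illustrated is that the concatenation order must be fixed.

```python
import numpy as np

# The splicing order is arbitrary but must be identical during model construction
# and during prediction. Placeholder values stand in for the real features.
plm_vec = np.zeros(1024)          # ProtT5 mean embedding
handcrafted_vec = np.zeros(1502)  # iFeature + ProtParam features
feature_vec = np.concatenate([plm_vec, handcrafted_vec])
assert feature_vec.shape == (2526,)
```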
  • the method may further include performing dimensionality reduction processing on the feature vector of the spliced protein.
  • the dimensionality reduction processing may further refine the feature vector of the protein, retaining features that are as relevant and effective as possible, simplifying the prediction model, reducing the computational complexity and shortening the computation time, so as to make the subsequent steps more efficient.
  • the principal component analysis (PCA) model may be used to perform dimensionality reduction on the feature vector of the spliced protein.
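  • A minimal sketch of this optional dimensionality reduction with scikit-learn's PCA; the retained-variance threshold and the placeholder data are assumptions, as the disclosure does not specify the number of components.

```python
import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.rand(200, 2526)   # placeholder spliced feature vectors

# Keep enough components to retain 95% of the variance; the threshold is an assumption.
pca = PCA(n_components=0.95).fit(X_train)
X_train_reduced = pca.transform(X_train)

# The fitted PCA transform must be reused unchanged on samples to be predicted.
X_new_reduced = pca.transform(np.random.rand(1, 2526))
print(X_train_reduced.shape, X_new_reduced.shape)
```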
  • the above series of steps can also be used to extract protein features from training samples, so that the extracted protein features are used as input data and input into a prediction model constructed by an automatic machine learning framework (such as the AutoGluon framework), and the protein solubility data in the training samples are used as output data to train a protein solubility prediction model based on the automatic machine learning framework, so as to ultimately determine the model parameters, thereby constructing a protein solubility prediction model with relatively satisfactory accuracy.
  • FIG. 4 is a more detailed flowchart of a method 400 for constructing a protein solubility prediction model used in the present disclosure.
  • each training sample in the multiple training samples in the data set includes a sample protein sequence and its solubility data.
  • the PSI:Biology dataset is selected as a training set to construct a binary classification model, and the source of the dataset is shown in Table 2 above.
  • the sequence and solubility data of the protein are screened from the dataset, where the solubility data is divided into two categories: soluble and insoluble, as the training dataset; the probability value output by the classification model (such as the probability of being judged as a soluble protein) is used for the regression problem.
  • step S420 in FIG. 4 feature extraction is performed on all training sample proteins according to the method of FIG. 3 . Specifically, first, based on the sample protein sequence, features are extracted using a protein pre-training model. Then, based on the sample protein sequence, the primary structural features of the protein are extracted. Finally, the features extracted by the protein pre-training model are spliced with the primary structural features of the protein to obtain a feature vector of the sample protein.
  • other features of the sample protein can also be extracted based on the sample protein sequence.
  • the features extracted by the protein pre-training model are spliced with the primary structural features of the extracted protein and other features of the extracted protein to obtain a feature vector of the spliced sample protein.
  • the feature vector of the spliced sample protein is used as input data, and the solubility data of the sample protein is used as output data.
  • the machine learning model is trained to obtain the final protein solubility prediction model.
  • the protein solubility prediction model can be constructed and trained based on an automatic machine learning (e.g., AutoML) framework, using one or more machine learning models (e.g., deep learning models) as candidate models.
  • AutoGluon is a popular automatic machine learning framework. This AutoML open source toolkit developed by AWS helps achieve strong predictive performance in various machine learning and deep learning models.
  • AutoGluon (see doi:10.48550/arXiv.2003.06505, the entire contents of which are incorporated herein by reference as a part of the present disclosure) has achieved excellent results in evaluations of different automatic machine (deep) learning frameworks (see doi:10.48550/arXiv.2111.02705) thanks to its powerful feature engineering capabilities, automatic model selection, automatic ensembling and layer stacking, and automatic hyperparameter search, and it can effectively avoid overfitting problems; it has therefore been selected as the automatic model learning framework in the embodiments of the present disclosure.
  • step S430 AutoGluon is used as an automatic machine learning framework, and the protein solubility prediction model is constructed and trained based on the deep learning models NN_Torch and FASTAI as candidate models.
  • the hyperparameter tuning option hyperparameter_tune_kwargs can be set to 'bayes' (Bayesian hyperparameter search).
  • a five-fold cross validation was used (i.e., 80% of the data in the PSI:Biology dataset was used as the training set and 20% of the data was used as the validation set), whether the training set protein was soluble was used as the classification value, and MCC was used as the optimization indicator for evaluating the performance of the prediction model.
  • MCC: Matthews Correlation Coefficient
  • the feature vector of the spliced sample protein is used as input data, and the solubility data (i.e., soluble or insoluble) of the sample protein is used as the output data of the prediction model to train the model.
  • the output data of the prediction model will be consistent with the solubility data of the sample protein.
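  • A hedged sketch of this training setup with AutoGluon's TabularPredictor, restricted to the NN_Torch and FASTAI candidates and optimized for MCC as described above; the placeholder data and column names are assumptions, and the Bayesian hyperparameter search reported in the embodiment would additionally require hyperparameter_tune_kwargs together with explicit search spaces for the candidate models.

```python
import numpy as np
import pandas as pd
from autogluon.tabular import TabularPredictor

# Placeholder data; in practice the rows are the spliced 2526-dimensional feature
# vectors of the PSI:Biology training proteins and the label is soluble / insoluble.
X = np.random.rand(200, 2526)
y = np.random.randint(0, 2, size=200)
train_df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
train_df["soluble"] = y

predictor = TabularPredictor(
    label="soluble",
    problem_type="binary",
    eval_metric="mcc",                               # MCC as the optimization metric
).fit(
    train_df,
    hyperparameters={"NN_TORCH": {}, "FASTAI": {}},  # restrict candidates to the two deep models
    # The embodiment also enables Bayesian hyperparameter search via
    # hyperparameter_tune_kwargs; search spaces would need to be supplied for that.
)

test_df = pd.DataFrame(np.random.rand(5, 2526), columns=[f"f{i}" for i in range(2526)])
print(predictor.predict(test_df))        # soluble / insoluble classification
print(predictor.predict_proba(test_df))  # class probabilities (used for the regression reading)
```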
  • the prediction model constructed by the method of, for example, FIG4 can be used for the prediction method shown in FIG1.
  • the protein solubility data of the training samples include whether the protein is soluble
  • the prediction result of the protein solubility can be the probability that the protein is soluble.
  • the results on the training set and the independent test sets are shown in Tables 3 to 6.
  • the comparative data in Tables 3 to 5 are all from: Vineet Thumuluri, Hannah-Marie Martiny, Jose J Almagro Armenteros, Jesper Salomon, Henrik Nielsen, Alexander Rosenberg Johansen, "NetSolP: predicting protein solubility in Escherichia coli using language models", Bioinformatics, Volume 38, Issue 4, February 2022, Pages 941–946, https://doi.org/10.1093/bioinformatics/btab801.
  • the independent test set, the CamSol dataset, contains 56 mutant sequences, of which the Trevino study contributes 22 protein variants, Miklos 3, Tan 1 and Dudgeon 30.
  • Different prediction models and the disclosed model predict the accuracy of changes in protein solubility after point mutations.
  • AUC: Area Under the Curve
  • ROC: Receiver Operating Characteristic curve
  • Accuracy refers to the ratio of correctly classified samples to the total number of samples.
  • Precision refers to the proportion of correctly predicted results among all the results predicted as positive samples.
  • MCC: Matthews correlation coefficient
  • MCC = (TP*TN - FP*FN) / sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN)).
  • the value range of MCC is between -1 and +1, where +1 indicates perfect prediction, 0 indicates random prediction, and -1 indicates that the prediction is completely inconsistent with the actual observation.
  • TP: True Positive
  • TN: True Negative
  • FP: False Positive
  • FN: False Negative
  • P and N represent the number of positive samples and negative samples.
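  • A small worked example of the MCC formula above on illustrative labels, cross-checked against scikit-learn's matthews_corrcoef.

```python
from math import sqrt
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 0, 0, 1, 0, 1, 0]   # illustrative ground-truth labels (1 = soluble)
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]   # illustrative predicted labels

# Confusion-matrix counts as defined above.
TP = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
TN = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
FP = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
FN = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

mcc = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print(mcc, matthews_corrcoef(y_true, y_pred))  # both evaluate to 0.5 here
```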
  • the prediction model constructed by the present invention not only has better classification performance, but also accommodates regression problems (such as predicting the effect of amino acid mutations on protein solubility). Because different data sets come from different sources, the experimental conditions (including culture medium, cell line, and protein expression and purification conditions) also differ, so the performance of the trained model decreases when it is applied in a generalized manner. However, the workflow of the present invention is simple and easy to transplant: by retraining on experimental data generated under specific experimental conditions as the training set, the solubility prediction performance of the model can be further improved, and the workflow can even be used for problems such as protein expression prediction and the prediction of protein folding type and folding rate constants.
  • FIG. 5 shows a schematic block diagram of a system for predicting protein solubility according to the present disclosure.
  • the system 500 for predicting protein solubility includes an acquisition module 510 and a processing module 520.
  • the acquisition module 510 is used to acquire a sample to be predicted, where the sample to be predicted includes a protein sequence of the sample to be predicted.
  • the processing module 520 is used to obtain the prediction result of the protein solubility of the sample to be predicted by inputting the protein sequence of the sample to be predicted.
  • the processing module 520 may further include: a first feature extraction submodule 521, which is used to extract features based on the protein sequence of the sample to be predicted using the protein pre-training model; a second feature extraction submodule 522, which is used to extract the primary structural features of the protein based on the protein sequence of the sample to be predicted;
  • the feature splicing submodule 523 is used to splice the features extracted by the protein pre-training model with the primary structural features of the protein to obtain the feature vector of the protein of the sample to be predicted;
  • the model prediction submodule 524 is used to input the feature vector of the protein of the sample to be predicted into the protein solubility prediction model based on the automatic machine learning framework to obtain the prediction result of the protein solubility of the sample to be predicted.
  • the system for predicting protein solubility described above can be implemented by a computer.
  • the following operations can be implemented by executing a computer program: obtaining a sample to be predicted, the sample to be predicted includes a protein sequence of the sample to be predicted; by inputting the protein sequence of the sample to be predicted, a prediction result of the protein solubility of the sample to be predicted is obtained, and the operation further includes the following sub-operations: extracting features based on the protein sequence of the sample to be predicted using a protein pre-training model; extracting the primary structural features of the protein based on the protein sequence of the sample to be predicted; splicing the features extracted by the protein pre-training model with the primary structural features of the protein to obtain a feature vector of the protein of the sample to be predicted; inputting the feature vector of the protein of the sample to be predicted into the protein solubility prediction model to obtain a prediction result of the protein solubility of the sample to be predicted.
  • each operation function corresponding to the prediction method is described as a module or sub-module, those skilled in the art should understand that such a module or sub-module may not be composed of circuit components or other physical components, but a functional module constructed by a computer program.
  • Non-transient computer-readable media include various types of tangible storage media.
  • examples of non-transient computer-readable media include magnetic recording media (such as floppy disks, tapes, and hard disk drives), magneto-optical recording media (such as magneto-optical disks), CD-ROMs (compact disk read-only memories), CD-Rs, CD-R/Ws, and semiconductor memories (such as ROMs, PROMs (programmable ROMs), EPROMs (erasable PROMs), flash ROMs, and RAMs (random access memories)).
  • these programs can be provided to the computer using various types of transient computer-readable media. Examples of transient computer-readable media include electrical signals, optical signals, and electromagnetic waves. Transient computer-readable media can be used to provide programs to the computer through wired communication paths or wireless communication paths such as wires and optical fibers.
  • a non-transitory computer-readable storage medium may be provided for storing a computer program, wherein the computer program includes instructions, which, when executed by a processor of an electronic device, enable the electronic device to implement the protein solubility prediction method as described above.
  • a computer system comprising: a processor; a memory; and a computer program.
  • the computer program is stored in the memory and is configured to be executed by the processor.
  • the computer program includes instructions for implementing the protein solubility prediction method described above.
  • a non-transitory computer-readable storage medium may be provided for storing a computer program, wherein the computer program includes instructions that, when executed by a processor of an electronic device, enable the electronic device to implement the method for constructing a protein solubility prediction model as described above.
  • a computer system comprising: a processor; a memory; and a computer program.
  • the computer program is stored in the memory and is configured to be executed by the processor.
  • the computer program includes instructions for implementing the method for constructing the protein solubility prediction model described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Software Systems (AREA)
  • Genetics & Genomics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention relates to a method for predicting the solubility of a protein. Features extracted on the basis of a protein pre-training model are combined with primary structural features of a protein, a machine learning framework is combined with an automatic hyperparameter selection algorithm, and the MCC coefficient is used as the model optimization index, so as to build a prediction model suitable for both a classification problem and a regression problem, making it possible to predict whether the protein is soluble and also to predict the solubility probabilities of different proteins, which facilitates the early selection of high-potential variants.
PCT/CN2024/092583 2023-05-11 2024-05-11 Procédé de prédiction de solubilité d'une protéine Pending WO2024230826A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202480001588.4A CN118696380A (zh) 2023-05-11 2024-05-11 蛋白质可溶性预测方法

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310533179 2023-05-11
CN202310533179.6 2023-05-11

Publications (1)

Publication Number Publication Date
WO2024230826A1 true WO2024230826A1 (fr) 2024-11-14

Family

ID=93431375

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/092583 Pending WO2024230826A1 (fr) 2023-05-11 2024-05-11 Procédé de prédiction de solubilité d'une protéine

Country Status (1)

Country Link
WO (1) WO2024230826A1 (fr)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150045237A1 (en) * 2012-03-16 2015-02-12 Max-Delbrueck-Centrum Fuer Molekulare Medizin Method for identification of the sequence of poly(a)+rna that physically interacts with protein
CN111210871A (zh) * 2020-01-09 2020-05-29 青岛科技大学 Protein-protein interaction prediction method based on deep forest
US20200372339A1 (en) * 2019-05-23 2020-11-26 Salesforce.Com, Inc. Systems and methods for verification of discriminative models
CN112906755A (zh) * 2021-01-27 2021-06-04 深圳职业技术学院 Plant resistance protein identification method, apparatus, device and storage medium
CN113223620A (zh) * 2021-05-13 2021-08-06 西安电子科技大学 Protein solubility prediction method based on multi-dimensional sequence embedding
CN114550824A (zh) * 2022-01-29 2022-05-27 河南大学 Protein fold recognition method and system based on embedding features and imbalanced classification loss
CN114582423A (zh) * 2022-02-26 2022-06-03 河南省健康元生物医药研究院有限公司 Protein solubility prediction method based on a combined machine learning model
CN115458039A (zh) * 2022-08-08 2022-12-09 北京分子之心科技有限公司 Method and system for machine learning-based single-sequence protein structure prediction
CN115527609A (zh) * 2022-10-31 2022-12-27 郑州科技学院 β-lactamase protein prediction method based on ensemble learning



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24803104

Country of ref document: EP

Kind code of ref document: A1