WO2024230826A1

WO2024230826A1 - Protein solubility prediction method

Info

Publication number: WO2024230826A1
Application number: PCT/CN2024/092583
Authority: WO
Inventors: 樊隆
Original assignee: Genscript Shanghai Biotech Co Ltd
Current assignee: Genscript Shanghai Biotech Co Ltd
Priority date: 2023-05-11
Filing date: 2024-05-11
Publication date: 2024-11-14
Anticipated expiration: 2025-11-11

Abstract

A protein solubility prediction method. Features, which are extracted on the basis of a protein pre-training model, are combined with primary structural features of a protein, an automatic learning framework is combined with an automatic hyper-parameter screening algorithm, and an MCC is used as a model optimization index, so as to construct a prediction model for both a classification problem and a regression problem, such that whether the protein is soluble can be predicted, and the solubility probabilities of different proteins can also be predicted, thus facilitating the screening of a high-potential variant in advance.

Description

Protein solubility prediction method

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to Chinese patent application No. 202310533179.6, filed on May 11, 2023, the entire contents of which are incorporated herein by reference.

Technical Field

The present disclosure relates to protein research, and more particularly to methods for predicting protein solubility.

Background Art

Improving protein solubility plays a very important role in increasing the yield of recombinant proteins, in protein function research (including structural and functional proteomics), and in precursor screening for enzyme engineering, and it bears directly on the function and production cost of recombinant proteins. For example, in protein engineering, if the solubility of different protein variants, an important evaluation index, can be accurately predicted in advance from the amino acid sequence, screening efficiency improves, high-potential proteins can be identified, and the scale of downstream experimental verification is effectively reduced. There are many ways to improve protein solubility during production. For example, in the Escherichia coli expression system, inclusion bodies of recombinant proteins can be reduced, and the effective yield of protein (i.e., soluble protein) increased, by using weak promoters, using specific protein tags or changing signal peptides, denaturing with strong denaturants and then renaturing, expressing at low temperature for a long time, optimizing codons, and co-expressing molecular chaperones. However, accurately predicting protein solubility in advance remains a very important problem, which is why new models and tools for predicting protein solubility have continued to appear in recent years (see Table 1).

The primary sequence of a protein determines its higher-order structure after folding, and the higher-order structure determines its solubility. Therefore, both sequence features based on the primary structure and features based on the higher-order structure have been used to construct protein solubility prediction models, including protein sequence length, amino acid residue composition (such as the number of charged residues, the number of hydrophobic/hydrophilic residues, and k-mer features), protein secondary structure (such as the proportions of alpha helix and beta sheet), and protein tertiary structure (such as the angle between the alpha carbon atoms of two amino acids) (for detailed feature descriptions, see the papers in Table 1). The statistical, machine learning, and deep learning methods currently used to predict protein solubility include: Gaussian discriminant analysis, linear regression, support vector machines, logistic regression, linear programming, random forests, gradient boosting machines, naive Bayes, neural networks, convolutional neural networks, dual-channel convolutional neural networks, bidirectional gated recurrent unit neural networks, and graph neural networks (for details of model usage, see the text of each paper in Table 1).

Table 1: Currently known methods for predicting protein solubility

However, these existing protein solubility prediction models cannot handle both classification problems and regression problems, and their accuracy needs to be improved.

Summary of the Invention

Because it is difficult for prior-art protein solubility classification models to achieve high accuracy in both classification and regression, the present disclosure proposes a protein solubility prediction method. The method combines features extracted by a protein pre-training model with primary structural features of the protein and uses an automatic machine learning framework to build and train a model that is compatible with both classification problems (predicting whether a protein is soluble) and regression problems (predicting the probability that different proteins are soluble, which makes it possible to predict the effect of amino acid mutations on solubility and helps to screen high-potential variants in advance). Furthermore, the hyperparameter combination that optimizes model performance can be obtained through automatic hyperparameter screening in the automatic machine learning framework.
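The abstract names MCC as the model optimization index. Assuming MCC denotes the Matthews correlation coefficient, as is usual for binary classification, a minimal sketch of the metric is given below. The function name and the convention of returning 0.0 when the denominator vanishes are illustrative assumptions, not taken from the disclosure.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = soluble, 0 = insoluble)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike plain accuracy, this index uses all four cells of the confusion matrix, which is one common reason for preferring it when the soluble and insoluble classes are unbalanced.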

According to a first aspect of the present disclosure, a protein solubility prediction method is provided. The method may include: extracting features from a protein sequence using a protein pre-training model; extracting primary structural features of the protein from the protein sequence; concatenating the features extracted by the protein pre-training model with the primary structural features of the protein to obtain a feature vector of the protein; and inputting the feature vector of the protein into a protein solubility prediction model to obtain a prediction result of the protein solubility.

In the method according to the first aspect of the present disclosure, preferably, the protein pre-training model can be implemented by one or a combination of the following models: ESM-1b, UniRep, ProteinBert, TAPE, ProtGPT2, ProtTXL, ProtBert, ProtXLNet, ProtAlbert, ProtElectra, ProtT5-XL-BFD, ProtT5-XL-UniRef50 and ProtT5-XXL; more preferably, the protein pre-training model is ProtT5-XL-BFD or ProtT5-XL-UniRef50; most preferably, the protein pre-training model is ProtT5-XL-UniRef50.

In the method according to the first aspect of the present disclosure, preferably, the protein sequence can be used as input and the ProtT5-XL-UniRef50 pre-trained model can be used to extract the embedding vectors output by the encoder, wherein each amino acid corresponds to a 1024-dimensional vector and each sequence corresponds to an L×1024 feature matrix, where L is the length of the amino acid sequence of the protein; the feature matrix is then averaged column-wise to obtain a 1024-dimensional feature vector.

Preferably, the Sentence-T5 encoder-only mean method can be used to average, column by column, each dimension of the embedding values of all amino acids output by the encoder, to obtain a 1024-dimensional feature vector.

In the method according to the first aspect of the present disclosure, preferably, the protein sequence can be used as input and the iFeature tool can be used to extract the primary structural features of the protein.
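The column-wise averaging described above can be sketched as follows. This is a minimal illustration in plain Python; the function name and the toy 3×4 matrix are assumptions standing in for real L×1024 encoder output, and an array library would normally be used in practice.

```python
from typing import List, Sequence


def mean_pool(per_residue: Sequence[Sequence[float]]) -> List[float]:
    """Average an L x D per-residue embedding matrix over its L rows (column-wise mean)."""
    if not per_residue:
        raise ValueError("empty sequence: no residue embeddings to pool")
    length = len(per_residue)
    # zip(*matrix) iterates over columns, i.e. one embedding dimension at a time.
    return [sum(column) / length for column in zip(*per_residue)]


# Toy stand-in for encoder output: L = 3 residues, D = 4 dimensions (the text uses D = 1024).
toy_embeddings = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]
pooled = mean_pool(toy_embeddings)  # one D-dimensional vector, independent of L
```

Pooling over the residue axis yields a vector of fixed length regardless of L, which is what allows proteins of different lengths to share one downstream model.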

Preferably, the primary structural features of the protein can be one or more selected from AAC (Amino Acid Composition), DPC (Di-Peptide Composition), DDE (Dipeptide Deviation from Expected Mean), TPC (Tri-Peptide Composition), CKSAAP (Composition of k-Spaced Amino Acid Pairs), EAAC (Enhanced Amino Acid Composition), GAAC (Grouped Amino Acid Composition), CKSAAGP (Composition of k-Spaced Amino Acid Group Pairs), GDPC (Grouped Di-Peptide Composition), GTPC (Grouped Tri-Peptide Composition), Moran (Moran correlation), Geary (Geary correlation), NMBroto (Normalized Moreau-Broto Autocorrelation), CTDC (Composition, Transition and Distribution: Composition), CTDT (Composition, Transition and Distribution: Transition), CTDD (Composition, Transition and Distribution: Distribution), CTriad (Conjoint Triad), KSCTriad (k-Spaced Conjoint Triad), SOCNumber (Sequence-Order-Coupling Number), QSOrder (Quasi-Sequence-Order), PAAC (Pseudo-Amino Acid Composition), APAAC (Amphiphilic Pseudo-Amino Acid Composition), KNNprotein (K-Nearest Neighbor for proteins), PSSM (PSSM profile, position-specific scoring matrix), AAINDEX (AAindex, amino acid index) and BLOSUM62 (blocks substitution matrix, an amino acid substitution scoring matrix used for sequence alignment in bioinformatics).

More preferably, the primary structural features of the protein extracted using the iFeature tool comprise 11 categories: AAC, GAAC, Moran, Geary, NMBroto, CTDC, CTDT, CTDD, CTriad, PAAC and APAAC.
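To make the simplest of these descriptors concrete, the sketch below computes AAC, the frequency of each of the 20 standard amino acids in a sequence. It is a hand-written illustration of the descriptor's definition, not the iFeature implementation; the function name and the handling of non-standard residues (ignored) are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, alphabetical by one-letter code


def amino_acid_composition(sequence):
    """Return 20 frequencies, one per standard amino acid, in AMINO_ACIDS order."""
    counts = Counter(sequence.upper())
    # Assumption: residues outside the 20 standard letters (e.g. X) are ignored.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


aac = amino_acid_composition("MKKLA")  # 5 residues: M, K, K, L, A
```

Each of the other descriptor families contributes its own fixed-length block in the same way, so the blocks can be concatenated into a single primary-structure feature vector.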

In the method according to the first aspect of the present disclosure, preferably, the protein solubility prediction model can be constructed by the following steps: obtaining a data set, the data set comprising a plurality of training samples, each of the training samples comprising a sample protein sequence and its solubility data; extracting features from the sample protein sequence using a protein pre-training model; extracting primary structural features of the protein from the sample protein sequence; concatenating the features extracted by the protein pre-training model with the primary structural features of the protein to obtain a feature vector of the sample protein; and training a machine learning model, using the feature vector of the sample protein as input data and the solubility data of the sample protein as output data, to obtain the final protein solubility prediction model.

In the method according to the first aspect of the present disclosure, preferably, the protein solubility prediction model can be constructed and trained based on an automatic machine learning framework, using one or more machine learning models as candidate models. In some embodiments, the machine learning model can be a deep learning model. In some embodiments, the automatic machine learning framework can be selected from AutoGluon (https://auto.gluon.ai/stable/index.html), Auto-Sklearn (https://automl.github.io/auto-sklearn/master/), TPOT (http://epistasislab.github.io/tpot/), Hyperopt Sklearn (https://github.com/hyperopt/hyperopt-sklearn), Auto-Keras (https://autokeras.com/), H2O AutoML (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html) and TransmogrifAI (https://transmogrif.ai/). In some specific embodiments, the automatic machine learning framework can be AutoGluon.

In some embodiments, the machine learning model is a deep learning model, which can be selected from NN_Torch, FASTAI, LGBModel, RFModel, KNNModel, etc. In some embodiments, one deep learning model or several deep learning models can be used. In a preferred embodiment, the deep learning model can be NN_Torch and/or FASTAI. In a specific embodiment, the deep learning models can be NN_Torch and FASTAI.

In the method according to the first aspect of the present disclosure, the machine learning model can be selected automatically by the automatic machine learning framework, or a list and screening range of machine learning models can be set manually and the selection then made by the automatic machine learning framework. Preferably, AutoGluon can be used as the automatic machine learning framework, and the deep learning models NN_Torch and FASTAI can be selected as candidate models, to construct and train the protein solubility prediction model.

In the method according to the first aspect of the present disclosure, preferably, other features of the protein can also be extracted from the protein sequence; the features extracted by the protein pre-training model are concatenated with the primary structural features of the protein and the other features of the protein to obtain the concatenated feature vector of the protein.
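The preferred configuration can be sketched with AutoGluon's tabular API as follows. This is an illustration under stated assumptions rather than the applicant's actual training script: the helper names, the label column name `soluble`, and the use of default (empty) settings for each candidate model are hypothetical, and `mcc` is chosen as the evaluation metric because the abstract names MCC as the optimization index.

```python
def build_fit_settings():
    """Restrict the candidate models to the two neural network families named in the text."""
    return {
        "eval_metric": "mcc",  # Matthews correlation coefficient as the optimization index
        "hyperparameters": {"NN_TORCH": {}, "FASTAI": {}},  # empty dicts = framework defaults
    }


def train_solubility_model(train_table, label_column="soluble"):
    """Fit a predictor on a table with one row per protein: feature columns plus a label.

    `label_column` is a hypothetical column name used only for this sketch.
    """
    # Imported lazily so the settings helper above can be used without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    settings = build_fit_settings()
    predictor = TabularPredictor(label=label_column, eval_metric=settings["eval_metric"])
    return predictor.fit(train_table, hyperparameters=settings["hyperparameters"])
```

After fitting, `predict` returns class labels and `predict_proba` returns class probabilities, which correspond to the two kinds of output the disclosure describes.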

Preferably, the other features of the protein are one or more selected from molecular weight, aromaticity, instability index, flexibility, isoelectric point, molar absorption coefficient, grand average of hydropathy (GRAVY), the proportions of helix/turn/sheet in the secondary structure of the protein, SSEC (Secondary Structure Elements Content), SSEB (Secondary Structure Elements Binary), Disorder, DisorderC (Disorder Content), DisorderB (Disorder Binary), ASA (Accessible Solvent Accessibility), TA (Torsion Angle), sequence length, ZSCALE (Z-scale) and 48 PseKRAAC (48 pseudo K-tuple reduced amino acids composition); more preferably, the other features of the protein include molecular weight, aromaticity, instability index, flexibility, isoelectric point, molar absorption coefficient, grand average of hydropathy and the proportions of helix/turn/sheet in the secondary structure of the protein.
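Two of these sequence-derived features are simple enough to illustrate directly. The sketch below computes aromaticity as the fraction of F, W and Y residues and the grand average of hydropathy using the Kyte-Doolittle hydropathy scale. The disclosure does not say which tool or scale is used for these features, so both function names and the choice of scale are assumptions.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def aromaticity(sequence):
    """Fraction of residues that are aromatic (Phe, Trp or Tyr)."""
    seq = sequence.upper()
    return sum(seq.count(aa) for aa in "FWY") / len(seq)


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value over all residues.

    Raises KeyError on non-standard residues rather than silently skipping them.
    """
    seq = sequence.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
```

A positive GRAVY value indicates a predominantly hydrophobic sequence and a negative value a predominantly hydrophilic one.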

According to a second aspect of the present disclosure, a method for constructing a protein solubility prediction model is provided. The construction method may include: obtaining a data set, the data set comprising a plurality of training samples, each of the training samples comprising a sample protein sequence and its solubility data; extracting features from the sample protein sequence using a protein pre-training model; extracting primary structural features of the protein from the sample protein sequence; concatenating the features extracted by the protein pre-training model with the primary structural features of the protein to obtain a feature vector of the sample protein; and training a machine learning model, using the feature vector of the sample protein as input data and the solubility data of the sample protein as output data, to obtain the final protein solubility prediction model.

The description of the method for constructing the protein solubility prediction model given under the first aspect of the present disclosure applies equally to the second aspect.

According to a third aspect of the present disclosure, a system for predicting protein solubility is provided. The system may include: an acquisition module for acquiring a sample to be predicted, the sample to be predicted comprising a protein sequence of the sample to be predicted; and a processing module for obtaining a prediction result of the protein solubility of the sample to be predicted from the input protein sequence of the sample to be predicted. The processing module further includes the following submodules: a first feature extraction submodule for extracting features from the protein sequence of the sample to be predicted using a protein pre-training model; a second feature extraction submodule for extracting primary structural features of the protein from the protein sequence of the sample to be predicted; a feature concatenation submodule for concatenating the features extracted by the protein pre-training model with the primary structural features of the protein to obtain a feature vector of the protein of the sample to be predicted; and a model prediction submodule for inputting the feature vector of the protein of the sample to be predicted into a protein solubility prediction model to obtain the prediction result of the protein solubility of the sample to be predicted.

The system for predicting protein solubility according to the third aspect of the present disclosure can be implemented by a computer, preferably by executing a computer program to perform the following operations: acquiring a sample to be predicted, the sample to be predicted comprising a protein sequence of the sample to be predicted; and obtaining a prediction result of the protein solubility of the sample to be predicted from the input protein sequence of the sample to be predicted. These operations further include the following sub-operations: extracting features from the protein sequence of the sample to be predicted using a protein pre-training model; extracting primary structural features of the protein from the protein sequence of the sample to be predicted; concatenating the features extracted by the protein pre-training model with the primary structural features of the protein to obtain a feature vector of the protein of the sample to be predicted; and inputting the feature vector of the protein of the sample to be predicted into the protein solubility prediction model to obtain the prediction result of the protein solubility of the sample to be predicted.

According to a fourth aspect of the present disclosure, a non-transitory computer-readable storage medium for storing a computer program is provided. The computer program includes instructions. When executed by a processor of an electronic device, the instructions cause the electronic device to implement the protein solubility prediction method according to the first aspect of the present disclosure or the method for constructing a protein solubility prediction model according to the second aspect of the present disclosure.

According to a fifth aspect of the present disclosure, a computer system is provided. The computer system comprises a processor, a memory, and a computer program. The computer program is stored in the memory and is configured to be executed by the processor. The computer program comprises instructions for implementing the protein solubility prediction method according to the first aspect of the present disclosure or the method for constructing a protein solubility prediction model according to the second aspect of the present disclosure.

According to a sixth aspect of the present disclosure, a protein solubility prediction model constructed by the method of the second aspect of the present disclosure is provided, together with the use of this protein solubility prediction model in predicting protein solubility. For example, the protein solubility prediction model can be applied to predict the solubility of any protein sample of known sequence (such as a mutant of a known protein), and the prediction result can be either a probability of solubility or a yes/no result indicating whether the protein is soluble.

The overall performance of the method of the present disclosure in predicting protein solubility exceeds that of NetSolP, the method most recently ranked first in classification performance.

In addition, the method of the present disclosure predicts protein solubility with high accuracy and is applicable both to the classification problem of predicting whether a protein is soluble and to the regression problem of predicting the effect of amino acid mutations on solubility. It can be used to pre-assess the solubility of recombinant protein expression and to screen enzyme engineering mutants based on solubility potential.
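The two output forms mentioned above are related in a simple way, sketched below: a predicted probability can be reported directly, converted to a yes/no result with a cut-off, or used to rank variants of a known protein. The 0.5 threshold, the function names and the variant labels in the usage example are illustrative assumptions, not values from the disclosure.

```python
def to_soluble_label(probability, threshold=0.5):
    """Convert a predicted probability of solubility into a yes/no result.

    The default threshold of 0.5 is an assumption made for this sketch.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return probability >= threshold


def rank_variants(probability_by_variant):
    """Order variant names from most to least likely to be soluble."""
    return sorted(probability_by_variant, key=probability_by_variant.get, reverse=True)
```

For instance, `rank_variants({"wild_type": 0.4, "A12V": 0.9, "K30E": 0.6})` places the hypothetical variant `A12V` first, which is the kind of ordering used to pick high-potential variants before experimental verification.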

BRIEF DESCRIPTION OF THE DRAWINGS

The present disclosure will be more fully understood from the following detailed description taken in conjunction with the accompanying drawings, in which like elements are numbered alike, and in which:

FIG. 1 is a flowchart of a protein solubility prediction method according to an embodiment of the present disclosure.

FIG. 2 is a schematic diagram showing the relationship between the protein solubility prediction method and the prediction model construction method according to an embodiment of the present disclosure.

FIG. 3 is a more detailed flowchart of a protein sequence-based feature extraction method according to an embodiment of the present disclosure.

FIG. 4 is a more detailed flowchart of the method for constructing the protein solubility prediction model used in the present disclosure.

FIG. 5 is a schematic block diagram of a system for predicting protein solubility according to the present disclosure.

DETAILED DESCRIPTION

Unless otherwise defined, technical and scientific terms used in the present disclosure have the same meaning as commonly understood by one of ordinary skill in the art to which the present disclosure belongs.

The technical solution of the present disclosure is described in further detail below by way of embodiments and with reference to the accompanying drawings. Unless otherwise specified, the methods and materials of the embodiments described below are conventional products that can be purchased on the market. Those skilled in the art to which the present disclosure belongs will understand that the methods and materials described below are merely exemplary and should not be construed as limiting the scope of the present disclosure.

Protein solubility prediction can be divided into two types of problem. One is the binary or multi-class classification problem of predicting whether a protein is soluble or insoluble, or how easily it dissolves; the other is the regression problem of predicting the probability that a protein is soluble. Either type of prediction depends on solubility data collected from expression experiments. The commonly used experimental protein solubility data sets are listed in Table 2.

Table 2: List of experimental protein solubility data sets

Because classification cannot simply be predicted from a regression model, it is difficult for prior-art protein solubility classification models to achieve high accuracy in both classification and regression, and existing protein solubility classification methods transfer poorly to regression prediction. For example, the NetSolP method achieves an accuracy of 72.8% on the independent NESG test set for the classification problem, but an accuracy of only 66.1% on the independent CamSol test set for the regression problem.

The present disclosure provides a protein solubility prediction method. By constructing a prediction model that is compatible with both classification problems and regression problems, the method can predict both whether a protein is soluble and the probability that different proteins are soluble, which helps to screen high-potential variants in advance.

As an overview of the prediction method, FIG. 1 shows a flowchart of a protein solubility prediction method according to an embodiment of the present disclosure.

As shown in FIG. 1, the protein solubility prediction method 100 according to an embodiment of the present disclosure starts at step S110, in which features are extracted from the protein sequence using a protein pre-training model.

Next, in step S120, the primary structural features of the protein are extracted, again from the protein sequence.

In step S130, the features extracted by the protein pre-training model in step S110 are concatenated with the primary structural features of the protein extracted in step S120 to obtain a feature vector of the protein.

Finally, in step S140, the feature vector of the protein obtained in step S130 is input into a protein solubility prediction model to obtain a prediction result of protein solubility.
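Steps S110 to S140 can be sketched as a single function that takes the two feature extractors and the trained model as parameters. The sketch only fixes the order of operations and the concatenation; the parameter names are hypothetical, and any real embedding model, descriptor tool and predictor would be plugged in by the caller.

```python
from typing import Callable, List, Sequence

FeatureExtractor = Callable[[str], Sequence[float]]


def predict_solubility(
    sequence: str,
    extract_pretrained: FeatureExtractor,
    extract_primary: FeatureExtractor,
    model: Callable[[List[float]], float],
) -> float:
    """Run steps S110 to S140 for one protein sequence."""
    pretrained_features = list(extract_pretrained(sequence))   # S110: pre-training model features
    primary_features = list(extract_primary(sequence))         # S120: primary structural features
    feature_vector = pretrained_features + primary_features    # S130: concatenation
    return model(feature_vector)                                # S140: solubility prediction
```

Because the same function body serves both training-time feature preparation and prediction, keeping the concatenation order fixed is what guarantees that the model sees features in the positions it was trained on.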

在根据本公开的优选实施例中，蛋白质可溶性的预测结果是：蛋白质被划分为可溶或不可溶；在根据本公开的另一优选实施例中，蛋白质可溶性的预测结果是蛋白质可溶的概率。 In a preferred embodiment according to the present disclosure, the prediction result of protein solubility is: the protein is classified as soluble or insoluble; in another preferred embodiment according to the present disclosure, the prediction result of protein solubility is the probability that the protein is soluble.

如前所述，本公开的方法将基于蛋白质预训练模型提取的特征与蛋白质一级结构特征结合，运用自动学习框架，构建兼容分类问题(预测蛋白质是否可溶)和回归问题(预测不同蛋白质可溶的概率)的预测模型。As mentioned above, the method disclosed in the present invention combines the features extracted based on the protein pre-training model with the primary structure features of the protein, and uses an automatic learning framework to construct a prediction model that is compatible with classification problems (predicting whether a protein is soluble) and regression problems (predicting the probability that different proteins are soluble).

FIG. 2 is a schematic diagram 200 of the relationship between the protein solubility prediction method and the prediction model construction method according to an embodiment of the present disclosure. In the schematic diagram 200, the model construction method is shown to the left of the dotted line, and the method of using the model for prediction is shown to the right.

The process to the right of the dotted line in FIG. 2 is the protein solubility prediction method of FIG. 1. The process to the left of the dotted line is described in detail below (see FIG. 4 and its accompanying text). As FIG. 2 shows, both model construction and actual prediction use the protein feature extraction method, which is also described in detail below (see FIG. 3 and its accompanying text). Each training sample includes both a protein sequence and the protein solubility data for that sequence; as described later, the solubility data may indicate whether the protein is soluble or insoluble. During model construction, features are therefore extracted from the protein sequences of the training samples. With the extracted protein features as input data and the solubility data of the training samples as output data, a model based on an automatic machine learning framework (for example, the AutoGluon framework) is fully trained to obtain the final protein solubility prediction model. The constructed and trained prediction model can then be used to predict the protein solubility of samples to be predicted, which is the process to the right of the dotted line in FIG. 2.

The protein feature extraction method and the method for constructing the protein solubility prediction model based on an automatic machine learning framework, according to embodiments of the present disclosure, are each described in more detail below.

FIG. 3 is a more detailed flowchart of a protein feature extraction method 300 according to an embodiment of the present disclosure.

As shown in step S110 of the prediction method of FIG. 1, the protein feature extraction first extracts features from the protein sequence using a protein pre-training model.

The protein pre-training model may be any one of, or a combination of, the following models: ESM-1b, UniReP, ProteinBert, TAPE, ProtGPT2, ProtTXL, ProtBert, ProtXLNet, ProtAlbert, ProtElectra, ProtT5-XL-BFD, ProtT5-XL-UniRef50 and ProtT5-XXL.

Preferably, ProtT5-XL-BFD or ProtT5-XL-UniRef50 is used as the protein pre-training model.

In a preferred embodiment of the present disclosure, the protein sequence pre-training model may be the ProtT5-XL-UniRef50 pre-trained model. ProtT5-XL-UniRef50 is a protein sequence pre-training model trained with a masked language modeling (MLM) objective. See A. Elnaggar et al., “ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 7112-7127, 1 Oct. 2022, doi: 10.1109/TPAMI.2021.3095381. The entire contents of the above document are hereby incorporated into the present disclosure by reference and form part of the present disclosure. The source code of the ProtT5-XL-UniRef50 model and more detailed usage information can be found at https://github.com/agemagician/ProtTrans, where the model was first released. ProtT5-XL-UniRef50 is based on the t5-3b model and is pre-trained on a large number of protein sequences in a self-supervised manner. The model can be used for protein feature extraction, preferably using the features extracted from the encoder.

As shown in FIG. 3, the protein feature extraction method 300 according to an embodiment of the present disclosure may start at step S310. As described above, according to a preferred embodiment of the present disclosure, in step S310 the protein sequence of the sample to be predicted is taken as input and the ProtT5-XL-UniRef50 pre-trained model is used to extract the embedding-layer vectors output by the encoder. Each amino acid corresponds to a 1024-dimensional vector, and each sequence corresponds to an L×1024 feature matrix, where L is the length of the protein's amino acid sequence.

Although the preferred embodiment of the present disclosure uses the ProtT5-XL-UniRef50 pre-trained model to extract the feature matrix of the protein sequence, those skilled in the art would be fully motivated to replace it with another or a newly developed protein pre-training model, provided that the inputs and outputs remain essentially unchanged and the function and effect are similar or better. Those skilled in the art should understand that a solution with such a replacement still falls within the scope of protection claimed by the present disclosure.

Next, the obtained feature matrix may be compressed. Specifically, the feature matrix is averaged column by column to obtain a 1024-dimensional feature vector. More specifically, as shown in FIG. 3, in step S320 the feature matrix obtained in step S310 is compressed using the Sentence-T5 Encoder-only mean method: for each dimension, the embedding-layer feature values of all amino acids output by the encoder are averaged column-wise, yielding a 1024-dimensional feature vector. The embedding-layer feature values of padding tokens are excluded from the calculation, so that the embedding-layer feature values of all amino acids making up a single protein are converted into a 1024-dimensional feature vector for that protein. Sentence-T5 Encoder-only mean is one of the ways in which Sentence-T5 processes the features output by the encoder. For more information, see Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith Hall, Daniel Cer, and Yinfei Yang, “Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models”, in Findings of the Association for Computational Linguistics: ACL 2022, pages 1864–1874, Dublin, Ireland, Association for Computational Linguistics, doi: 10.18653/v1/2022.findings-acl.146. The entire contents of the above document are hereby incorporated into the present disclosure by reference and form part of the present disclosure.
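
The compression in step S320 can be illustrated with a minimal sketch. It assumes the encoder output is available as a list of per-position vectors together with a 0/1 attention mask that marks padding; the function name `masked_mean_pool` and the toy 4-dimensional vectors (standing in for 1024 dimensions) are illustrative assumptions, not part of the disclosure.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average per-residue embeddings column-wise, skipping padding positions.

    token_embeddings: list of rows, one row (list of floats) per token position.
    attention_mask:   list of 0/1 flags, 1 for a real residue, 0 for padding.
    """
    kept = [row for row, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("sequence has no non-padding positions")
    dim = len(kept[0])
    # One mean per column (embedding dimension), computed over real residues only.
    return [sum(row[d] for row in kept) / len(kept) for d in range(dim)]


# Toy example: 3 real residues plus 1 padding row, 4 dimensions instead of 1024.
matrix = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
    [9.0, 9.0, 9.0, 9.0],  # padding row, must not influence the mean
]
mask = [1, 1, 1, 0]
protein_vector = masked_mean_pool(matrix, mask)
```

With a real L×1024 matrix the same function would return one 1024-dimensional vector per protein, regardless of the sequence length L.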

Those skilled in the art should note that, although Sentence-T5 Encoder-only mean is a published technique, its application in the method of the present disclosure is novel, because the field of application of the present disclosure is entirely different from that of the original Sentence-T5 Encoder-only mean publication. The present disclosure uses the Sentence-T5 Encoder-only mean method to compress the extracted feature matrix so that the resulting feature vector integrates the characteristics of the protein more deeply. This helps in subsequently building a prediction model with better overall performance on these features and in obtaining more accurate protein solubility predictions.

According to step S120 of FIG. 1, the primary structural features of the protein are next extracted from the protein sequence.

In a preferred embodiment of the present disclosure, more specifically, as shown in FIG. 3, in step S330 the protein sequence is taken as input and the iFeature tool is used to extract the primary structural features of the protein.

For the iFeature tool, see Zhen Chen, Pei Zhao, Fuyi Li, André Leier, Tatiana T. Marquez-Lago, Yanan Wang, Geoffrey I. Webb, A. Ian Smith, Roger J. Daly, Kuo-Chen Chou, Jiangning Song, “iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences”, Bioinformatics, Volume 34, Issue 14, July 2018, Pages 2499–2502, https://doi.org/10.1093/bioinformatics/bty140. The entire contents of the above document are hereby incorporated into the present disclosure by reference and form part of the present disclosure.

Although the preferred embodiment of the present disclosure uses the iFeature tool to extract the primary structural features of the protein, those skilled in the art would be fully motivated to replace it with another or a newly developed extraction tool, provided that the input and output data remain essentially unchanged and the function and effect of the method are similar or better. Those skilled in the art should understand that a solution with such a replacement still falls within the scope of protection claimed by the present disclosure.

In some embodiments of the present disclosure, the primary structural features of the protein extracted using the iFeature tool may include one or more of: AAC (Amino Acid Composition), DPC (Di-Peptide Composition), DDE (Dipeptide Deviation from Expected Mean), TPC (Tri-Peptide Composition), CKSAAP (Composition of k-Spaced Amino Acid Pairs), EAAC (Enhanced Amino Acid Composition), GAAC (Grouped Amino Acid Composition), CKSAAGP (Composition of k-Spaced Amino Acid Group Pairs), GDPC (Grouped Di-Peptide Composition), GTPC (Grouped Tri-Peptide Composition), Moran (Moran correlation), Geary (Geary correlation), NMBroto (Normalized Moreau-Broto Autocorrelation), CTDC (Composition, Transition and Distribution: Composition), CTDT (Composition, Transition and Distribution: Transition), CTDD (Composition, Transition and Distribution: Distribution), CTriad (Conjoint Triad), KSCTriad (k-Spaced Conjoint Triad), SOCNumber (Sequence-Order-Coupling Number), QSOrder (Quasi-Sequence-Order), PAAC (Pseudo-Amino Acid Composition), APAAC (Amphiphilic Pseudo-Amino Acid Composition), KNNprotein (K-Nearest Neighbor for proteins), PSSM (PSSM profile), AAINDEX (AAindex) and BLOSUM62 (BLOSUM62).

In a preferred embodiment of the present disclosure, the primary structural features of the protein extracted using the iFeature tool comprise 11 categories: AAC, GAAC, Moran, Geary, NMBroto, CTDC, CTDT, CTDD, CTriad, PAAC and APAAC.
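
To make the simplest of these descriptors concrete, the following is a minimal sketch of AAC (amino acid composition) written from its definition, namely the fraction of each of the 20 standard residues in the sequence. It is not the iFeature implementation; the function name, the residue ordering, and the choice to ignore non-standard characters are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, in a fixed order


def amino_acid_composition(sequence):
    """Return a 20-dimensional vector with the fraction of each standard residue."""
    # Assumption: characters outside the 20 standard residues are simply ignored.
    counted = [res for res in sequence.upper() if res in AMINO_ACIDS]
    if not counted:
        raise ValueError("sequence contains no standard amino acids")
    total = len(counted)
    return [counted.count(aa) / total for aa in AMINO_ACIDS]


aac = amino_acid_composition("MKKLAA")
```

The other descriptor families follow the same pattern of mapping a sequence to a fixed-length numeric vector, which is what allows them to be concatenated later.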

In addition, these features may be standardized (standard normalization). The standardization parameters are saved and used to process the same kinds of features of input sequences during prediction; that is, the standardization parameters of the training set are saved and applied to the same features of the independent test set.
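
A minimal sketch of this fit-once, reuse-later pattern is given below. It assumes that the standardization is a per-feature z-score (subtract the training mean, divide by the training standard deviation); the function names and the handling of constant columns are illustrative assumptions.

```python
from statistics import mean, pstdev


def fit_standardizer(train_rows):
    """Compute per-feature mean and standard deviation on the training set only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_standardizer(rows, params):
    """Scale any rows (training, test, or new inputs) with the saved parameters."""
    means, stds = params
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
params = fit_standardizer(train)                         # saved alongside the model
train_scaled = apply_standardizer(train, params)
test_scaled = apply_standardizer([[3.0, 12.0]], params)  # reuses training parameters
```

Reusing the training parameters, rather than refitting on the test data, keeps the test features on the same scale the model saw during training.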

According to step S130 of FIG. 1, the features extracted by the protein pre-training model are next concatenated with the primary structural features of the protein to obtain a feature vector of the protein.

As shown in FIG. 3, in step S340 the feature vector obtained in step S320 is concatenated with the primary structural features of the protein obtained in step S330 to obtain the feature vector of the protein.

Those skilled in the art should note that the method of the present disclosure inventively proposes combining the features extracted by a protein pre-training model with the primary structural features of the protein extracted by, for example, the iFeature tool, to form the protein features used as input to the prediction model. This approach is not only proposed here for the first time but, as described below, also achieves good results in protein solubility prediction.

Furthermore, in an alternative embodiment, other features of the protein may be extracted from the protein sequence in addition to its primary structural features. These other features include molecular features, secondary structure features and/or higher-order structural features of the protein sequence, and can be used to further optimize the model.

In some specific embodiments, the iFeature tool may be used to extract the other features of the protein. For example, the molecular features of the sequence may be selected from one or more of: sequence length, ZSCALE (Z-scale) and 48PseKRAAC (48 pseudo K-tuple reduced amino acids composition); the secondary structure features may be selected from one or more of: SSEC (Secondary Structure Elements Content) and SSEB (Secondary Structure Elements Binary); and the higher-order structural features may be selected from one or more of: Disorder, DisorderC, DisorderB (Disorder Binary), ASA (Accessible Solvent Accessibility) and TA (Torsion Angle).

In other specific embodiments, the Bio.SeqUtils.ProtParam module of Biopython may also be used to extract other features. For example, the other features of the protein sequence are selected from one or more of: molecular weight, aromaticity, instability index, flexibility, isoelectric point, molar absorption coefficient and grand average of hydropathy; and the secondary structure features include the fractions of helix, turn and sheet in the secondary structure of the protein.

In some preferred embodiments, the other features of the protein include the molecular weight, aromaticity, instability index, flexibility, isoelectric point, molar absorption coefficient, grand average of hydropathy and the fractions of helix, turn and sheet in the secondary structure of the protein, extracted using the Bio.SeqUtils.ProtParam module of Biopython.

For Biopython, see Peter J. A. Cock, Tiago Antao, Jeffrey T. Chang, Brad A. Chapman, Cymon J. Cox, Andrew Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, Michiel J. L. de Hoon, “Biopython: freely available Python tools for computational molecular biology and bioinformatics”, Bioinformatics, Volume 25, Issue 11, June 2009, Pages 1422–1423, https://doi.org/10.1093/bioinformatics/btp163. The entire contents of the above document are hereby incorporated into the present disclosure by reference and form part of the present disclosure.

Although the preferred embodiment of the present disclosure uses the Bio.SeqUtils.ProtParam module of Biopython to extract the other features of the protein, those skilled in the art would be fully motivated to replace the Biopython tool with another or a newly developed extraction tool, provided that the input and output data remain essentially unchanged and the function and effect of the method are similar or better. Those skilled in the art should understand that a solution with such a replacement still falls within the scope of protection claimed by the present disclosure.

In a specific embodiment of the present disclosure, the iFeature tool is used to extract the primary structural features of the protein, including AAC, GAAC, Moran, Geary, NMBroto, CTDC, CTDT, CTDD, CTriad, PAAC and APAAC, and the Bio.SeqUtils.ProtParam module of Biopython is used to extract the other features of the protein, including molecular weight, aromaticity, instability index, flexibility, isoelectric point, molar absorption coefficient, grand average of hydropathy and the fractions of helix, turn and sheet in the secondary structure of the protein, for a total of 1502 features.

The features extracted by the protein pre-training model are then concatenated with the primary structural features and the other features of the protein to obtain the concatenated feature vector of the protein. Specifically, the 1024-dimensional feature vector extracted by the protein pre-training model is concatenated with the 1502 extracted primary structural and other features, yielding a 2526-dimensional feature vector for each protein.

Through the above series of steps S310-S340, the method of the present disclosure can, based on the protein sequence of the sample to be predicted, (1) extract a protein feature vector using the protein pre-training model, (2) extract the primary structural features of the protein using the iFeature tool, and (3) extract the other features of the protein using the Bio.SeqUtils.ProtParam module of Biopython, and then concatenate the three types of features. The concatenated features can then be used as input to the protein solubility prediction model, which predicts the protein solubility of the sample to be predicted and outputs the predicted result.

It should be understood that this embodiment places no restriction on the order in which the three types of features (the features extracted by the protein pre-training model, the primary structural features of the protein, and the other features of the protein) are concatenated, as long as the order is the same during model construction and actual prediction. For example, if during model construction the features are concatenated in the order of pre-training model features, primary structural features, and other features to obtain the feature vector of the protein, then during actual prediction they are concatenated in that same order to obtain the feature vector of the protein of the sample to be predicted.
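
The ordering requirement can be sketched as follows. The block names `pretrained`, `primary` and `other`, the helper `concatenate_features`, and the short toy vectors are hypothetical; only the dimensions 1024, 1502 and 2526 come from the text above.

```python
PLM_DIM = 1024          # pooled pre-training-model embedding (from the text)
HANDCRAFTED_DIM = 1502  # primary structural plus other features combined (from the text)

# Hypothetical block names; the order is fixed once and reused everywhere.
FEATURE_ORDER = ("pretrained", "primary", "other")


def concatenate_features(blocks, order=FEATURE_ORDER):
    """Join named feature blocks into one flat vector in a fixed order."""
    vector = []
    for name in order:
        vector.extend(blocks[name])
    return vector


# Toy blocks; the dict's own key order does not matter, only FEATURE_ORDER does.
training_vector = concatenate_features(
    {"other": [0.3], "pretrained": [0.1, 0.2], "primary": [0.5, 0.6, 0.7]}
)
prediction_vector = concatenate_features(
    {"primary": [0.5, 0.6, 0.7], "pretrained": [0.1, 0.2], "other": [0.3]}
)
```

Keeping the order in a single shared constant is one simple way to guarantee that training and prediction place every feature in the same column.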

In some embodiments, the method may further include performing dimensionality reduction on the concatenated feature vector of the protein. Dimensionality reduction further extracts features from the protein's feature vector, retaining as far as possible those that are relevant and informative, which simplifies the prediction model, reduces computational complexity and shortens computation time, making subsequent steps more efficient. For example, principal component analysis (PCA) may be used to reduce the dimensionality of the concatenated protein feature vector.

On the other hand, as shown in FIG. 2, the same series of steps can also be used to extract protein features from the training samples. The extracted protein features are used as input data to a prediction model built with an automatic machine learning framework (for example, the AutoGluon framework), and the protein solubility data of the training samples are used as output data, so that the protein solubility prediction model based on the automatic machine learning framework is trained and its parameters are finally determined, yielding a protein solubility prediction model of satisfactory accuracy.

FIG. 4 is a more detailed flowchart of a method 400 for constructing the protein solubility prediction model used in the present disclosure.

As those skilled in the art will appreciate, and as shown in FIG. 4, in step S410 a data set including multiple training samples is first obtained. Specifically, each training sample in the data set includes a sample protein sequence and its solubility data.

In a preferred embodiment of the present disclosure, the PSI:Biology dataset is selected as the training set for constructing a binary classification model; the source of the dataset is given in Table 2 above. The protein sequences and solubility data are selected from the dataset, with the solubility data divided into two classes, soluble and insoluble, to form the training dataset. The probability value output by the classification model (for example, the probability that a protein is judged to be soluble) is used for the regression problem.
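
As an illustration of how one set of predicted probabilities can serve both purposes, the sketch below turns probabilities into class labels and also ranks variants by probability. The 0.5 decision threshold, the variant names and the probability values are illustrative assumptions, not data from the disclosure.

```python
def classify_and_rank(probabilities, threshold=0.5):
    """Return (name, probability, label) tuples sorted from most to least soluble.

    probabilities: dict mapping a protein name to its predicted soluble probability.
    threshold:     assumed cut-off for the binary soluble/insoluble label.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [
        (name, prob, "soluble" if prob >= threshold else "insoluble")
        for name, prob in ranked
    ]


# Hypothetical model outputs for three variants of one protein.
predicted = {"variant_A": 0.82, "variant_B": 0.35, "variant_C": 0.61}
ranking = classify_and_rank(predicted)
```

The labels answer the classification question, while the ordering lets high-potential variants be prioritized before experimental validation.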

Then, as mentioned above, protein features are extracted from the sample protein sequences in the data set. As shown in step S420 of FIG. 4, features are extracted from all training sample proteins according to the method of FIG. 3. Specifically, features are first extracted from the sample protein sequence using the protein pre-training model. The primary structural features of the protein are then extracted from the sample protein sequence. Finally, the features extracted by the protein pre-training model are concatenated with the primary structural features of the protein to obtain the feature vector of the sample protein.

Furthermore, in an alternative embodiment, other features of the sample protein may be extracted from the sample protein sequence in addition to its primary structural features. In this case, the features extracted by the protein pre-training model are concatenated with the extracted primary structural features and the extracted other features to obtain the concatenated feature vector of the sample protein.

Next, with the concatenated feature vector of the sample protein as input data and the solubility data of that sample protein as output data, a machine learning model is trained to obtain the final protein solubility prediction model.

Specifically, the protein solubility prediction model may be constructed and trained on the basis of an automatic machine learning (AutoML) framework, using one or more machine learning models (for example, deep learning models) as candidate models.

Those skilled in the art will appreciate that implementing a machine learning model involves considerable complexity and learning effort, and that extensive domain expertise is needed to generate and compare multiple models before the best model for the intended goal can be selected. Automatic machine learning simplifies this by automating the construction of the entire machine learning pipeline with minimal human intervention. In addition, automatic machine learning trains noticeably faster on small and medium-sized datasets than on larger ones.

AutoGluon是一种流行的自动机器学习框架。这个由AWS开发的AutoML开源工具包有助于在各种机器学习和深度学习模型中获得强大的预测性能。AutoGluon is a popular automatic machine learning framework. This AutoML open source toolkit developed by AWS helps achieve strong predictive performance in various machine learning and deep learning models.

AutoGluon(参见doi:10.48550/arXiv.2003.06505，在此通过援引，将上述文献的全部内容合并到本公开中，使之成为本公开的内容的一部分)凭借其强大的特征工程处理能力、模型自动选择和自动组合和层堆叠技术、自动超参数搜索技术在不同自动机器(深度)学习框架测评中取得优异表现(参见doi:10.48550/arXiv.2111.02705)并可能有效避免过拟合问题，被本公开的实施例选定为模型自动学习框架。AutoGluon (see doi:10.48550/arXiv.2003.06505, the entire contents of which are incorporated herein by reference as a part of the present disclosure) has achieved excellent performance in the evaluation of different automatic machine (deep) learning frameworks (see doi:10.48550/arXiv.2111.02705) due to its powerful feature engineering processing capabilities, automatic model selection and automatic combination and layer stacking technology, and automatic hyperparameter search technology, and can effectively avoid overfitting problems. It has been selected as the model automatic learning framework in the embodiments of the present disclosure.

As shown in FIG. 4, in step S430, the protein solubility prediction model is constructed and trained using AutoGluon as the automated machine learning framework, with the deep learning models NN_Torch and FASTAI as candidate models.

More specifically, AutoGluon is run with the deep learning models NN_Torch and FASTAI as candidate models and with the Bayesian automatic hyperparameter selection strategy enabled, searching ten times and stacking at most three model layers. That is, the parameters actually used for the model can be set as follows:

problem_type="binary"

eval_metric='mcc'

presets='best_quality'

num_bag_folds=5

num_bag_sets=5

num_stack_levels=3

excluded_model_types=["GBM","CAT","XGB","RF","XT","KNN","LR"]

hyperparameter_tune_kwargs='bayes'

hyperparameters={"FASTAI":{},"NN_TORCH":{}}
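The settings listed above can be grouped the way AutoGluon's tabular interface expects them. The following is a minimal sketch, not the applicant's code. It assumes the `autogluon.tabular.TabularPredictor` interface, in which `problem_type` and `eval_metric` are constructor arguments and the remaining settings are passed to `fit`. The label column name `soluble` and the helper `train_solubility_model` are hypothetical. The value `'bayes'` is copied verbatim from the text, and the strings accepted for `hyperparameter_tune_kwargs` may differ between AutoGluon versions.

```python
# Settings copied from the parameter list above.
PREDICTOR_KWARGS = {
    "problem_type": "binary",
    "eval_metric": "mcc",
}
FIT_KWARGS = {
    "presets": "best_quality",
    "num_bag_folds": 5,
    "num_bag_sets": 5,
    "num_stack_levels": 3,
    "excluded_model_types": ["GBM", "CAT", "XGB", "RF", "XT", "KNN", "LR"],
    # Verbatim from the text; accepted values depend on the AutoGluon version.
    "hyperparameter_tune_kwargs": "bayes",
    "hyperparameters": {"FASTAI": {}, "NN_TORCH": {}},
}


def train_solubility_model(train_data, label="soluble"):
    """Hypothetical helper: fit an AutoGluon predictor with the settings above.

    `train_data` is assumed to be a pandas DataFrame with one row per protein,
    holding the concatenated feature vector plus a binary `label` column.
    """
    # Imported lazily so the settings can be inspected without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=label, **PREDICTOR_KWARGS)
    return predictor.fit(train_data, **FIT_KWARGS)
```

Keeping the two dictionaries separate mirrors the split between what defines the task (binary classification scored by MCC) and what controls the search (bagging, stacking, candidate models).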

During model construction, five-fold cross-validation is used (i.e., in each fold, 80% of the PSI:Biology dataset serves as the training set and the remaining 20% as the validation set). Whether each training-set protein is soluble serves as the classification label, and MCC is used as the optimization metric for evaluating the performance of the prediction model.
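The 80%/20% split per fold can be illustrated with a small, library-free sketch. This is only an illustration of five-fold cross-validation in general; the function name `five_fold_splits`, the shuffling seed and the sample count are assumptions, and an AutoML framework would normally perform this partitioning internally.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, val_idx


# With 100 samples, every fold trains on 80 and validates on 20.
splits = list(five_fold_splits(100))
```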

MCC (Matthews Correlation Coefficient) is a metric commonly used to evaluate the predictive performance of classification models. The literature recommends it as an optimization metric for binary classification models, and it performs particularly well when the sample classes are imbalanced. See, for example, Chicco, D., Jurman, G., "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation", BMC Genomics 21, 6 (2020); https://doi.org/10.1186/s12864-019-6413-7. Therefore, in a preferred embodiment of the present disclosure, MCC is selected as the optimization metric (i.e., as the performance metric for evaluating the classification model) in order to improve the generalization ability of the model.

As described above, during model construction, the concatenated feature vector of each sample protein is used as the input data and the solubility data of the sample protein (i.e., soluble or insoluble) is used as the output data of the prediction model to train the model. As the model parameters are progressively optimized, the output of the prediction model comes to agree with the solubility data of the sample proteins.

Those skilled in the art should understand that a prediction model constructed by, for example, the method of FIG. 4 can be used in the prediction method shown in FIG. 1. As mentioned above, although the protein solubility data of the training samples used during model construction and training indicate whether each protein is soluble, the result obtained when the prediction model is actually applied can be the probability that the protein is soluble.

The present disclosure uses three independent test sets: the NESG dataset serves as the independent test set for the classification problem, and the CamSol dataset and the eSol dataset serve as independent test sets for the regression problem.

Those skilled in the art should understand that, in the embodiments of the present disclosure, the features of the proteins in each independent test set are extracted using the same steps as shown in FIG. 3, and these features are input to the model, which outputs the prediction result of the protein solubility.

The results for the training set and the independent test sets are shown in Tables 3 to 6 below. The comparative data in Tables 3 to 5 are all taken from: Vineet Thumuluri, Hannah-Marie Martiny, Jose J Almagro Armenteros, Jesper Salomon, Henrik Nielsen, Alexander Rosenberg Johansen, "NetSolP: predicting protein solubility in Escherichia coli using language models", Bioinformatics, Volume 38, Issue 4, February 2022, Pages 941–946, https://doi.org/10.1093/bioinformatics/btab801.

Table 3: Test results on the training set (PSI:Biology dataset)

Table 4: Test results on the independent test set (NESG dataset)

Table 5: Test results on the independent test set (CamSol dataset)

Regarding Table 5, it should be noted that the independent CamSol test set contains 56 mutant sequences: 22 protein variants from Trevino, 3 from Miklos, 1 from Tan and 30 from Dudgeon. The values reported for the other prediction models and for the model of the present disclosure are the accuracy with which each model predicts the change in protein solubility after a point mutation.

Table 6: Test results on the independent test set (eSol dataset)

Those skilled in the art should understand that prediction performance on a regression problem is generally expressed by the Pearson correlation coefficient; the larger the value, the stronger the positive correlation between the predicted values and the true values.
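For reference, the Pearson correlation coefficient between predicted and measured solubility values can be computed as in the following sketch. The function name `pearson_r` and the sample numbers are illustrative assumptions and are not data from the present disclosure.

```python
import math


def pearson_r(predicted, observed):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_o)


# Illustrative values only: predictions that rise with the measurements
# give a coefficient close to +1.
r = pearson_r([0.2, 0.4, 0.6, 0.9], [0.1, 0.5, 0.6, 0.8])
```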

The metrics in Tables 3 to 5 are explained as follows:

Area under the curve (AUC) refers to the area under the receiver operating characteristic (ROC) curve.

Accuracy (ACC) refers to the proportion of correctly classified samples among all samples. It is calculated as ACC = (TP + TN) / (P + N). In general, the higher the accuracy, the better the classifier.

Precision refers to the proportion of correct predictions among all samples predicted as positive. It is calculated as Precision = TP / (TP + FP). The closer the value is to 1, the better the performance.

MCC (Matthews correlation coefficient) is a metric commonly used to evaluate the predictive performance of classification models. It describes the correlation between the actual classification and the predicted classification and is calculated as MCC = (TP*TN - FP*FN) / sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN)). MCC ranges from -1 to +1, where +1 indicates perfect prediction, 0 indicates prediction no better than random, and -1 indicates complete disagreement between prediction and observation.

Here, TP (True Positive) and TN (True Negative) denote the numbers of positive and negative samples that are classified correctly; FP (False Positive) denotes the number of negative samples incorrectly classified as positive, and FN (False Negative) the number of positive samples incorrectly classified as negative; P and N denote the total numbers of positive and negative samples.
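The three formulas above can be evaluated directly from the four confusion-matrix counts, as in the following sketch. The function name `classification_metrics` and the counts in the examples are illustrative assumptions. Returning 0 when a denominator is zero is a common convention and is not stated in the text.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute ACC, Precision and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn  # equals P + N
    acc = (tp + tn) / total
    # Convention (assumption): report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Precision": precision, "MCC": mcc}


# Balanced example: 45 of 50 positives and 45 of 50 negatives are correct.
balanced = classification_metrics(tp=45, tn=45, fp=5, fn=5)

# Imbalanced example: a classifier that labels every sample positive.
always_positive = classification_metrics(tp=90, tn=0, fp=10, fn=0)
```

Both examples have an accuracy of 0.9, but only the first has a high MCC, which illustrates why MCC is preferred when the classes are imbalanced.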

The combined results on the training set and the three independent test sets show that the prediction model constructed by the present invention not only has better classification performance but is also applicable to regression problems (such as predicting the effect of amino acid mutations on protein solubility). Because the datasets come from different sources and were produced under different experimental conditions (including culture medium, cell line, and protein expression and purification conditions), the performance of the trained model decreases somewhat when it is applied to other data. However, the workflow of the present invention is simple and easy to port: by retraining on experimental data generated under the specific experimental conditions of interest, the solubility prediction performance of the model can be further improved. The workflow can even be applied to problems such as predicting protein expression levels, protein folding types and folding rate constants.

Those skilled in the art will appreciate that a system for predicting protein solubility can be developed on the basis of the protein solubility prediction method of the present disclosure. FIG. 5 shows a schematic block diagram of a system for predicting protein solubility according to the present disclosure. Specifically, the system 500 for predicting protein solubility includes an acquisition module 510 and a processing module 520.

The acquisition module 510 is used to acquire a sample to be predicted, the sample to be predicted including a protein sequence of the sample to be predicted.

The processing module 520 is used to obtain a prediction result of the protein solubility of the sample to be predicted from the input protein sequence of the sample to be predicted. As shown in FIG. 5, the processing module 520 may further include: a first feature extraction submodule 521 for extracting features from the protein sequence of the sample to be predicted using a protein pre-trained model; a second feature extraction submodule 522 for extracting the primary structural features of the protein from the protein sequence of the sample to be predicted; a feature concatenation submodule 523 for concatenating the features extracted by the protein pre-trained model with the primary structural features of the protein to obtain the feature vector of the protein of the sample to be predicted; and a model prediction submodule 524 for inputting the feature vector of the protein of the sample to be predicted into the protein solubility prediction model based on the automated machine learning framework to obtain the prediction result of the protein solubility.

Those skilled in the art should understand that the system for predicting protein solubility described above can be implemented by a computer. For example, a computer program can be executed to perform the following operations: acquiring a sample to be predicted, the sample to be predicted including a protein sequence of the sample to be predicted; and obtaining a prediction result of the protein solubility of the sample to be predicted from the input protein sequence, this operation further including the following sub-operations: extracting features from the protein sequence of the sample to be predicted using a protein pre-trained model; extracting the primary structural features of the protein from the protein sequence of the sample to be predicted; concatenating the features extracted by the protein pre-trained model with the primary structural features of the protein to obtain the feature vector of the protein of the sample to be predicted; and inputting the feature vector of the protein of the sample to be predicted into the protein solubility prediction model to obtain the prediction result of the protein solubility of the sample to be predicted. That is, although each operation of the prediction method is described as a module or submodule in the system for predicting protein solubility according to the present disclosure, those skilled in the art should understand that such a module or submodule need not consist of circuit components or other physical components and may instead be a functional module built by a computer program.
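The following sketch shows one way the submodules 521 to 524 could be arranged as plain functions. It is an illustration under stated assumptions, not the implementation of the present disclosure: `toy_embed` and `toy_model` are hypothetical stand-ins for the protein pre-trained model and the trained prediction model, the embedding here has 4 dimensions instead of the 1024 produced per residue by ProtT5-XL-UniRef50, and amino acid composition (AAC) is the only primary structural feature computed.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mean_pool(matrix):
    """Column-wise mean of an L x D per-residue matrix, giving a D-vector."""
    length = len(matrix)
    return [sum(column) / length for column in zip(*matrix)]


def amino_acid_composition(sequence):
    """AAC descriptor: fraction of each of the 20 standard residues."""
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


def build_feature_vector(sequence, embed):
    """Submodules 521 to 523: embed, describe, then concatenate."""
    pooled = mean_pool(embed(sequence))          # 521: pre-trained model features
    primary = amino_acid_composition(sequence)   # 522: primary structural features
    return pooled + primary                      # 523: concatenation


def predict_solubility(sequence, embed, model):
    """Submodule 524: pass the feature vector to the prediction model."""
    return model(build_feature_vector(sequence, embed))


def toy_embed(sequence):
    """Hypothetical stand-in: one 4-dimensional vector per residue."""
    return [[float(ord(aa) % 7), 1.0, 0.0, float(i)] for i, aa in enumerate(sequence)]


def toy_model(features):
    """Hypothetical stand-in: returns a fixed solubility probability."""
    return 0.5


vector = build_feature_vector("MKTAYIAK", toy_embed)
probability = predict_solubility("MKTAYIAK", toy_embed, toy_model)
```

Passing the embedder and the model in as arguments keeps the modules independent, so a real pre-trained model or a retrained predictor could be substituted without changing the pipeline.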

In addition, those of ordinary skill in the art should recognize that the method of the present disclosure can be implemented as a computer program. As described above with reference to the accompanying drawings, the methods of the above embodiments are executed by one or more programs whose instructions cause a computer or processor to execute the algorithms described with reference to the drawings. These programs can be stored on various types of non-transitory computer-readable media and provided to a computer or processor. Non-transitory computer-readable media include various types of tangible storage media. Examples of non-transitory computer-readable media include magnetic recording media (such as floppy disks, magnetic tapes and hard disk drives), magneto-optical recording media (such as magneto-optical disks), CD-ROM (compact disc read-only memory), CD-R, CD-R/W, and semiconductor memories (such as ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM and RAM (random access memory)). Further, these programs can be provided to a computer via various types of transitory computer-readable media. Examples of transitory computer-readable media include electrical signals, optical signals and electromagnetic waves. Transitory computer-readable media can provide the programs to a computer through wired communication paths, such as electric wires and optical fibers, or through wireless communication paths.

For example, according to one embodiment of the present disclosure, a non-transitory computer-readable storage medium may be provided for storing a computer program, the computer program including instructions which, when executed by a processor of an electronic device, cause the electronic device to implement the protein solubility prediction method described above.

In addition, according to the present disclosure, a computer system may also be provided, the computer system comprising: a processor; a memory; and a computer program. The computer program is stored in the memory and is configured to be executed by the processor. The computer program includes instructions for implementing the protein solubility prediction method described above.

In another aspect, according to one embodiment of the present disclosure, a non-transitory computer-readable storage medium may be provided for storing a computer program, the computer program including instructions which, when executed by a processor of an electronic device, cause the electronic device to implement the method for constructing a protein solubility prediction model described above.

In addition, according to the present disclosure, a computer system may also be provided, the computer system comprising: a processor; a memory; and a computer program. The computer program is stored in the memory and is configured to be executed by the processor. The computer program includes instructions for implementing the method for constructing a protein solubility prediction model described above.

The embodiments of the present disclosure are not limited to those described above. Without departing from the spirit and scope of the present disclosure, those of ordinary skill in the art may make various changes and improvements to the present disclosure in form and detail, all of which are considered to fall within the scope of protection of the present disclosure.

Claims

A method for predicting protein solubility, characterized in that the method comprises:

extracting features from a protein sequence using a protein pre-trained model;

extracting primary structural features of the protein from the protein sequence;

concatenating the features extracted by the protein pre-trained model with the primary structural features of the protein to obtain a feature vector of the protein; and

inputting the feature vector of the protein into a protein solubility prediction model to obtain a prediction result of the protein solubility.

The method according to claim 1, characterized in that the protein pre-trained model is implemented by one or a combination of more of the following models: ESM-1b, UniRep, ProteinBert, TAPE, ProtGPT2, ProtTXL, ProtBert, ProtXLNet, ProtAlbert, ProtElectra, ProtT5-XL-BFD, ProtT5-XL-UniRef50 and ProtT5-XXL; preferably, the protein pre-trained model is ProtT5-XL-BFD or ProtT5-XL-UniRef50; more preferably, the protein pre-trained model is ProtT5-XL-UniRef50.

The method according to claim 1, characterized in that extracting features from the protein sequence using a protein pre-trained model comprises:

taking the protein sequence as input and using the ProtT5-XL-UniRef50 pre-trained model to extract the embedding layer vectors output by the encoder, wherein each amino acid corresponds to a 1024-dimensional vector and each sequence corresponds to an L×1024 feature matrix, L being the length of the amino acid sequence of the protein; and

averaging the feature matrix by column to obtain a 1024-dimensional feature vector.

The method according to claim 3, characterized in that averaging the feature matrix by column to obtain a 1024-dimensional feature vector further comprises:

using the Sentence-T5 encoder-only mean method to average, by column, the value of each dimension of the embedding layer features of all amino acids output by the encoder, so as to obtain a 1024-dimensional feature vector.

The method according to claim 1, characterized in that extracting the primary structural features of the protein from the protein sequence comprises:

taking the protein sequence as input and using the iFeature tool to extract the primary structural features of the protein.

The method according to claim 1, characterized in that the primary structural features of the protein are selected from one or more of AAC, DPC, DDE, TPC, CKSAAP, EAAC, GAAC, CKSAAGP, GDPC, GTPC, Moran, Geary, NMBroto, CTDC, CTDT, CTDD, CTriad, KSCTriad, SOCNumber, QSOrder, PAAC, APAAC, KNNprotein, PSSM, AAINDEX and BLOSUM62; preferably, the primary structural features of the protein include AAC, GAAC, Moran, Geary, NMBroto, CTDC, CTDT, CTDD, CTriad, PAAC and APAAC.

The method according to claim 1, characterized in that the protein solubility prediction model is constructed by the following steps:

acquiring a dataset, wherein the dataset includes a plurality of training samples, each of the training samples including a sample protein sequence and solubility data thereof;

extracting features from the sample protein sequence using a protein pre-trained model;

extracting the primary structural features of the protein from the sample protein sequence;

concatenating the features extracted by the protein pre-trained model with the primary structural features of the protein to obtain a feature vector of the sample protein; and

training a machine learning model with the feature vector of the sample protein as input data and the solubility data of the sample protein as output data, to obtain the final protein solubility prediction model.

The method according to claim 7, characterized in that training the machine learning model comprises:

constructing and training the protein solubility prediction model based on an automated machine learning framework, with one or more machine learning models selected as candidate models.

The method according to claim 8, characterized in that the automated machine learning framework is AutoGluon, and the candidate models are the deep learning models NN_Torch and FASTAI.

The method according to claim 1 or 7, characterized in that the method further comprises:

extracting other features of the protein from the protein sequence; and

concatenating the features extracted by the protein pre-trained model with the primary structural features of the protein and the other features of the protein to obtain a concatenated feature vector of the protein.

The method according to claim 10, characterized in that the other features of the protein are selected from one or more of molecular weight, aromaticity, instability index, flexibility, isoelectric point, molar extinction coefficient, grand average of hydropathy (GRAVY), the proportions of helix/turn/sheet in the secondary structure of the protein, secondary structure element content (SSEC), secondary structure element binary (SSEB), disorder, disorder content (DisorderC), disorder binary (DisorderB), accessible solvent accessibility (ASA), torsion angle (TA), Z-scale (ZSCALE) and 48 pseudo K-tuple reduced amino acid composition (48PseKRAAC); preferably, the other features include molecular weight, aromaticity, instability index, flexibility, isoelectric point, molar extinction coefficient, grand average of hydropathy and the proportions of helix/turn/sheet in the secondary structure of the protein.

A method for constructing a protein solubility prediction model, characterized in that the construction method comprises:

The features extracted by the protein pre-training model are spliced with the primary structure features of the protein to obtain a feature vector of the sample protein;

The method according to claim 12, characterized in that the protein pre-training model is implemented by a combination of one or more of the following models: ESM-1b, UniRep, ProteinBert, TAPE, ProtGPT2, ProtTXL, ProtBert, ProtXLNet, ProtAlbert, ProtElectra, ProtT5-XL-BFD, ProtT5-XL-UniRef50 and ProtT5-XXL; preferably, the protein pre-training model is ProtT5-XL-BFD or ProtT5-XL-UniRef50; more preferably, the protein pre-training model is ProtT5-XL-UniRef50.

The method according to claim 12, characterized in that the extracting features based on the sample protein sequence using a protein pre-training model comprises:

The sample protein sequence is used as input, and the ProtT5-XL-UniRef50 pre-trained model is used to extract the embedding layer vector output by the encoder, where each amino acid corresponds to a 1024-dimensional vector and each sequence corresponds to an L×1024-dimensional feature matrix, where L is the length of the amino acid sequence of the sample protein;

The method according to claim 14, characterized in that the step of averaging the feature matrix by columns to obtain a 1024-dimensional feature vector further comprises:

The method according to claim 12, characterized in that extracting the primary structural features of the protein from the sample protein sequence comprises:

taking the sample protein sequence as input and using the iFeature tool to extract typical primary structural features of the protein.

The method according to claim 12, characterized in that the primary structural features of the protein are selected from one or more of AAC, DPC, DDE, TPC, CKSAAP, EAAC, GAAC, CKSAAGP, GDPC, GTPC, Moran, Geary, NMBroto, CTDC, CTDT, CTDD, CTriad, KSCTriad, SOCNumber, QSOrder, PAAC, APAAC, KNNprotein, PSSM, AAINDEX and BLOSUM62; preferably, the primary structural features of the protein include AAC, GAAC, Moran, Geary, NMBroto, CTDC, CTDT, CTDD, CTriad, PAAC and APAAC.

The method according to claim 12, characterized in that the machine learning model is a deep learning model.

The method according to claim 12, characterized in that the training of the machine learning model comprises:

The method according to claim 19, characterized in that the automated machine learning framework is AutoGluon, and the candidate models are the deep learning models NN_Torch and FASTAI.

The method according to claim 12, characterized in that the method further comprises:

extracting other features of the protein from the sample protein sequence; and

concatenating the features extracted by the protein pre-trained model with the primary structural features of the protein and the other features of the protein to obtain a concatenated feature vector of the sample protein.

The method according to claim 21, characterized in that the other features of the protein are selected from one or more of molecular weight, aromaticity, instability index, flexibility, isoelectric point, molar extinction coefficient, grand average of hydropathy (GRAVY), the proportions of helix/turn/sheet in the secondary structure of the protein, SSEC, SSEB, Disorder, DisorderC, DisorderB, ASA, TA, ZSCALE and 48PseKRAAC; preferably, the other features include molecular weight, aromaticity, instability index, flexibility, isoelectric point, molar extinction coefficient, grand average of hydropathy and the proportions of helix/turn/sheet in the secondary structure of the protein.

A protein solubility prediction model constructed according to the method described in any one of claims 12 to 22.

A computer-implemented system for predicting protein solubility, wherein the computer-implemented system performs the following operations by executing a computer program:

acquiring a sample to be predicted, wherein the sample to be predicted includes a protein sequence of the sample to be predicted; and

obtaining a prediction result of the protein solubility of the sample to be predicted from the input protein sequence of the sample to be predicted, wherein this operation further includes the following sub-operations:

extracting features from the protein sequence of the sample to be predicted using a protein pre-trained model;

extracting the primary structural features of the protein from the protein sequence of the sample to be predicted;

concatenating the features extracted by the protein pre-trained model with the primary structural features of the protein to obtain a feature vector of the protein of the sample to be predicted; and

inputting the feature vector of the protein of the sample to be predicted into the protein solubility prediction model to obtain the prediction result of the protein solubility of the sample to be predicted.

A non-transitory computer-readable storage medium for storing a computer program, wherein the computer program includes instructions, which, when executed by a processor of an electronic device, cause the electronic device to implement the protein solubility prediction method as described in any one of claims 1 to 11 or the method for constructing a protein solubility prediction model as described in any one of claims 12 to 22.

A computer system, comprising:

a processor;

a memory; and

a computer program, wherein the computer program is stored in the memory and is configured to be executed by the processor, and the computer program includes instructions for implementing the protein solubility prediction method according to any one of claims 1 to 11 or the method for constructing a protein solubility prediction model according to any one of claims 12 to 22.