CN111816255A

CN111816255A - Fusion of multi-view and optimal multi-label chain learning for RNA-binding protein identification

Info

Publication number: CN111816255A
Application number: CN202010658127.8A
Authority: CN
Inventors: 邓赵红; 杨海涛; 吴敬; 王蕾; 王士同
Original assignee: Jiangnan University
Current assignee: Jiangnan University
Priority date: 2020-07-09
Filing date: 2020-07-09
Publication date: 2020-10-23
Anticipated expiration: 2040-07-09
Also published as: CN111816255B

Abstract

The invention belongs to the field of bioinformatics and relates to RNA-binding protein identification that fuses multi-view learning with optimal multi-label chain learning. The method has two parts: a training phase and a use phase. The training phase includes initial multi-view data construction, multi-view deep feature extraction model training, multi-label feature learning, and optimal multi-label chain classifier training. The views are the RNA sequence view, the amino acid sequence view, the multi-gap dipeptide composition view, and the RNA sequence semantic view. To make the multi-view features more effective, the invention builds deep multi-view features by applying CNN-based deep learning to the original multi-view data. To link the multi-view features with multi-label learning, the invention establishes a multi-label feature learning model that integrates the advantages of all views, and uses an optimal CC chain classifier, which differs from ordinary CC multi-label classifiers, to learn the correlations between labels, which improves classification accuracy more effectively.

Description

Fusion of multi-view and optimal multi-label chain learning for RNA-binding protein identification

Technical Field

The invention belongs to the field of bioinformatics and relates to RNA-binding protein identification that fuses multi-view learning with optimal multi-label chain learning.

Background

RNA (ribonucleic acid) is a carrier of genetic information found in biological cells and in some viruses and viroids. In living organisms it mainly regulates the expression of coding genes and also serves as the template for protein synthesis after gene transcription, making it an indispensable component of life. For an RNA to perform its function, it generally needs to be mediated by an RNA-binding protein (RBP). The lack of a particular RBP may therefore prevent a class of RNA from performing its regulatory or translational function, leaving the organism short of certain important proteins or causing certain proteins to proliferate abnormally, which impairs the organism's function.

RNA-binding proteins (RBPs) are key players in post-transcriptional events; the versatility and structural flexibility of their domains enable RBPs to control the metabolism of a large number of transcripts. RBPs are involved in almost every step of post-transcriptional regulation. They establish highly dynamic interactions with other proteins and with coding and non-coding RNAs, forming functional units called ribonucleoprotein complexes that regulate RNA splicing, polyadenylation, stability, localization, translation, and degradation. Studies have found that certain RBPs regulate the RNA-directed synthesis of oncoproteins and tumor suppressor proteins. Deciphering the intricate binding network between RBPs and their cancer-associated RNA targets will therefore provide a better understanding of tumor biology and may lead to new approaches to cancer treatment.

Although big data and sequencing technology are highly developed, it is difficult for the biomedical industry to test binding for every RNA-RBP pair experimentally, so many algorithms have emerged that use machine learning models to identify RBP binding sites from RNA sequences. For example, Maticzka et al. proposed GraphProt, a computational framework that learns the sequence and structure binding preferences of RBPs from high-throughput experimental data; Corrado et al. proposed RNACommender, a method for predicting binding sites that recommends RNA targets to unexplored RBPs using available interaction information and taking into account protein structure and the simulated secondary structure of RNA; and HOCNNLB, proposed by Zhang et al., uses high-order nucleotide encoding as initial features to predict whether a given RNA segment is a binding site. These methods focus on using sequence or structural features of the original RNA sequence to decide whether a given RNA fragment is a binding site for a specific RBP, but few methods use the known binding information between RNAs and RBPs to aid prediction. To address this, Pan et al. proposed iDeepM, which uses multi-label classification and deep learning to find multiple RBPs that can bind a given RNA, and achieves the expected effect of multi-label classification. However, iDeepM has the following shortcomings. The single-view RNA sequence data it uses is effective for prediction to some degree, but the limited information content of the RNA sequence results in low accuracy. In addition, the method performs multi-label classification with a convolutional neural network and a long short-term memory network, which fails to fully learn the relationships between labels and likewise reduces prediction accuracy.

Summary of the Invention

The invention realizes RNA-binding protein identification that fuses multi-view learning with optimal multi-label chain learning. The method includes two parts, a training phase and a use phase. The training phase includes an initial multi-view feature construction model, a deep multi-view feature extraction model, multi-label feature learning, and optimal multi-label chain classifier training.

Training phase: The initial multi-view feature construction model uses principles of molecular biology, principles of statistics, and Word2Vec (word-to-vector) technology to convert the original RNA sequence into an amino acid sequence, a multi-gap dipeptide composition, and an RNA sequence semantic matrix, capturing the order, composition, and semantic features of the sequence. Together with the original RNA sequence, these form the initial multi-view features. The deep multi-view feature extraction model builds four convolutional neural networks and trains them on the four initial view features to obtain deep multi-view features with better discriminative power. The extracted deep features are used to train the multi-label feature learning model, and the label-related weighted feature vectors learned by that model are used to train the optimal CC multi-label chain classifier, which learns the associations between labels and yields a model able to identify RNA-binding proteins.

Use phase: Obtain the RNA sequence to be tested and construct its initial multi-view features using principles of molecular biology, principles of statistics, and Word2Vec (word-to-vector) technology. Next, use the four trained convolutional neural networks to extract the deep features of the four views. Then use the trained multi-label feature learning model to weight the concatenated deep features of the four views, obtaining the label-related weighted feature vectors. Finally, use the trained optimal CC multi-label chain classifier to predict from the label-related weighted feature vectors and obtain the final prediction.

The described RNA-binding protein identification combines multi-view deep feature learning, multi-label feature learning, and multi-label learning. The deep hierarchical structure of deep learning optimizes the feature representation; multi-label feature learning fuses and corrects the deep features of each view, makes full use of the strengths of each view, and builds label-related weighted feature vectors; and the multi-label technique effectively exploits both the independence of each label and the correlations between labels. Combining these three techniques extracts the useful information in RNA sequences more fully and improves the generalization ability of the classifier.

An RNA sequence is biological genetic material described as a sequence of characters. A deep convolutional model cannot process character data directly, so the RNA character sequence must first be preprocessed into a numerical form that the program can accept. One-hot encoding is a widely used encoding technique: a character sequence of length m over n kinds of elements is built into an n*m matrix, in which each element is converted into an n-dimensional standard basis vector and placed at its corresponding position along the length m. For an RNA sequence of length m, one-hot encoding constructs an initial blank matrix of size 4*m, converts each base into a 4-dimensional standard basis vector, and fills it in at the corresponding position of the sequence, as shown in Figure 7. The row header is a specific RNA sequence with an actual length of 2700. According to the position of each base in the column header, base A is represented as the vector (1,0,0,0)^T, base C as (0,1,0,0)^T, base G as (0,0,1,0)^T, and base U as (0,0,0,1)^T.
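As a minimal sketch of the one-hot construction described above, the following Python builds the 4*m matrix for a short sequence. The function name `one_hot_rna` is hypothetical, and the row order A, C, G, U is taken from the vectors given in the text.

```python
BASES = "ACGU"  # row order follows the vectors in the text: A, C, G, U


def one_hot_rna(seq):
    """Return a 4 x m matrix (list of rows) holding exactly one 1 per column."""
    matrix = [[0] * len(seq) for _ in BASES]
    for col, base in enumerate(seq):
        # str.index raises ValueError for a character outside A, C, G, U
        matrix[BASES.index(base)][col] = 1
    return matrix


example = one_hot_rna("ACGU")
for row in example:
    print(row)
```

A character outside A, C, G, U raises `ValueError` in this sketch, since the text does not say how such characters are handled.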

Although the initial feature matrix constructed by the above method helps feature extraction, it carries relatively little information. An amino acid sequence is composed of 20 kinds of amino acids and is much richer in information than an RNA sequence, so a one-hot encoding matrix obtained from the amino acid sequence gives better results for feature extraction. Translating an RNA sequence into an amino acid sequence is one-way and deterministic, but because one amino acid can correspond to several base combinations, the resulting amino acid sequence cannot be restored to the original RNA sequence, which loses and distorts information. For example, the base combination GCA always translates to amino acid A, but amino acid A can be encoded by GCA, GCC, GCG, or GCU. To address this, three ways of translating the RNA sequence into an amino acid sequence are used: the first form translates from the start, the second form starts translating after skipping the first base, and the third form starts after skipping the first and second bases. In this way the RNA sequence of length m is converted into three amino acid sequences of length m/3 each, and the three forms complement one another so that the original RNA sequence information can be recovered. For instance, the base combination GCA above can be uniquely determined from the amino acids R, A, and H at the corresponding positions of the three forms. The three amino acid sequences are therefore concatenated into one long amino acid chain of length m, which fully inherits the sequence information of the original RNA sequence and has a richer representation. One-hot encoding this long chain, on the same principle as for the RNA sequence, gives an initial feature matrix of size 20*m, as shown in Figure 8; this is the amino acid view data proposed by the present invention. The row header is a specific amino acid sequence with an actual length of 2700. According to the positions of the amino acids in the column header, every amino acid in the row sequence is represented as a 20-dimensional standard basis vector.
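The three translation forms can be illustrated as follows. This is a sketch under stated assumptions rather than the patented implementation: it uses the standard genetic code, leaves stop codons as `*` (how they relate to the temporary amino acid O mentioned in the next paragraph is not specified here), and applies no padding, so a sequence of length m yields a chain of m-2 residues rather than exactly m. The names `translate` and `three_form_chain` are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons enumerated in UCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, offset):
    """Translate seq codon by codon, starting at base position offset (0, 1 or 2)."""
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def three_form_chain(seq):
    """Concatenate the first, second and third translation forms (assumed order)."""
    return "".join(translate(seq, offset) for offset in range(3))


# The codon GCA sits in the middle of CGCAC; the three forms read CGC, GCA and CAC.
print(three_form_chain("CGCAC"))
```

In this example the three forms give R, A, and H, matching the combination the text uses to pin down GCA.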

The RNA view and amino acid view data described above both favor features of sequence order, but the composition of a sequence is as important as its order. Because 0-gap dipeptides reflect the composition of the two-dimensional sequence, while 1-gap dipeptides carry three-dimensional structural composition information, using 0-gap and 1-gap dipeptides together works best for extracting the composition information of the RNA sequence. The present invention combines them to extract sequence composition, forming the multi-gap dipeptide composition view. Because a dipeptide is sensitive to the left-right order of its amino acids, the 21 amino acids used in the present invention (the 20 natural amino acids plus the temporary amino acid O added by the invention) give 21*21*2 = 882 multi-gap dipeptide types. The combinations OO and O*O have little significance for this study and are discarded, leaving 880. Counting the occurrences of these 880 multi-gap dipeptides gives a feature vector that effectively captures the composition information of the amino acid sequence and the RNA sequence, as well as the spatial composition of the amino acids. An 880-dimensional feature vector is one-dimensional and works poorly for deep feature extraction, so it is converted into a two-dimensional bar chart, from which a machine learning model can extract deep features more effectively, as shown in Figure 9. The horizontal axis of the table in the upper part of the figure is the multi-gap dipeptide type: "AA" denotes the 0-gap dipeptide with alanine on both sides, and 18 is its count in the sample amino acid sequence; "A*D" denotes the 1-gap dipeptide with alanine on the left, any one amino acid in the middle, and aspartic acid on the right. The figure lists only 12 multi-gap dipeptides; the actual number is 880. The lower part of the figure is the converted bar chart. The count of each multi-gap dipeptide is capped at 30, so a 30*880 matrix is taken as the initial multi-gap dipeptide data for the amino acid sequence.
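The counting and bar-chart steps can be sketched as below. The amino-acid ordering, the key format (`AD` for 0-gap, `A*D` for 1-gap), and the orientation of the bars (filled from the bottom row upward) are assumptions made for illustration; only the totals (880 types, cap of 30) come from the text.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYO"  # 20 natural amino acids plus the temporary 'O'

# 0-gap keys look like "AD", 1-gap keys like "A*D"; "OO" and "O*O" are discarded.
PAIRS = [a + b for a, b in product(AMINO_ACIDS, repeat=2) if a + b != "OO"]
KEYS = PAIRS + [p[0] + "*" + p[1] for p in PAIRS]


def dipeptide_counts(protein):
    """Count every 0-gap and 1-gap dipeptide; returns a list aligned with KEYS."""
    counts = dict.fromkeys(KEYS, 0)
    for i in range(len(protein) - 1):
        key = protein[i] + protein[i + 1]
        if key in counts:
            counts[key] += 1
    for i in range(len(protein) - 2):
        key = protein[i] + "*" + protein[i + 2]
        if key in counts:
            counts[key] += 1
    return [counts[k] for k in KEYS]


def to_bar_matrix(counts, height=30):
    """Column j holds min(counts[j], height) ones, stacked upward from the bottom row."""
    return [[1 if min(c, height) >= height - row else 0 for c in counts]
            for row in range(height)]


vector = dipeptide_counts("AAD")
matrix = to_bar_matrix(vector)
print(len(KEYS), len(matrix), len(matrix[0]))
```

Counts above the cap are clipped, so every sample maps to a matrix of the same 30*880 size regardless of sequence length.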

Natural language processing (NLP) is an important direction in computer science and artificial intelligence. In terms of raw data, bioinformatics and NLP start from data of the same form, so NLP methods can be used for text encoding and initial feature construction in bioinformatics. The present invention uses 6-mer RNA as the vocabulary for training the semantic model. A 6-mer RNA is a structure of 6 consecutive bases, so the vocabulary consists of 4^6 = 4096 kinds of 6-mer RNA. The present invention builds the semantic model with the widely used Word2Vec technique, whose principle is shown in Figure 10. For each of the 92102 RNA sequences in the dataset used by the invention, the following operations are performed: 1) slide a window of 6 bases over the RNA sequence to obtain the order of its 6-mer RNAs; 2) encode each 6-mer RNA by its position among the 4096 forms (with 'AAAAAA' as 1 and 'UUUUUU' as 4096); 3) take each pair of adjacent 6-mer RNAs as feature X and label Y and feed them to the semantic model for training; 4) extract the word vector of each of the 4096 6-mer RNAs from the trained semantic model; 5) replace each 6-mer RNA in the RNA sequence with its word vector to build the RNA sequence semantic matrix. The RNA sequence semantic matrix built from 6-mer RNA word vectors has a small dimension and also contains the order and context structure of the RNA sequence with 6-base motifs, which makes it better suited to deep feature learning.
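Operations 1) to 3) above can be sketched in plain Python. The text fixes only the endpoints of the numbering ('AAAAAA' as 1, 'UUUUUU' as 4096); ordering the intermediate 6-mers as base-4 numbers with A < C < G < U is an assumption, as is reading "adjacent" as consecutive sliding-window positions. The function names are hypothetical, and the Word2Vec training itself is not shown.

```python
BASES = "ACGU"


def kmers(seq, k=6):
    """Slide a window of k bases over seq and return the k-mers in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_index(kmer):
    """1-based position of a k-mer when k-mers are read as base-4 numbers (A=0 ... U=3)."""
    value = 0
    for base in kmer:
        value = value * 4 + BASES.index(base)
    return value + 1


def training_pairs(seq, k=6):
    """Pair each k-mer code (feature X) with the code of the next k-mer (label Y)."""
    ids = [kmer_index(w) for w in kmers(seq, k)]
    return list(zip(ids, ids[1:]))


print(kmer_index("AAAAAA"), kmer_index("UUUUUU"))
print(training_pairs("ACGUACGU"))
```

A sequence of length m yields m-5 six-mers and m-6 training pairs under this reading.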

The specific steps in this part are as follows:

Step 1: Use the one-hot matrix of the original RNA sequence as the RNA initial feature X¹.

Step 2: Use principles of molecular biology and the one-hot method to convert the original RNA sequence into the amino acid sequence initial feature X².

Step 3: Use principles of statistics to convert the amino acid sequence into the multi-gap dipeptide composition initial feature X³.

Step 4: Use Word2Vec to train the RNA sequence semantic model, obtain the 6-mer RNA word vectors, and assemble the RNA sequence semantic matrix as the RNA sequence semantic initial feature X⁴. This gives the preliminary multi-view dataset D={X¹,X²,X³,X⁴,y}.

The deep multi-view feature extraction part of the present invention uses convolutional neural networks to extract the features of each view of the RNA sequence automatically. Preprocessing the original RNA sequence gives RNA sequence features, amino acid sequence features, multi-gap dipeptide composition features, and RNA sequence semantic features. Four different convolutional neural networks are built, one for each of these four views, to extract deep features from each view automatically.

During training, the CNN computes the error from the output of the final layer and back-propagates it to learn the network. Because the feature vector computed by the penultimate layer passes through only one fully connected layer before reaching the output layer, optimizing the network against the output layer also optimizes the representation produced by the penultimate layer; that is, the network learns a better feature representation as it trains. The output of the penultimate layer is therefore taken as the features learned by the network. Features learned automatically by a convolutional neural network have a lower dimension than the original features and are nonlinear combinations with better discriminative power, which gives the downstream classification model better generalization.

Figures 11, 12, 13, and 14 show the CNN architectures used for deep feature extraction from the four views. The feature maps of each layer are written k@m*n, where k is the number of feature maps in the layer and m*n is the size of each feature map. The two-dimensional convolution kernels are written k*m*n, where k is the number of kernels and m*n is the kernel size. The stride of the convolution kernels defaults to 1. The input of each network is the feature of one view, and the output is a vector of length 68 describing the binding status of the RNA sequence. Each of the first 67 dimensions equals 1 if the sample can bind the RBP of that dimension and 0 otherwise; the 68th dimension equals 1 if the sample RNA sequence cannot bind any of the first 67 RBPs and 0 otherwise.

Figure 11 shows the CNN architecture used for deep feature extraction from the RNA view. It comprises 1 two-dimensional convolutional layer, 1 pooling layer, 1 flatten layer, 2 dropout layers, and 2 fully connected layers. The network input is a 4*2710 two-dimensional matrix. The first layer is a convolutional layer with 101 kernels of size 4*10, producing 101 feature maps of size 1*2701; the second layer is a pooling layer with pooling length 3, producing 101 feature maps of size 1*900; the third layer is a flatten layer, producing one 1*90900 feature map; the fourth layer is a dropout layer with probability 0.5, producing one 1*90900 feature map; the fifth layer is a fully connected layer that converts the 1*90900 feature map into a 1*202 vector; the sixth layer is a dropout layer with probability 0.5, producing one 1*202 feature map; the seventh layer is a fully connected layer that converts the 1*202 feature map into a 1*68 vector.

Figure 12 shows the CNN architecture used for deep feature extraction from the amino acid view. It comprises 1 two-dimensional convolutional layer, 1 pooling layer, 1 flatten layer, 2 dropout layers, and 2 fully connected layers. The input is a 20*2710 two-dimensional matrix. The first layer is a convolutional layer with 101 kernels of size 20*10, producing 101 feature maps of size 1*2701; the second layer is a pooling layer with pooling length 3, producing 101 feature maps of size 1*900; the third layer is a flatten layer, producing one 1*90900 feature map; the fourth layer is a dropout layer with probability 0.5, producing one 1*90900 feature map; the fifth layer is a fully connected layer that converts the 1*90900 feature map into a 1*202 vector; the sixth layer is a dropout layer with probability 0.5, producing one 1*202 feature map; the seventh layer is a fully connected layer that converts the 1*202 feature map into a 1*68 vector.

Figure 13 shows the CNN architecture used for deep feature extraction from the multi-gap dipeptide view. It comprises 1 two-dimensional convolutional layer, 1 flatten layer, 2 dropout layers, and 2 fully connected layers. The network input is a 30*880 two-dimensional matrix. The first layer is a convolutional layer with 101 kernels of size 30*10, producing 101 feature maps of size 1*871; the second layer is a flatten layer, producing one 1*87971 feature map; the third layer is a dropout layer with probability 0.5, producing one 1*87971 feature map; the fourth layer is a fully connected layer that converts the 1*87971 feature map into a 1*202 vector; the fifth layer is a dropout layer with probability 0.5, producing one 1*202 feature map; the sixth layer is a fully connected layer that converts the 1*202 feature map into a 1*68 vector.

Figure 14 shows the CNN architecture used for deep feature extraction from the RNA sequence semantic view. It comprises 1 two-dimensional convolutional layer, 1 pooling layer, 1 flatten layer, 2 dropout layers, and 2 fully connected layers. The input is a 25*2710 two-dimensional matrix. The first layer is a convolutional layer with 101 kernels of size 25*10, producing 101 feature maps of size 1*2701; the second layer is a pooling layer with pooling length 3, producing 101 feature maps of size 1*900; the third layer is a flatten layer, producing one 1*90900 feature map; the fourth layer is a dropout layer with probability 0.5, producing one 1*90900 feature map; the fifth layer is a fully connected layer that converts the 1*90900 feature map into a 1*202 vector; the sixth layer is a dropout layer with probability 0.5, producing one 1*202 feature map; the seventh layer is a fully connected layer that converts the 1*202 feature map into a 1*68 vector.
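The layer sizes quoted for the four networks can be checked with a few lines of arithmetic. The sketch assumes unpadded convolution with stride 1 and non-overlapping pooling that discards any remainder, which is what the quoted numbers imply; the helper names are hypothetical. For the multi-gap dipeptide network the width is taken as 880, the value consistent with the 30*880 matrix described earlier and with the 1*871 convolution output.

```python
def conv_width(width, kernel, stride=1):
    """Output width of an unpadded convolution along the sequence axis."""
    return (width - kernel) // stride + 1


def pooled_width(width, pool):
    """Output width of non-overlapping pooling; a trailing remainder is dropped."""
    return width // pool


def flattened_size(input_width, kernel_width=10, filters=101, pool=None):
    """Length of the flatten layer when the kernel height equals the input height."""
    width = conv_width(input_width, kernel_width)
    if pool is not None:
        width = pooled_width(width, pool)
    return filters * width


# RNA, amino acid and semantic views: width 2710, pooling length 3
print(conv_width(2710, 10), pooled_width(conv_width(2710, 10), 3), flattened_size(2710, pool=3))
# Multi-gap dipeptide view: width 880, no pooling layer
print(conv_width(880, 10), flattened_size(880))
```

Because each kernel spans the full height of its input (4, 20, 30, or 25 rows), every feature map has height 1 and only the width needs tracking.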

The last layer of each of the four networks uses the sigmoid function as its activation function to introduce a nonlinear transformation. The sigmoid function is expressed as follows:

S(x)=1/(1+e^(-x))

All other layers use the ReLU function as their activation function. The ReLU function is expressed as follows:

R(x)=max(0,x)

The loss function of the network is the binary cross-entropy (binary_crossentropy) loss, whose standard form is:

L=-Σᵢ[p(x_i)·log q(x_i)+(1-p(x_i))·log(1-q(x_i))]

Here p(x_i) and q(x_i) both denote the membership of sequence x in category i: p is the true label value, 1 or 0, and q is the predicted value. Because the output is activated by the sigmoid function, q∈(0,1).
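The two activation functions and the loss can be written out numerically as a small sketch. Averaging the loss over the labels is an assumption here (a plain sum differs only by a constant factor), and the function names are illustrative.

```python
import math


def sigmoid(x):
    """Squash a real value into (0, 1); used on the output layer."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """R(x) = max(0, x); used on all other layers."""
    return max(0.0, x)


def binary_cross_entropy(p, q):
    """Mean over labels of -[p*log(q) + (1-p)*log(1-q)], with p in {0, 1} and q in (0, 1)."""
    terms = [-(pi * math.log(qi) + (1 - pi) * math.log(1 - qi))
             for pi, qi in zip(p, q)]
    return sum(terms) / len(terms)


true_labels = [1, 0, 0]
predicted = [sigmoid(2.0), sigmoid(-1.0), sigmoid(0.0)]
print(binary_cross_entropy(true_labels, predicted))
```

The loss shrinks toward 0 as each predicted membership q approaches its true label p.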

Step 1: Train the RNA sequence deep feature extraction network with X¹ and y, and take the output of the penultimate layer of the CNN used for the RNA view as the RNA sequence deep feature.

Step 2: Train the amino acid sequence deep feature extraction network with X² and y, and take the output of the penultimate layer of the CNN used for the amino acid view as the amino acid sequence deep feature.

Step 3: Train the multi-gap dipeptide composition deep feature extraction network with X³ and y, and take the output of the penultimate layer of the CNN used for the multi-gap dipeptide view as the multi-gap dipeptide composition deep feature.

Step 4: Train the RNA sequence semantic deep feature extraction network with X⁴ and y, and take the output of the penultimate layer of the CNN used for the RNA sequence semantic view as the RNA sequence semantic deep feature.

This gives the multi-view dataset of deep features.

The present invention uses a multi-view-based optimal multi-label chain learning algorithm consisting of two parts: multi-label feature learning and optimal multi-label chain learning. The classifier chain (CC) algorithm is a multi-label classification algorithm that efficiently learns the correlations between labels. It builds one binary classifier per label; each time a binary classifier has been trained, the label it predicts is appended to the initial features, and the result serves as the input for training the next binary classifier, until all classifiers have been trained. Unlike existing methods, the present invention improves the single-view CC algorithm and applies it to the multi-view setting, bringing the advantages of multi-view data into the CC algorithm so that it learns the correlations between labels better. The principle is shown in Figure 15.

First, the deep feature vectors of each view are obtained from the upstream CNN models, concatenated, and used to train the multi-label feature learning model. The input of this model is an 808-dimensional vector and its output is a 68-dimensional result corresponding to the 68 labels. Training this model yields 68 sets of 808-dimensional weight coefficients, each giving the contribution of every input dimension to the prediction of one label. The 808-dimensional feature vector is multiplied by each of the 68 sets of weight coefficients in turn, giving 68 weighted feature vectors that are used to train the downstream CC multi-label classifier.

The CC multi-label classifier in this experiment consists of 68 binary classifiers, which predict whether an RNA belongs to each of the 68 labels. First, the weighted feature vector x₁ is obtained from the multi-label feature learning module and used as the input feature to train the first binary classifier. The first label value it predicts is appended to the end of the weighted feature vector x₂, which is then used to train the second binary classifier. This process is repeated until the last binary classifier has been trained. Unlike the traditional CC multi-label classifier, the optimal CC multi-label classifier proposed by the present invention is characterized in that, after the i-th binary classifier has been trained, all label values predicted so far are appended to the end of the weighted feature vector xᵢ₊₁ associated with the next label, and the (i+1)-th binary classifier is trained on the result. This retains the ability of the CC algorithm to learn label correlations while also bringing the advantages of multi-view data into the training of each sub-classifier, combining the strengths of the multi-view and multi-label approaches. The training and prediction procedures of the optimal CC multi-label classifier are given in Algorithms 1 and 2.
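The label-specific weighting described above can be illustrated with a minimal sketch. It assumes that "multiplying" the feature vector by a set of weight coefficients means an element-wise product, which the text does not state explicitly; the function name `weighted_feature_vectors` and the random placeholder values are hypothetical and stand in for the learned coefficients and the real deep features.

```python
import random

N_FEATURES = 808  # concatenated deep features of the four views
N_LABELS = 68     # one label per class


def weighted_feature_vectors(x, weights):
    """Return one re-weighted copy of x per label.

    Assumption: each weighted vector is the element-wise product of x
    with the weight coefficients learned for that label.
    """
    return [[w_j * x_j for w_j, x_j in zip(w, x)] for w in weights]


rng = random.Random(0)
# Placeholder deep feature vector (would come from the upstream CNNs).
x = [rng.random() for _ in range(N_FEATURES)]
# Placeholder coefficients (would come from the multi-label feature model).
weights = [[rng.gauss(0.0, 1.0) for _ in range(N_FEATURES)]
           for _ in range(N_LABELS)]

weighted = weighted_feature_vectors(x, weights)
print(len(weighted), len(weighted[0]))
```

Each of the 68 resulting vectors keeps the full 808 dimensions, so every sub-classifier in the chain sees all views, only scaled differently per label.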

Step 1: Concatenate the deep feature vectors of the four views to form a single feature vector, and use it together with y to train the multi-label feature learning model, obtaining the weighted feature vectors associated with the 68 labels

Step 2: Use the weighted feature vector of the first label and y¹ to train the first binary classifier of the optimal CC chained multi-label classifier;

Step 3: Append the label predicted in the previous step to the weighted feature vector of the second label, and use the extended vector and y² to train the second binary classifier of the optimal CC chained multi-label classifier;

Step 4: Append the labels predicted in the previous steps to the weighted feature vector of the third label, and use the extended vector and y³ to train the third binary classifier of the optimal CC chained multi-label classifier, and so on, until the 68th binary classifier has been trained.
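The chained training steps above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the sub-classifier is a tiny logistic regression written here for self-containment (the source does not name the binary classifier used), the toy data has 3 labels and 4 features instead of 68 and 808, and all function names are hypothetical.

```python
import math
import random


def predict_proba(w, row):
    """Sigmoid of the linear score; the bias is stored as the last weight."""
    z = w[-1] + sum(wj * v for wj, v in zip(w, row))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=200, lr=0.5):
    """Fit a small logistic regression by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, target in zip(X, y):
            err = predict_proba(w, row) - target
            for j, v in enumerate(row):
                grad[j] += err * v
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def train_optimal_chain(weighted_X, Y):
    """weighted_X[k][n]: weighted features of sample n for label k.
    Y[n][k]: 0/1 target. Classifier k is trained on weighted_X[k]
    extended with the labels predicted by classifiers 0..k-1."""
    chain = []
    previous = [[] for _ in Y]  # labels predicted so far, per sample
    for k, X_k in enumerate(weighted_X):
        inputs = [x + prev for x, prev in zip(X_k, previous)]
        w = train_logistic(inputs, [row[k] for row in Y])
        chain.append(w)
        preds = [1.0 if predict_proba(w, r) >= 0.5 else 0.0 for r in inputs]
        previous = [prev + [p] for prev, p in zip(previous, preds)]
    return chain


rng = random.Random(0)
n_samples, n_features, n_labels = 20, 4, 3
base = [[rng.random() for _ in range(n_features)] for _ in range(n_samples)]
Y = [[1 if row[k] > 0.5 else 0 for k in range(n_labels)] for row in base]
label_w = [[rng.random() for _ in range(n_features)] for _ in range(n_labels)]
weighted_X = [[[w * v for w, v in zip(label_w[k], row)] for row in base]
              for k in range(n_labels)]

chain = train_optimal_chain(weighted_X, Y)
print([len(w) for w in chain])
```

The input of each sub-classifier grows by one dimension per link in the chain, which is how earlier label predictions reach later classifiers while each classifier still starts from its own label-specific weighted features.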

In the use phase of the method, the specific steps are as follows:

Step 1: Apply the initial multi-view feature construction model to the test data to build the preliminary multi-view test dataset

Step 2: Use the deep multi-view feature extraction models to obtain the deep multi-view test dataset

Step 3: Concatenate the deep features of the views to form a single feature vector, and input it to the trained multi-label feature model to obtain the weighted feature vectors

Step 4: Input the weighted feature vectors into the trained optimal CC multi-label chain classifier to obtain all predicted label values
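The final prediction step can be sketched in the same spirit. The example below assumes linear sub-classifiers with a 0.5 probability threshold (equivalent to a linear score of at least 0) and uses hand-set weights for three labels purely for illustration; `predict_chain` is a hypothetical name, and none of the numbers come from the patent.

```python
def predict_chain(chain, weighted_x):
    """Predict all labels for one sample.

    chain[k]: weights of classifier k, bias stored last.
    weighted_x[k]: weighted feature vector of the sample for label k.
    Classifier k receives weighted_x[k] plus the labels predicted so far.
    """
    predicted = []
    for w, x_k in zip(chain, weighted_x):
        row = list(x_k) + predicted
        score = w[-1] + sum(wj * v for wj, v in zip(w, row))
        # sigmoid(score) >= 0.5 exactly when score >= 0
        predicted.append(1.0 if score >= 0.0 else 0.0)
    return [int(p) for p in predicted]


# Hand-set illustrative weights: 2 features per label, 3 labels.
chain = [
    [1.0, -1.0, 0.0],             # label 0 fires when x0 >= x1
    [0.0, 0.0, 2.0, -1.0],        # label 1 fires only if label 0 fired
    [0.0, 0.0, -2.0, 0.0, 1.0],   # label 2 fires only if label 0 did not
]

sample_a = [[0.9, 0.2], [0.5, 0.5], [0.1, 0.3]]
sample_b = [[0.1, 0.8], [0.5, 0.5], [0.1, 0.3]]
print(predict_chain(chain, sample_a))  # [1, 1, 0]
print(predict_chain(chain, sample_b))  # [0, 0, 1]
```

The two samples differ only in the features seen by the first classifier, yet all three predicted labels change, which shows how label dependencies propagate along the chain.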

The advantages of the present invention include the following:

1) Construction of initial multi-view RNA sequence features: There are many ways to construct features from RNA sequences; features constructed in different ways are each effective to some degree, and each has its own strengths and weaknesses. Using multi-view features to extract features from RNA sequences and to identify the RNA-binding proteins that can bind to them combines the strengths of the different feature construction methods. The present invention uses the amino acid sequence representation, which expresses the order and context of the RNA sequence well; multi-gap dipeptide data, which expresses the composition and structural information of the RNA sequence well; and RNA sequence semantic data, which expresses the semantic information of the RNA sequence well, to construct the initial multi-view features, so that the views complement one another from several different aspects.

2) Construction of deep multi-view features: To improve the effectiveness of the multi-view features, a CNN is used to perform deep learning on the initial multi-view data and construct deep multi-view features. Compared with the original multi-view features, the deep multi-view features have lower dimensionality and give better classification results;

3) Construction of the multi-label feature learning model: Multi-label feature learning is used to integrate the learned deep features of the multiple views and, based on the principle of logistic regression, to adjust the features for each label, yielding multi-label features that are better suited to training a multi-label classifier;

4) Construction of the optimal chained multi-label classifier: The CC multi-label classifier is improved by performing multi-label learning on the adjusted weighted feature vectors produced by the multi-label feature learning model, yielding a multi-label classifier with better generalization ability for RNA-binding protein identification.

Description of the Drawings

Figure 1 is a framework diagram of the algorithm of the present invention.

Figure 2 is a framework diagram of the algorithm for obtaining the initial feature data of the different views.

Figure 3 is a framework diagram of the multi-view deep feature learning algorithm.

Figure 4 is a framework diagram of the multi-label feature learning algorithm.

Figure 5 is a framework diagram of the multi-view learning algorithm.

Figure 6 is a framework diagram of the RNA-binding protein identification algorithm.

Figure 7 shows the one-hot matrix data of an RNA sequence.

Figure 8 shows the amino acid sequence one-hot matrix data obtained by converting the RNA sequence of Figure 7.

Figure 9 shows the multi-gap dipeptide component columnar data obtained by converting the amino acid sequence of Figure 8.

Figure 10 shows the semantic matrix data obtained by converting the RNA sequence of Figure 7 after training the semantic model.

Figure 11 shows the RNA sequence deep feature extraction network.

Figure 12 shows the amino acid sequence deep feature extraction network.

Figure 13 shows the multi-gap dipeptide component deep feature extraction network.

Figure 14 shows the RNA sequence semantic deep feature extraction network.

Figure 15 is a flow chart of the optimal multi-label chain learning algorithm that fuses multi-label feature learning and multi-label learning.

Figure 16 is a line chart comparing the per-class performance of the algorithm of the present invention with that of an existing algorithm.

Detailed Description

The present invention is described in detail below with reference to the accompanying drawings and examples.

As shown in Figures 1 to 6, the present invention implements RNA-binding protein identification that fuses multi-view learning and optimal multi-label chain learning. The method comprises four parts: initial multi-view feature construction, deep multi-view feature extraction, multi-label feature model training, and optimal multi-label chain classifier training. The initial multi-view feature construction part obtains the initial multi-view features of the original RNA sequence; the deep multi-view feature extraction part performs deep feature learning on the initial multi-view features to obtain multi-view deep features; the multi-label feature model training part uses the multi-view deep features to construct label-specific weighted feature vectors; and the optimal multi-label chain classifier training part uses the weighted feature vectors to train a CC classifier that learns the label correlations and produces the final prediction.

Specific steps of the training phase. The initial multi-view feature construction part of the method first extracts four kinds of features from the original RNA sequence (the RNA sequence, the amino acid sequence, the multi-gap dipeptide components, and the RNA sequence semantic matrix) and assembles them into multi-view data with four views.

The original RNA sequence is a text sequence; one-hot encoding converts it into a numerical matrix. The algorithm uses the RNA sequence data as the feature of the RNA view. Figure 7 plots the one-hot encoded RNA sequence feature, where the horizontal axis represents a specific RNA sequence and the vertical axis represents the one-hot encoding rule.
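As an illustration of the one-hot encoding described above, the following minimal sketch builds a matrix with one row per base and one column per sequence position, matching the axis layout described for Figure 7. The alphabet order A, C, G, U and the handling of unknown symbols as all-zero columns are assumptions; the function name `one_hot_encode` is hypothetical.

```python
ALPHABET = "ACGU"  # assumed row order; the source does not specify it


def one_hot_encode(seq, alphabet=ALPHABET):
    """Encode a sequence as a len(alphabet) x len(seq) 0/1 matrix.

    Rows follow the alphabet (the encoding rule), columns follow the
    sequence positions. Symbols outside the alphabet yield an all-zero
    column (an assumption made for this sketch).
    """
    index = {base: i for i, base in enumerate(alphabet)}
    matrix = [[0] * len(seq) for _ in alphabet]
    for col, base in enumerate(seq):
        row = index.get(base)
        if row is not None:
            matrix[row][col] = 1
    return matrix


for row in one_hot_encode("AAU"):
    print(row)
# [1, 1, 0]
# [0, 0, 0]
# [0, 0, 0]
# [0, 0, 1]
```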

Example 1

Following the training-phase procedure, this example was carried out on the RNA-RBP binding data of the AURA2 dataset. The dataset contains 67 RBPs, 73,681 RNA sequences, and 550,386 binding sites between them, as shown in Table 1. The number of sample RNAs bound by each RBP differs widely. Because the RNA sequences differ in length, a uniform length of 2700 is used, and shorter sequences are padded with the base B. Table 2 compares RRMVL, the method of the present invention, with current state-of-the-art methods in the field.
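The length normalization mentioned above can be sketched as follows. The source only states that shorter sequences are filled with the base B up to length 2700; padding on the right and truncating longer sequences are assumptions made for this sketch, and `pad_sequence` is a hypothetical helper name.

```python
def pad_sequence(seq, target_len=2700, pad_symbol="B"):
    """Bring a sequence to a fixed length.

    Shorter sequences are right-padded with pad_symbol (padding side is
    an assumption). Longer sequences are truncated, which is also an
    assumption; the source does not say how they are handled.
    """
    if len(seq) >= target_len:
        return seq[:target_len]
    return seq + pad_symbol * (target_len - len(seq))


print(pad_sequence("ACGU", target_len=8))  # ACGUBBBB
print(len(pad_sequence("ACGUACGU")))       # 2700
```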

Table 2. Performance metrics of the algorithm in Example 1

The comparison includes a decision tree classifier without deep learning, iDeepM (a current state-of-the-art method in the field), and the prediction performance of each single-view sub-model and of the overall model under RRMVL. As the table above shows, the deep-learning-based iDeepM model and every single-view model under RRMVL outperform the decision tree model, showing the clear advantage of deep learning in extracting features from long samples. At the same time, RRMVL with all view models integrated achieves higher AUC and F1 values than any single-view model, which reflects the complementarity of information across the views and shows that a multi-view treatment of the data can work well in bioinformatics. Among the single views, the multi-gap dipeptide component view performs best, because multi-gap dipeptides contain not only sequence order information but also sequence composition and structural information, making this the most informative of all the views. The RNA sequence semantic view performs slightly worse than the initial RNA sequence view: training a good semantic model usually requires millions of samples, whereas the dataset of the present invention contains only 92,102 RNA sequences, which is not enough to train well-performing 6-mer RNA word vectors. Overall, among the three compared algorithms, RRMVL achieves the best results on all three AUC metrics and all three F1 metrics, showing that the multi-view-based optimal multi-label chain learning method achieves the expected results in identifying RNA-binding proteins.
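For readers unfamiliar with multi-label F1 scores, the sketch below computes two common variants, micro-averaged and macro-averaged F1, from 0/1 label matrices. The source does not say which three F1 variants it reports, so the choice of micro and macro here is an assumption, and the toy labels are made up for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there are no positives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for 0/1 matrices of shape samples x labels."""
    n_labels = len(y_true[0])
    counts = []
    for k in range(n_labels):
        pairs = [(t[k], p[k]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        counts.append((tp, fp, fn))
    # Macro: average the per-label F1 scores.
    macro = sum(f1_from_counts(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1_from_counts(*(sum(c[i] for c in counts) for i in range(3)))
    return micro, macro


y_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
y_pred = [[1, 0], [0, 1], [0, 1], [1, 1]]
micro, macro = multilabel_f1(y_true, y_pred)
print(round(micro, 4), round(macro, 4))  # 0.6667 0.65
```

Macro averaging weights every label equally, so classes with few samples influence it as much as large classes, while micro averaging is dominated by frequent labels.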

Example 2

To test the effect of the multi-label feature learning and the optimal multi-label chain learning used in the present invention, two comparison experiments were carried out on the AURA dataset with RRMVL and its variants: RRMVL using multi-view voting-based ensemble learning versus RRMVL using multi-label feature learning, and RRMVL without multi-label learning versus RRMVL with optimal multi-label chain learning. Because the multi-view voting-based ensemble learning model is not a classifier, it has no AUC metric; the five-fold cross-validation results of the remaining methods are shown in the table below.
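The voting-based baseline mentioned above can be pictured with the following minimal sketch. The source does not specify the voting rule, so a simple per-label majority vote over the four single-view predictions is assumed, with ties resolved to 0; `majority_vote` and the toy predictions are hypothetical.

```python
def majority_vote(view_predictions):
    """Combine per-view 0/1 predictions into one label vector.

    view_predictions[v][k] is the prediction of view v for label k.
    A label is set to 1 only when strictly more than half of the views
    vote for it (ties resolve to 0, an assumption of this sketch).
    """
    n_views = len(view_predictions)
    n_labels = len(view_predictions[0])
    return [
        1 if 2 * sum(view[k] for view in view_predictions) > n_views else 0
        for k in range(n_labels)
    ]


views = [
    [1, 0, 1],  # RNA sequence view
    [1, 0, 0],  # amino acid sequence view
    [0, 1, 1],  # multi-gap dipeptide view
    [1, 0, 0],  # RNA sequence semantic view
]
print(majority_vote(views))  # [1, 0, 0]
```

A vote of this kind outputs hard labels rather than scores, which is consistent with the statement that no AUC is reported for the voting-based ensemble.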

Table 3. Performance of the multi-label feature learning model and the optimal multi-label chain learning model in Example 2

As the table above shows, for multi-view data, the model with multi-label feature learning consistently outperforms voting-based ensemble learning, indicating that multi-label feature learning makes full use of the advantages of multi-view data. For multi-label classification, the method using a multi-label classifier consistently outperforms the method without multi-label techniques, showing that the correlations between labels have a non-negligible effect on prediction. Notably, the AUC of RRMVL decreases after multi-label learning, because the classification performance of the multi-label CC classifier is slightly below that of the sigmoid layer at the end of the neural network. On the three F1 metrics, RRMVL, the multi-view multi-label learning method, achieves the best results, again showing that the proposed method can fairly accurately identify which RBPs an unexplored RNA can bind to.

Example 3

To study the influence of the number of samples per class on the experimental results, the present invention runs RRMVL separately on the 68 class datasets; the results, compared with those of iDeepM, are shown in the table below.

Table 3. Prediction performance for different RBPs

The prediction accuracy line chart is shown in Figure 16. As Figure 16 shows, of the two compared algorithms, RRMVL achieves the best prediction accuracy on most classes. For both methods, as the number of samples per class increases, every metric gradually improves and then levels off. When the number of samples is below 5000, the metrics fluctuate strongly, because some classes have too few samples for the model to learn their deep features well. Comparing the two curves, iDeepM learns less well than RRMVL in the low-sample regime, as shown by its larger fluctuations, which indirectly reflects the advantage of multi-view data in small-sample learning. Overall, the proposed method achieves the expected results on each class dataset.

Claims

1. A method for RNA-binding protein identification fusing multi-view and optimal multi-label chain learning, characterized in that the steps of the training phase are:

Step 1: Use one-hot encoding to encode the original RNA sequence into a numerical matrix, as the initial RNA sequence feature X¹;

Step 2: Use the principles of molecular biology to convert the original RNA sequence into an amino acid sequence, then use one-hot encoding to convert it into a numerical matrix, as the initial amino acid sequence feature X²;

Step 3: Use statistical principles to convert the amino acid sequence into a multi-gap dipeptide columnar numerical matrix, as the initial dipeptide component feature X³;

Step 4: Use Word2Vec to build a model that learns word vectors with 6-mer RNA as the vocabulary, and convert the RNA sequence into an RNA sequence semantic matrix, as the initial RNA sequence semantic feature X⁴; obtain the initial multi-view dataset D = {X¹, X², X³, X⁴, y};

Step 5: Use X¹ and y to train the RNA sequence deep feature extraction network, and take the output of the penultimate layer of the CNN network architecture used for RNA sequence deep feature extraction as the RNA sequence deep feature

Step 6: Use X² and y to train the amino acid sequence deep feature extraction network, and take the output of the penultimate layer of the CNN network architecture used for amino acid view deep feature extraction as the amino acid sequence deep feature

Step 7: Use X³ and y to train the multi-gap dipeptide component deep feature extraction network, and take the output of the penultimate layer of the CNN network architecture used for multi-gap dipeptide view deep feature extraction as the multi-gap dipeptide component deep feature

Step 8: Use X⁴ and y to train the RNA sequence semantic deep feature extraction network, and take the output of the penultimate layer of the CNN network architecture used for RNA sequence semantic deep feature extraction as the RNA sequence semantic deep feature

Step 9: Concatenate the deep features of the four views to form a single feature vector, and use it together with y to train the multi-label feature learning model, obtaining the weighted feature vector associated with each label

Step 10: Use the weighted feature vectors and y to train the optimal chained CC multi-label classifier model;

The optimal chained CC multi-label classifier model uses the weighted feature vector corresponding to each label to train the corresponding sub-classifier of the CC multi-label classifier, thereby obtaining a classifier model with better classification performance;

Step 11: Use the initial multi-view feature building model on the test data to build a preliminary multi-view test dataset

Step 12: Use a deep multi-view feature extraction model to obtain a deep multi-view test dataset

Step 13: Concatenate the deep features of the views to form a single feature vector

Step 14: Input the resulting features into the trained optimal chained CC multi-label classifier to obtain all predicted label values

2. The RNA-binding protein identification fusing multi-view and optimal multi-label chain learning as claimed in claim 1, characterized in that the CNN network architecture used for RNA view deep feature extraction in the fifth step comprises 1 two-dimensional convolutional layer, 1 pooling layer, 1 flattening layer, 2 dropout layers, and 2 fully connected layers; the first layer is a convolutional layer with 101 convolution kernels of size 4*10, producing 101 feature maps of size 1*2701; the second layer is a pooling layer with a pooling length of 3, producing 101 feature maps of size 1*900; the third layer is a flattening layer, producing one feature map of size 1*90900; the fourth layer is a dropout layer with a probability of 0.5, producing a 1*90900 feature map; the fifth layer is a fully connected layer, which converts the 1*90900 feature map into a 1*202 vector; the sixth layer is a dropout layer with a probability of 0.5, producing a 1*202 feature map; the seventh layer is a fully connected layer, which converts the 1*202 feature map into a 1*68 vector.

3. The RNA-binding protein identification fusing multi-view and optimal multi-label chain learning as claimed in claim 1, characterized in that the CNN network architecture used for amino acid view deep feature extraction in the sixth step comprises 1 two-dimensional convolutional layer, 1 pooling layer, 1 flattening layer, 2 dropout layers, and 2 fully connected layers; the first layer is a convolutional layer with 101 convolution kernels of size 20*10, producing 101 feature maps of size 1*2701; the second layer is a pooling layer with a pooling length of 3, producing 101 feature maps of size 1*900; the third layer is a flattening layer, producing one feature map of size 1*90900; the fourth layer is a dropout layer with a probability of 0.5, producing a 1*90900 feature map; the fifth layer is a fully connected layer, which converts the 1*90900 feature map into a 1*202 vector; the sixth layer is a dropout layer with a probability of 0.5, producing a 1*202 feature map; the seventh layer is a fully connected layer, which converts the 1*202 feature map into a 1*68 vector.

4. The RNA-binding protein identification fusing multi-view and optimal multi-label chain learning as claimed in claim 1, characterized in that the CNN network architecture used for multi-gap dipeptide view deep feature extraction in the seventh step comprises 1 two-dimensional convolutional layer, 1 flattening layer, 2 dropout layers, and 2 fully connected layers; the first layer is a convolutional layer with 101 convolution kernels of size 30*10, producing 101 feature maps of size 1*871; the second layer is a flattening layer, producing one feature map of size 1*87971; the third layer is a dropout layer with a probability of 0.5, producing a 1*87971 feature map; the fourth layer is a fully connected layer, which converts the 1*87971 feature map into a 1*202 vector; the fifth layer is a dropout layer with a probability of 0.5, producing a 1*202 feature map; the sixth layer is a fully connected layer, which converts the 1*202 feature map into a 1*68 vector.

5. The RNA-binding protein identification fusing multi-view and optimal multi-label chain learning as claimed in claim 1, characterized in that the CNN network architecture used for RNA sequence semantic deep feature extraction in the eighth step comprises 1 two-dimensional convolutional layer, 1 pooling layer, 1 flattening layer, 2 dropout layers, and 2 fully connected layers; the first layer is a convolutional layer with 101 convolution kernels of size 25*10, producing 101 feature maps of size 1*2701; the second layer is a pooling layer with a pooling length of 3, producing 101 feature maps of size 1*900; the third layer is a flattening layer, producing one feature map of size 1*90900; the fourth layer is a dropout layer with a probability of 0.5, producing a 1*90900 feature map; the fifth layer is a fully connected layer, which converts the 1*90900 feature map into a 1*202 vector; the sixth layer is a dropout layer with a probability of 0.5, producing a 1*202 feature map; the seventh layer is a fully connected layer, which converts the 1*202 feature map into a 1*68 vector.

6. The RNA-binding protein identification fusing multi-view and optimal multi-label chain learning as claimed in claim 2, characterized in that the last layer of each of the CNN network architectures used for RNA view deep feature extraction, amino acid view deep feature extraction, multi-gap dipeptide view deep feature extraction, and RNA sequence semantic view deep feature extraction uses the sigmoid function as the activation function to introduce a nonlinear transformation, the remaining layers use the ReLU function as the activation function, and the loss function of all four networks is the binary cross-entropy loss function.

7. The RNA-binding protein identification fusing multi-view and optimal multi-label chain learning as claimed in claim 3, characterized in that the last layer of each of the CNN network architectures used for RNA view deep feature extraction, amino acid view deep feature extraction, multi-gap dipeptide view deep feature extraction, and RNA sequence semantic view deep feature extraction uses the sigmoid function as the activation function to introduce a nonlinear transformation, the remaining layers use the ReLU function as the activation function, and the loss function of all four networks is the binary cross-entropy loss function.

8. The RNA-binding protein identification fusing multi-view and optimal multi-label chain learning as claimed in claim 4, characterized in that the last layer of each of the CNN network architectures used for RNA view deep feature extraction, amino acid view deep feature extraction, multi-gap dipeptide view deep feature extraction, and RNA sequence semantic view deep feature extraction uses the sigmoid function as the activation function to introduce a nonlinear transformation, the remaining layers use the ReLU function as the activation function, and the loss function of all four networks is the binary cross-entropy loss function.

9. The RNA-binding protein identification fusing multi-view and optimal multi-label chain learning as claimed in claim 5, characterized in that the last layer of each of the CNN network architectures used for RNA view deep feature extraction, amino acid view deep feature extraction, multi-gap dipeptide view deep feature extraction, and RNA sequence semantic view deep feature extraction uses the sigmoid function as the activation function to introduce a nonlinear transformation, the remaining layers use the ReLU function as the activation function, and the loss function of all four networks is the binary cross-entropy loss function.
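The layer dimensions recited in claims 2 to 5 can be cross-checked with simple arithmetic. The sketch below assumes an unpadded, stride-1 convolution and a pooling layer that discards any remainder; both are assumptions, as the claims give only the resulting sizes. The helper names are hypothetical.

```python
def conv_valid_width(input_width, kernel_width):
    """Output width of an unpadded, stride-1 convolution (assumed setup)."""
    return input_width - kernel_width + 1


def pooled_width(width, pool_length):
    """Output width of non-overlapping pooling that drops the remainder."""
    return width // pool_length


N_KERNELS = 101
KERNEL_WIDTH = 10

# Claims 2, 3 and 5: 101 maps of 1*2701 -> pooling by 3 -> flatten.
conv_width = 2701
pool_width = pooled_width(conv_width, 3)        # 900
flat_pooled = N_KERNELS * pool_width            # 90900

# Claim 4: 101 maps of 1*871 -> flatten (no pooling layer).
flat_dipeptide = N_KERNELS * 871                # 87971

# Four views, each contributing a 1*202 penultimate-layer vector.
concatenated = 4 * 202                          # 808

# Input width implied by a 1*2701 output under the assumed convolution.
implied_input_width = conv_width + KERNEL_WIDTH - 1  # 2710

print(pool_width, flat_pooled, flat_dipeptide, concatenated)
```

Under these assumptions the recited sizes are mutually consistent, and the four 202-dimensional penultimate-layer outputs add up to the 808-dimensional input of the multi-label feature learning model described in the description.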