WO2023044927A1 - RNA-protein interaction prediction method, device, medium and electronic equipment - Google Patents
RNA-protein interaction prediction method, device, medium and electronic equipment
- Publication number
- WO2023044927A1 (PCT/CN2021/121089)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- rna
- protein
- sequence
- mer
- pair
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/30—Drug targeting using structural data; Docking or binding prediction
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
Definitions
- the present disclosure relates to the technical field of artificial intelligence, in particular, to a method for predicting RNA-protein interaction, a device for predicting RNA-protein interaction, a computer-readable storage medium and electronic equipment.
- Noncoding RNA (ncRNA) participates in many complex cellular processes, plays an important role in life processes such as alternative splicing, chromatin modification and epigenetics, and is closely related to many diseases. Studies have shown that most noncoding RNAs achieve their regulatory functions by interacting with proteins. Therefore, studying the interaction between noncoding RNA and protein is of great significance for revealing the molecular mechanism of noncoding RNA in human diseases and life activities, and has become one of the important ways to analyze the function of noncoding RNA and protein.
- the present disclosure provides a method for predicting RNA-protein interaction, a device for predicting RNA-protein interaction, a computer-readable storage medium and electronic equipment.
- the present disclosure provides a method for predicting RNA-protein interaction, including:
- obtaining an RNA-protein pair to be predicted;
- performing feature extraction on the RNA-protein pair to be predicted to obtain sequence features of the RNA-protein pair to be predicted;
- vectorizing the RNA-protein pair to be predicted, and obtaining an RNA sequence representation vector and a protein sequence representation vector in the RNA-protein pair to be predicted;
- based on the sequence features of the RNA-protein pair to be predicted, and the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, using multiple interaction prediction models to obtain multiple interaction prediction values of the RNA-protein pair to be predicted;
- An interaction between the RNA and the protein is determined based on the plurality of interaction prediction values.
- the feature extraction of the RNA-protein pair to be predicted is performed to obtain the sequence features of the RNA-protein pair to be predicted, including:
- sequence features of the RNA-protein pair to be predicted are determined according to the original sequence feature set.
- the determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set includes:
- Each k-mer subsequence is searched in the original sequence feature set, and the sequence feature of the RNA-protein pair to be predicted is obtained according to the search result.
- the determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set includes:
- RNA sequence and the protein sequence in the RNA-protein pair to be predicted are respectively converted into k-mer subsequences, and the k-mer subsequences include RNA k-mer subsequences and protein k-mer subsequences;
- the sequence characteristics of the RNA-protein pair to be predicted are composed of the first sequence characteristics and the second sequence characteristics.
- said obtaining the original sequence feature set includes:
- the feature extraction is performed on each RNA-protein pair in the original data set to obtain the original sequence feature set, including:
- the basic units of RNA and protein are arranged and combined to obtain k-mer subsequences;
- the original sequence feature set is determined according to the variance of each k-mer sequence.
- counting the occurrence frequency of each k-mer subsequence in the original data set, and calculating the variance of each k-mer subsequence according to the occurrence frequency, includes:
- the variance of each k-mer subsequence is calculated according to the occurrence frequency of each k-mer subsequence in the original data set and the marker value in each RNA-protein pair.
- calculating the variance of each k-mer subsequence according to the occurrence frequency of each k-mer subsequence in the original data set and the marker value in each RNA-protein pair includes:
- the variance Var_i of the i-th k-mer subsequence is calculated from the marker value of the i-th k-mer subsequence in the n-th RNA-protein pair, where
- Freq_i is the occurrence frequency of the i-th k-mer subsequence in the original data set
- N is the total number of RNA-protein pairs in the original data set.
- the determining the original sequence feature set according to the variance of each k-mer sequence includes:
- the k-mer subsequences satisfying the preset condition are determined according to the variance of each k-mer subsequence, and the original sequence feature set is composed of the k-mer subsequences satisfying the preset condition.
- performing feature extraction on each RNA-protein pair in the original data set to obtain the original sequence feature set further includes:
- the RNA sequence and the protein sequence in each RNA-protein pair are converted into k-mer subsequences respectively, and the first candidate item set is formed from the k-mer subsequences, which include RNA k-mer subsequences and protein k-mer subsequences;
- RNA k-mer subsequence and the protein k-mer subsequence in the frequent itemset are cross-combined, and the k-mer subsequence obtained by combination forms the second candidate itemset;
- the original sequence feature set is composed of k-mer subsequence pairs whose support meets a preset condition.
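The frequent-itemset route above can be illustrated with a minimal Python sketch. It is an Apriori-style reading of the text, not the patent's exact procedure: the function names, the definition of support as "fraction of RNA-protein pairs containing the item", and the single `min_support` threshold are all assumptions made for illustration. Protein sequences are assumed to be already recoded into class digits.

```python
from collections import Counter
from itertools import product


def kmers(seq, k):
    """Return the set of overlapping k-mers found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def frequent_kmer_pairs(pairs, k, min_support):
    """pairs: list of (rna_seq, protein_seq) tuples.

    Returns {(rna_kmer, protein_kmer): support} for k-mer pairs whose
    support (assumed: fraction of RNA-protein pairs containing both
    k-mers) is at least min_support.
    """
    n = len(pairs)
    # First candidate item set: individual RNA and protein k-mers.
    rna_counts, prot_counts = Counter(), Counter()
    for rna, prot in pairs:
        rna_counts.update(kmers(rna, k))
        prot_counts.update(kmers(prot, k))
    freq_rna = {m for m, c in rna_counts.items() if c / n >= min_support}
    freq_prot = {m for m, c in prot_counts.items() if c / n >= min_support}

    # Second candidate item set: cross-combine frequent RNA and protein k-mers.
    pair_counts = Counter()
    for rna, prot in pairs:
        r = kmers(rna, k) & freq_rna
        p = kmers(prot, k) & freq_prot
        pair_counts.update(product(r, p))
    return {pq: c / n for pq, c in pair_counts.items() if c / n >= min_support}


toy_pairs = [("AGAU", "1246"), ("AGAC", "1241"), ("UUUU", "7777")]
result = frequent_kmer_pairs(toy_pairs, k=3, min_support=0.5)
```

On the toy data only the pair ("AGA", "124") occurs in at least half of the RNA-protein pairs, so it is the only entry kept.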
- the vectorization of the RNA-protein pair to be predicted to obtain the RNA sequence representation vector and protein sequence representation vector in the RNA-protein pair to be predicted includes:
- the RNA sequence and the protein sequence in the RNA-protein pair to be predicted are converted into k-mer subsequences respectively, and the k-mer subsequences include M RNA k-mer subsequences and N protein k-mer subsequences;
- the interaction prediction model respectively obtains multiple interaction prediction values of the RNA-protein pair to be predicted, including:
- inputting the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into at least one second interaction prediction model to obtain at least one second interaction prediction value.
- the interaction prediction model respectively obtains multiple interaction prediction values of the RNA-protein pair to be predicted, including:
- inputting the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into at least one deep learning model to obtain at least one second interaction prediction value.
- each of the deep learning models includes at least two sub-deep learning models; inputting the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted into at least one deep learning model to obtain at least one second interaction prediction value includes:
- the first sequence feature and the second sequence feature are fused, and the second interaction prediction value is obtained according to the fused feature.
- the traditional machine learning model includes at least one of a logistic regression model, a support vector machine model, and a decision tree model
- the deep learning model includes at least one of a convolutional neural network model and a recurrent neural network model.
- the determining the interaction between the RNA and the protein according to the plurality of interaction prediction values includes:
- the interaction between the RNA and the protein is determined based on the calculation results.
- the determining the interaction between the RNA and the protein according to the calculation results includes:
- when the calculation result is less than or equal to the preset interaction prediction value threshold, it is determined that there is no interaction between the RNA and the protein.
- the method further includes:
- the plurality of interaction prediction models are jointly trained.
- the joint training of the multiple interaction prediction models includes:
- the RNA-protein pairs in the training data set include positive-example RNA-protein pairs and negative-example RNA-protein pairs
- Model parameters of the plurality of interaction prediction models are adjusted according to the loss value.
- the obtaining the joint prediction value of each RNA-protein pair in the training data set according to the plurality of interaction prediction values includes:
- a weighted summation is performed on multiple interaction prediction values of each RNA-protein pair in the training data set to obtain a joint prediction value of each RNA-protein pair.
- the weighted summation of multiple interaction prediction values of each RNA-protein pair in the training data set to obtain a joint prediction value of each RNA-protein pair includes:
- the joint prediction value y_out of each RNA-protein pair in the training data set is calculated as a weighted sum of y_1, y_2 and y_3, where
- y_1 is the output value of the traditional machine learning model
- y_2 is the output value of the convolutional neural network model
- y_3 is the output value of the recurrent neural network model
- α, β, γ are the weight parameters of the traditional machine learning model, the convolutional neural network model and the recurrent neural network model, respectively.
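The weighted summation and the threshold decision described above can be sketched in a few lines of Python. The function names, the example weights and the threshold of 0.5 are illustrative assumptions; the text only states that the three model outputs are combined with one weight parameter per model and that a result at or below a preset threshold means no interaction.

```python
def joint_prediction(y1, y2, y3, alpha, beta, gamma):
    """Weighted sum of the three model outputs.

    y1: traditional machine learning model output
    y2: convolutional neural network output
    y3: recurrent neural network output
    alpha, beta, gamma: per-model weights (hypothetical values below)
    """
    return alpha * y1 + beta * y2 + gamma * y3


def interacts(y_out, threshold=0.5):
    """True if the joint value exceeds the preset threshold.

    A value less than or equal to the threshold is treated as
    "no interaction". The default 0.5 is an assumption.
    """
    return y_out > threshold


y_out = joint_prediction(0.9, 0.6, 0.3, alpha=0.2, beta=0.5, gamma=0.3)
decision = interacts(y_out)
```

In joint training the weights would be learned together with the model parameters; here they are fixed constants purely to make the arithmetic visible.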
- the adjusting model parameters of the plurality of interaction prediction models according to the loss value includes:
- the plurality of interaction prediction models are used to predict the interaction of the RNA-protein pair to be predicted.
- the method further includes:
- a prediction result of the interaction between the RNA and the protein is output.
- An RNA-protein interaction prediction device, including:
- a data acquisition module used to acquire the RNA-protein pair to be predicted;
- a feature extraction module configured to perform feature extraction on the RNA-protein pair to be predicted, to obtain sequence features of the RNA-protein pair to be predicted;
- a data vectorization module used to vectorize the RNA-protein pair to be predicted, and obtain the RNA sequence representation vector and protein sequence representation vector in the RNA-protein pair to be predicted;
- an interaction prediction module used to obtain multiple interaction prediction values of the RNA-protein pair to be predicted by using multiple interaction prediction models, based on the sequence features of the RNA-protein pair to be predicted and the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted;
- An interaction determination module configured to determine the interaction between the RNA and the protein according to the plurality of interaction prediction values.
- the present disclosure provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, any one of the methods described above is implemented.
- the present disclosure provides an electronic device, including: a processor; and a memory, configured to store executable instructions of the processor; wherein the processor is configured to execute any one of the methods described above by executing the executable instructions.
- Figure 1 shows a schematic diagram of an exemplary system architecture of an RNA-protein interaction prediction method and device that can be applied to an embodiment of the present disclosure
- Figure 2 schematically shows a flow chart of an RNA-protein interaction prediction method according to an embodiment of the present disclosure
- Fig. 3 schematically shows a flow chart of determining the sequence characteristics of the RNA-protein pair to be predicted according to an embodiment of the present disclosure
- Fig. 4 schematically shows a flow chart of obtaining an original sequence feature set according to an embodiment of the present disclosure
- Fig. 5 schematically shows a flow chart of obtaining an original sequence feature set according to another embodiment of the present disclosure
- FIG. 6 schematically shows a flow chart of training multiple interaction prediction models according to an embodiment of the present disclosure
- Figure 7 schematically shows a block diagram of an RNA-protein interaction prediction device according to an embodiment of the present disclosure
- FIG. 8 shows a schematic structural diagram of a computer system suitable for implementing the electronic device of the embodiment of the present disclosure.
- Example embodiments will now be described more fully with reference to the accompanying drawings.
- Example embodiments may, however, be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concept of example embodiments to those skilled in the art.
- the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
- numerous specific details are provided in order to give a thorough understanding of embodiments of the present disclosure.
- those skilled in the art will appreciate that the technical solutions of the present disclosure may be practiced with one or more of the specific details omitted, or with other methods, components, devices, steps, etc.
- well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
- FIG. 1 shows a schematic diagram of the system architecture of an exemplary application environment in which a method and device for predicting RNA-protein interaction according to an embodiment of the present disclosure can be applied.
- the system architecture 100 may include one or more of terminal devices 101 , 102 , 103 , a network 104 and a server 105 .
- the network 104 is used as a medium for providing communication links between the terminal devices 101 , 102 , 103 and the server 105 .
- Network 104 may include various connection types, such as wires, wireless communication links, or fiber optic cables, among others.
- Terminal devices 101, 102, 103 may be various electronic devices, including but not limited to desktop computers, portable computers, smart phones, and tablet computers. It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are only illustrative. According to the implementation needs, there can be any number of terminal devices, networks and servers.
- the server 105 may be one server, or a server cluster composed of multiple servers, or a cloud computing platform or a virtualization center.
- the server 105 can be used to perform: obtaining the RNA-protein pair to be predicted; performing feature extraction on the RNA-protein pair to be predicted to obtain its sequence features; vectorizing the RNA-protein pair to be predicted to obtain the RNA sequence representation vector and the protein sequence representation vector in the pair; using multiple interaction prediction models, based on the sequence features, the RNA sequence representation vector and the protein sequence representation vector, to obtain multiple interaction prediction values of the RNA-protein pair to be predicted; and determining the interaction between the RNA and the protein according to the multiple interaction prediction values.
- the RNA-protein interaction prediction method provided by the embodiments of the present disclosure is generally executed by the server 105.
- the RNA-protein interaction prediction device is generally set in the server 105, and the server can send the prediction result of the RNA-protein interaction to be predicted to the terminal device, which displays it to the user.
- the RNA-protein interaction prediction method provided by the embodiments of the present disclosure can also be executed by one or more of the terminal devices 101, 102, and 103.
- the prediction device can also be set in the terminal equipment 101, 102, 103; for example, after execution by the terminal equipment, the prediction result can be directly displayed on the display screen of the terminal equipment, or provided to the user through voice broadcast, which is not specifically limited in this exemplary embodiment.
- ncRPI: noncoding RNA-protein interactions
- The RNA-protein interaction prediction method may include the following steps S210 to S250:
- Step S210: Obtain the RNA-protein pair to be predicted;
- Step S220: Perform feature extraction on the RNA-protein pair to be predicted to obtain the sequence features of the RNA-protein pair to be predicted;
- Step S230: Vectorize the RNA-protein pair to be predicted, and obtain the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted;
- Step S240: Based on the sequence features of the RNA-protein pair to be predicted, and the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, use multiple interaction prediction models to obtain multiple interaction prediction values of the RNA-protein pair to be predicted;
- Step S250: Determine the interaction between the RNA and the protein according to the plurality of interaction prediction values.
- the RNA-protein pair to be predicted is obtained; feature extraction is performed on the RNA-protein pair to be predicted to obtain its sequence features; the RNA-protein pair to be predicted is vectorized to obtain the RNA sequence representation vector and the protein sequence representation vector in the pair; based on the sequence features, the RNA sequence representation vector and the protein sequence representation vector, multiple interaction prediction models are used to obtain multiple interaction prediction values of the RNA-protein pair to be predicted; and the interaction between the RNA and the protein is determined according to the plurality of interaction prediction values.
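The flow of steps S210 to S250 can be summarised as a small Python skeleton. This is a structural sketch only: `predict_interaction` and every callable passed to it are hypothetical placeholders, each model here receives all three inputs and ignores what it does not need, and the threshold of 0.5 is an assumption.

```python
from statistics import mean


def predict_interaction(rna, protein, extract_features, vectorize,
                        models, combine, threshold=0.5):
    """Run the S210-S250 flow for one RNA-protein pair (S210: the inputs)."""
    features = extract_features(rna, protein)            # S220: sequence features
    rna_vec, prot_vec = vectorize(rna, protein)          # S230: representation vectors
    scores = [m(features, rna_vec, prot_vec) for m in models]  # S240: one value per model
    return combine(scores) > threshold                   # S250: final decision


# Toy stand-ins so the skeleton can be exercised end to end.
def toy_features(rna, protein):
    return [len(rna), len(protein)]


def toy_vectorize(rna, protein):
    return [1.0], [2.0]


toy_models = [lambda f, rv, pv: 0.8, lambda f, rv, pv: 0.6]

outcome = predict_interaction("AGAUGG", "ALQDVG", toy_features,
                              toy_vectorize, toy_models, mean)
```

Passing the feature extractor, vectorizer, models and combiner in as arguments keeps the skeleton independent of any particular model family.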
- the RNA sequences and protein sequences can be fully mined, so that the interaction between RNA and protein can be predicted accurately;
- effectively combining the characteristics of the multiple interaction prediction models can further improve the accuracy of predicting the interaction between RNA and protein.
- In step S210, the RNA-protein pair to be predicted is obtained.
- the RNA-protein pair to be predicted can be obtained, and the interaction between the RNA and the protein in each RNA-protein pair to be predicted is unknown.
- the user can input the RNA-protein pair to be predicted through the terminal device.
- the user may manually input the RNA-protein pair to be predicted, or input the RNA-protein pair to be predicted by voice, which is not specifically limited in this example.
- an RNA can be input, and then a protein can be input, and there is no limitation on the input order of the two.
- RNA and protein can be entered into different text boxes, or they can be entered into the same text box. For example, after the input is completed, click the "Start Prediction" button, and then start to execute the prediction steps provided in some embodiments of the present application.
- the interaction between RNA and protein means that the function of a protein is reflected in its interaction with other proteins and RNA.
- protein-RNA interactions play an important role in protein synthesis.
- many functions of RNA are also inseparable from the interaction with proteins.
- the interaction can be regulation, guidance, etc., and is not limited here.
- when an interaction is present, RNA can guide protein synthesis, or RNA can regulate protein function.
- the interaction between RNA and protein can also mean that the two can regulate each other's life cycle and function through physical interaction.
- the RNA coding sequence can guide protein synthesis, and correspondingly, the protein can also regulate the expression and function of RNA.
- the prediction result of the RNA-protein interaction to be predicted can also be output to the terminal device for users to view.
- the prediction result may be directly displayed on the display screen of the terminal device, or the prediction result may be provided to the user through voice broadcast, which is not specifically limited in this example.
- At least one RNA sequence to be predicted can also be obtained, and a protein sequence that interacts with each input RNA sequence to be predicted can be searched for in the database through multiple interaction prediction models.
- at least one protein sequence in the database can be selected, and multiple RNA-protein pairs are formed from the RNA sequence to be predicted and each protein sequence; the multiple interaction prediction models then predict the interaction of each RNA-protein pair and output the protein sequences that can interact with the RNA sequence to be predicted according to the prediction results.
- several types of protein sequences can be stored in the database in advance, so as to be recalled when predicting the interaction of RNA-protein pairs.
- the protein sequence can be stored in the Redis database or in the MySQL database, and then the protein sequence to be predicted can be queried and selected in real time.
- Redis is a key-value storage system.
- the Redis database can include key-value pairs formed by a sequence identifier (such as a sequence number) and the corresponding protein sequence, where the key is the sequence identifier and the value is the corresponding protein sequence.
- Redis can support more than 100K read and write operations per second, and has certain advantages in data reading and storage speed.
- MySQL is a relational database management system. The relational database stores data in different tables instead of storing all data in a unified manner, which increases storage speed and flexibility. It has stable advantages in data storage and can avoid
- RNA sequences can also be stored in the database in advance, so as to be recalled when predicting the interaction of RNA-protein pairs. Therefore, at least one protein sequence to be predicted can also be obtained, and an RNA sequence that interacts with each input protein sequence to be predicted can be searched for in the database through multiple interaction prediction models. Similarly, after the user enters the protein sequence through the terminal device, at least one RNA sequence in the database can be selected, and multiple RNA-protein pairs are formed from the protein sequence to be predicted and each RNA sequence. The multiple interaction prediction models can then predict the interaction of each RNA-protein pair and output the RNA sequences that can interact with the protein sequence to be predicted according to the prediction results, which is not specifically limited in the present disclosure.
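The database screening described above amounts to pairing one query sequence with every stored sequence and keeping the pairs predicted to interact. The sketch below uses a plain Python dict as a stand-in for the key-value store (sequence identifier to protein sequence); `screen_partners`, the toy scoring function and the threshold are hypothetical and only illustrate the control flow.

```python
def screen_partners(query_rna, protein_db, predict_score, threshold=0.5):
    """Return {sequence_id: score} for stored proteins predicted to interact.

    protein_db: mapping of sequence identifier -> protein sequence,
                standing in for the key-value store described in the text.
    predict_score: callable (rna_seq, protein_seq) -> float; in practice this
                would wrap the combined interaction prediction models.
    """
    hits = {}
    for seq_id, protein_seq in protein_db.items():
        score = predict_score(query_rna, protein_seq)
        if score > threshold:
            hits[seq_id] = score
    return hits


# Toy data and a toy scorer, used only to exercise the loop.
toy_db = {"P1": "ALQDVG", "P2": "CCCC"}


def toy_score(rna, protein):
    return 0.9 if "A" in protein else 0.1


hits = screen_partners("AGAUGG", toy_db, toy_score)
```

The same loop works in the other direction (one protein query against stored RNA sequences) by swapping the roles of the two arguments.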
- In step S220, feature extraction is performed on the RNA-protein pair to be predicted to obtain sequence features of the RNA-protein pair to be predicted.
- before predicting the interaction of each RNA-protein pair to be predicted through multiple interaction prediction models, it is necessary to obtain the input features of each interaction prediction model.
- feature extraction can be performed on the RNA-protein pair to be predicted, that is, on the RNA sequence and the protein sequence in the pair in turn, to obtain the corresponding RNA sequence features and protein sequence features; the sequence features of the RNA-protein pair to be predicted are composed of the RNA sequence features and the protein sequence features, and can be used as the input of the interaction prediction model.
- the RNA-protein pair to be predicted can also be vectorized, that is, the RNA sequence and the protein sequence in the RNA-protein pair are respectively vectorized to obtain the corresponding RNA sequence representation vector and protein representation vector, which are used as inputs to the interaction prediction model, respectively.
- it is also possible to perform feature extraction and vectorization on the RNA-protein pair to be predicted at the same time, to obtain the sequence features of the pair and the RNA sequence representation vector and protein representation vector in the pair; all of these can be used as the input of the interaction prediction model, which is not specifically limited in the present disclosure.
- feature extraction can be performed on the RNA-protein pair to be predicted according to step S310 and step S320.
- Step S310: Obtain the original sequence feature set.
- the original data set can be obtained, and feature extraction is performed on each RNA-protein pair in the original data set to obtain the original sequence feature set.
- the RPI1807 data set can be used as the original data set.
- This data set can contain 3243 RNA-protein pairs, and 1807 pairs of positive examples and 1436 pairs of negative examples are included in the 3243 RNA-protein pairs.
- a positive example can indicate that there is an interaction between the RNA and the protein in the RNA-protein pair
- a negative example can indicate that there is no interaction between the RNA and the protein in the RNA-protein pair.
- the RPI2241 data set, the RPI369 data set, etc. may also be used as the original data set for experimentation, which is not specifically limited in the present disclosure.
- feature extraction can be performed on the RNA-protein pairs in the original data set according to steps S410 to S430 to obtain the original sequence feature set.
- Step S410: Arrange and combine the basic units of RNA and protein to obtain k-mer subsequences.
- bases are the basic units of RNA.
- an RNA sequence can include four kinds of bases, namely adenine (A), uracil (U), guanine (G) and cytosine (C). All k-mer subsequences of the RNA sequence can be obtained by permuting and combining the four bases.
- amino acids are the basic units of proteins.
- a protein sequence can include 20 amino acids, which are coded sequentially as A, G, V, I, L, F, P, Y, M, T, S, H, N, Q, W, R, K, D, E, C.
- the 20 amino acids can be divided into {A, G, V}, {I, L, F, P}, {Y, M, T, S}, {H, N, Q, W}, {R, K}, {D, E} and {C}, 7 classes in total, and each class of amino acid is recoded, for example as 1, 2, 3, 4, 5, 6 and 7.
- the protein sequence ALQDVG can be converted to 124611.
- the seven types of amino acids can be arranged and combined to obtain all k-mer subsequences of the amino acid sequence.
- the 20 amino acids can also be classified according to amino acid composition, or the k-mer subsequences of the amino acid sequence can be obtained directly by arranging and combining the 20 amino acids without classification, which is not specifically limited in this disclosure.
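The seven-class recoding can be checked with a short Python sketch. The grouping and the class digits 1 to 7 follow the text; the names `GROUPS`, `CLASS_OF` and `recode_protein` are illustrative and not taken from the disclosure.

```python
# Seven amino-acid classes as listed in the text, in class order 1..7.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]

# Map each one-letter amino-acid code to its class digit.
CLASS_OF = {aa: str(i) for i, group in enumerate(GROUPS, start=1) for aa in group}


def recode_protein(seq):
    """Replace every amino acid in seq with its class digit."""
    return "".join(CLASS_OF[aa] for aa in seq.upper())


encoded = recode_protein("ALQDVG")
```

With this mapping the example sequence ALQDVG from the text becomes 124611, since A, V and G all fall in class 1, L in class 2, Q in class 4 and D in class 6.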
- a k-mer subsequence refers to a subsequence composed of k bases, or of k amino acid class codes, taken as a group.
- the k-mer subsequence may include an RNA k-mer subsequence and a protein k-mer subsequence.
- the k-mer subsequence may refer to an RNA k-mer subsequence obtained by permuting and combining the four types of bases, and for a given value of k, 4^k types of k-mer subsequences may be obtained.
- a k-mer subsequence may also refer to a protein k-mer subsequence obtained by permuting and combining the 7 classes of amino acids, and 7^k types of k-mer subsequences can be obtained for a given value of k. It can be understood that the classification of the 20 amino acids into 7 classes is only illustrative, and the amino acids may also be left unclassified. Similarly, the four bases of the RNA sequence can also be classified according to actual needs.
- one or more values of k may be used, and the specific value of k may be adjusted according to actual conditions, which is not limited herein.
- two values of 3 and 4 may be taken as an example for illustration.
- For example, AAA and AUC are two 3-mer subsequences of an RNA sequence, and AAAA and AAAU are two 4-mer subsequences of an RNA sequence.
- 111 and 112 are two 3-mer subsequences of the protein sequence, and 1111 and 1122 are two 4-mer subsequences of the protein sequence.
- k may only be 3 or only 4, which is not specifically limited in the present disclosure.
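- The enumeration of all possible k-mer subsequences can be sketched as follows, assuming the four-base RNA alphabet and the seven class codes 1 to 7 for proteins. The helper name `all_kmers` is hypothetical.

```python
from itertools import product

RNA_ALPHABET = "AUGC"               # the four bases
PROTEIN_CLASS_ALPHABET = "1234567"  # the seven amino-acid class codes


def all_kmers(alphabet, k):
    """Enumerate every k-mer over the alphabet (len(alphabet) ** k items)."""
    return ["".join(letters) for letters in product(alphabet, repeat=k)]


rna_3mers = all_kmers(RNA_ALPHABET, 3)
rna_4mers = all_kmers(RNA_ALPHABET, 4)
protein_3mers = all_kmers(PROTEIN_CLASS_ALPHABET, 3)
protein_4mers = all_kmers(PROTEIN_CLASS_ALPHABET, 4)

print(len(rna_3mers), len(rna_4mers), len(protein_3mers), len(protein_4mers))
# 64 256 343 2401
```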
- Step S420 Count the occurrence frequency of each k-mer subsequence in the original data set, and calculate the variance of each k-mer subsequence according to the occurrence frequency.
- For example, all 3-mer and 4-mer subsequences of RNA sequences and protein sequences can be obtained according to step S410, that is, 64 kinds of 3-mer subsequences and 256 kinds of 4-mer subsequences of RNA sequences, and 343 kinds of 3-mer subsequences and 2401 kinds of 4-mer subsequences of protein sequences.
- the occurrence frequency of each 3-mer subsequence or 4-mer subsequence in the original data set can be counted, and the variance of each 3-mer subsequence or 4-mer subsequence can be calculated according to the occurrence frequency.
- For example, for the RNA sequence AGAUGG, the 3-mer subsequences may include "AGA", "GAU", "AUG" and "UGG", and the 4-mer subsequences may include "AGAU", "GAUG" and "AUGG"; that is, the RNA sequence is read with forward overlap to obtain the corresponding 3-mer or 4-mer subsequences.
- The RNA sequence can also be read with reverse overlap, in which case the 3-mer subsequences may include "GGU", "GUA", "UAG" and "AGA", and the 4-mer subsequences may include "GGUA", "GUAG" and "UAGA".
- The RNA sequence can also be read without overlap, in which case, for example, the 3-mer subsequences may include "AGA" and "UGG". This disclosure does not specifically limit the reading mode.
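- The three reading modes can be sketched as follows. The example sequence AGAUGG is an assumption; it is the sequence consistent with the subsequences listed above. The function name `read_kmers` and its parameters are hypothetical.

```python
def read_kmers(sequence, k, reverse=False, overlapping=True):
    """Read k-mers from a sequence.

    reverse=True reads the sequence from its last character to its first;
    overlapping=False advances k characters at a time instead of one.
    """
    if reverse:
        sequence = sequence[::-1]
    step = 1 if overlapping else k
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]


rna = "AGAUGG"
print(read_kmers(rna, 3))                     # ['AGA', 'GAU', 'AUG', 'UGG']
print(read_kmers(rna, 4))                     # ['AGAU', 'GAUG', 'AUGG']
print(read_kmers(rna, 3, reverse=True))       # ['GGU', 'GUA', 'UAG', 'AGA']
print(read_kmers(rna, 3, overlapping=False))  # ['AGA', 'UGG']
```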
- The occurrences of each 3-mer and/or 4-mer subsequence of the RNA sequences and protein sequences in the original data set can be counted to obtain the occurrence frequency of each 3-mer and/or 4-mer subsequence in the original data set.
- For example, the ratio of the number of RNA-protein pairs in which a given 3-mer subsequence occurs to the total number of RNA-protein pairs in the original data set gives the occurrence frequency of that 3-mer subsequence in the original data set.
- Each 3-mer and/or 4-mer subsequence is also flagged according to whether it is present in each RNA-protein pair.
- The variance of each k-mer subsequence can then be calculated from its occurrence frequency in the original data set and its flag value in each RNA-protein pair.
- the subsequence may be a 3-mer subsequence of an RNA sequence or a protein sequence, or a 4-mer subsequence of an RNA sequence or a protein sequence.
- The number of occurrences of the subsequence in the RPI1807 data set can be counted first. For example, the N RNA-protein pairs (N = 3243) in the RPI1807 data set can be traversed; if the subsequence appears in the current RNA-protein pair, the count is increased by 1.
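- The presence-based frequency and variance can be sketched as follows. The normalization of the variance by N (population variance) is an assumption, since the text does not state it, and the toy data and function names are hypothetical. Protein sequences are assumed to be already recoded into class codes.

```python
def presence_flags(subsequence, pairs):
    """Flag (1 or 0) whether the subsequence occurs in each RNA-protein pair."""
    return [
        1 if (subsequence in rna or subsequence in protein) else 0
        for rna, protein in pairs
    ]


def presence_frequency_and_variance(subsequence, pairs):
    """Occurrence frequency over all pairs and variance of the presence flags."""
    flags = presence_flags(subsequence, pairs)
    n = len(flags)
    frequency = sum(flags) / n
    # Assumption: population variance, i.e. squared deviations divided by n.
    variance = sum((flag - frequency) ** 2 for flag in flags) / n
    return frequency, variance


# Toy stand-in for the data set: (RNA sequence, recoded protein sequence).
toy_pairs = [
    ("AGAUGG", "124611"),
    ("AAAACC", "512261"),
    ("CCCGGG", "777124"),
    ("AGAAGA", "331246"),
]
print(presence_frequency_and_variance("AGA", toy_pairs))  # (0.5, 0.25)
```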
- Step S430 Determine the original sequence feature set according to the variance of each k-mer subsequence.
- The k-mer subsequences that meet a preset condition can be determined according to the variance of each k-mer subsequence, and the original sequence feature set is composed of the k-mer subsequences that meet the preset condition.
- For example, all 3-mer and 4-mer subsequences of the RNA sequences and all 3-mer and 4-mer subsequences of the protein sequences can be sorted by variance, for example in descending order, and the top-ranked k-mer subsequences can be selected to constitute the original sequence feature set.
- the top 560 k-mer subsequences can be selected, and these 560 k-mer subsequences form the original sequence feature set.
- For example, the set can include the top 60 RNA 3-mer subsequences, the top 200 RNA 4-mer subsequences, the top 200 protein 3-mer subsequences and the top 100 protein 4-mer subsequences.
- the number of selected k-mer subsequences is only illustrative, and any number of k-mer subsequences can be selected according to actual needs.
- a variance threshold may also be preset, and k-mer subsequences with a variance greater than the threshold are screened out, and the filtered k-mer subsequences form the original sequence feature set.
- For example, if the preset variance threshold is 3, the k-mer subsequences with a variance greater than 3 can be selected to form the original sequence feature set.
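- Both selection rules, top-ranked and threshold-based, can be sketched as follows. The variance values are made up for illustration, and the function names are hypothetical.

```python
def select_top(variances, top_n):
    """Keep the top_n k-mers after sorting by variance in descending order."""
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:top_n]


def select_above(variances, threshold):
    """Keep every k-mer whose variance is strictly greater than the threshold."""
    return [kmer for kmer, value in variances.items() if value > threshold]


# Hypothetical variances for a handful of k-mer subsequences.
toy_variances = {"AAA": 4.2, "AUC": 0.3, "111": 3.5, "1122": 2.9}

top_two = select_top(toy_variances, 2)
above_three = select_above(toy_variances, 3)
print(top_two)      # ['AAA', '111']
print(above_three)  # ['AAA', '111']
```

In the 560-feature example above, `select_top` would be applied separately to each of the four groups (RNA 3-mers, RNA 4-mers, protein 3-mers, protein 4-mers) with its own quota.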
- The average number of occurrences of each k-mer subsequence per RNA-protein pair can also be calculated, the variance of each k-mer subsequence calculated from that average, and the original sequence feature set then determined according to the variance of each k-mer subsequence.
- the number of occurrences of each 3-mer subsequence or 4-mer subsequence in each RNA-protein pair can be determined.
- The occurrences of each 3-mer or 4-mer subsequence in the individual RNA-protein pairs can be summed to obtain the total number of occurrences of the subsequence in the original data set.
- According to the total number of occurrences, the average number of occurrences of each 3-mer or 4-mer subsequence per RNA-protein pair can be calculated.
- the variance of each subsequence can be calculated from the average number of occurrences of each 3-mer subsequence or 4-mer subsequence in each RNA-protein pair and the number of occurrences in each RNA-protein pair.
- the subsequence may be a 3-mer subsequence of an RNA sequence or a protein sequence, or a 4-mer subsequence of an RNA sequence or a protein sequence.
- The total number of occurrences of the subsequence in the RPI1807 data set can be counted first. For example, the n RNA-protein pairs (n = 3243) in the RPI1807 data set can be traversed, and the numbers of occurrences of the subsequence in the individual RNA-protein pairs are obtained as $x_1, x_2, \ldots, x_n$.
- Summing these values gives the total number of occurrences of the subsequence in the RPI1807 data set, which is recorded as $\mathrm{num}_i$.
- The average number of occurrences of the subsequence per RNA-protein pair can be calculated from the total number of occurrences $\mathrm{num}_i$, that is, according to: $m_i = \mathrm{num}_i / n$
- The variance of the i-th k-mer subsequence can be calculated from its average number of occurrences per RNA-protein pair and its number of occurrences in each RNA-protein pair, that is, according to: $\sigma_i^2 = \frac{1}{n}\left[(x_1 - m_i)^2 + (x_2 - m_i)^2 + \cdots + (x_n - m_i)^2\right]$
- where n is the number of RNA-protein pairs in the RPI1807 data set,
- $m_i$ is the average number of occurrences of the subsequence per RNA-protein pair,
- $x_1$ is the number of occurrences of the subsequence in the first RNA-protein pair,
- $x_2$ is the number of occurrences of the subsequence in the second RNA-protein pair, and
- $x_n$ is the number of occurrences of the subsequence in the n-th RNA-protein pair.
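- The count-based variant can be sketched as follows: the mean is the total count divided by the number of pairs, and the variance is taken over the per-pair counts. Dividing the squared deviations by n is an assumption, and counting overlapping occurrences is also an assumption; the function names and toy data are hypothetical.

```python
def count_occurrences(subsequence, text):
    """Count occurrences of subsequence in text, overlaps included."""
    last_start = len(text) - len(subsequence)
    return sum(
        1 for i in range(last_start + 1) if text.startswith(subsequence, i)
    )


def count_mean_and_variance(subsequence, pairs):
    """Mean m_i and variance of the per-pair counts x_1..x_n."""
    counts = [
        count_occurrences(subsequence, rna)
        + count_occurrences(subsequence, protein)
        for rna, protein in pairs
    ]
    n = len(counts)
    total = sum(counts)   # num_i
    mean = total / n      # m_i
    # Assumption: population variance (divide by n, not n - 1).
    variance = sum((x - mean) ** 2 for x in counts) / n
    return mean, variance


toy_pairs = [("AAAA", "111"), ("AUAU", "123"), ("CCCC", "777"), ("AAAC", "456")]
print(count_mean_and_variance("AAA", toy_pairs))  # (0.75, 0.6875)
```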
- All k-mer subsequences can be sorted by variance, for example in descending order, and the top-ranked k-mer subsequences can be selected to form the original sequence feature set.
- a variance threshold may also be preset, and k-mer subsequences with a variance greater than the threshold are screened out, and the filtered k-mer subsequences form the original sequence feature set.
- the k-mer feature of each RNA-protein pair in the original data set may be extracted, and the extracted k-mer feature of the RNA sequence and the k-mer feature of the protein sequence form the original sequence feature set.
- the k-mer feature may include monomer component information (that is, each base contained) and sequence order information of the RNA sequence. Therefore, using the k-mer feature can better describe an RNA sequence, that is, an RNA sequence can be more accurately determined according to the k-mer feature, and different RNA sequences can also be distinguished by the k-mer feature.
- the frequent itemset features of each RNA-protein pair in the original data set can also be extracted, and the original sequence feature set is composed of the extracted frequent itemset features.
- The frequent itemset feature can combine the k-mer feature of the RNA sequence with the k-mer feature of the protein sequence. Therefore, using frequent itemset features can better distinguish interacting from non-interacting RNA-protein pairs. It is also possible to extract k-mer features and frequent itemset features at the same time and combine them to form the original sequence feature set. Combining the two kinds of features makes it possible to predict more accurately the interaction between RNA and protein in unknown RNA-protein pairs, which is not specifically limited in this disclosure.
- The frequent itemset feature refers to a k-mer subsequence pair, composed of an RNA k-mer subsequence and a protein k-mer subsequence, that has a certain degree of support in the original data set. For two items A and B, the support is the proportion of records in which A and B occur together.
- For example, (AAU, 137) denotes a 3-mer subsequence pair composed of the RNA 3-mer subsequence AAU and the protein 3-mer subsequence 137.
- The support of this subsequence pair is the ratio of the number of RNA-protein pairs containing both subsequences AAU and 137 to the number of all RNA-protein pairs in the original data set.
- the frequent itemset features of each RNA-protein pair in the original data set can be extracted according to steps S510 to S550 to obtain the original sequence feature set.
- Step S510 Convert the RNA sequence and the protein sequence in each RNA-protein pair into k-mer subsequences respectively, and form a first candidate item set from the k-mer subsequences, the k-mer subsequences including RNA k-mer subsequences and protein k-mer subsequences.
- the RNA sequence and protein sequence of each RNA-protein pair in the RPI1807 data set can be converted into 3-mer subsequences and 4-mer subsequences respectively.
- By traversing the RPI1807 data set, all RNA 3-mer subsequences, RNA 4-mer subsequences, protein 3-mer subsequences and protein 4-mer subsequences in the data set can be found, and all 3-mer subsequences and 4-mer subsequences in the data set form the first candidate item set C1.
- Step S520 Count the frequency of occurrence of each k-mer subsequence in the first candidate item set in the original data set, and form a frequent itemset from k-mer subsequences satisfying a preset frequency threshold.
- the occurrence frequency of each k-mer subsequence in the first candidate item set C1 in the original data set can be counted.
- the subsequence may be a 3-mer subsequence of an RNA sequence or a protein sequence, or may be a 4-mer subsequence of an RNA sequence or a protein sequence.
- For example, the occurrence frequency in the RPI1807 data set of each 3-mer or 4-mer subsequence in the first candidate item set C1 can be calculated.
- All 3-mer subsequences and 4-mer subsequences can then be screened according to preset occurrence frequency thresholds. For example, the RNA 3-mer subsequences with a frequency greater than a first threshold, the RNA 4-mer subsequences with a frequency greater than a second threshold, the protein 3-mer subsequences with a frequency greater than a third threshold, and the protein 4-mer subsequences with a frequency greater than a fourth threshold can be selected, and together they constitute the frequent itemset L1.
- the first threshold, the second threshold, the third threshold and the fourth threshold may be the same or different, which is not specifically limited in the present disclosure.
- The occurrence frequencies of the 3-mer and 4-mer subsequences of the RNA sequences and of the protein sequences can also be sorted in descending order, with the top-ranked subsequences forming the frequent itemset L1, which is not specifically limited in this disclosure.
- Step S530 Cross-combining the RNA k-mer subsequences and protein k-mer subsequences in the frequent itemset, and forming a second candidate item set from the combined k-mer subsequence pairs.
- For example, the RNA 3-mer subsequences and protein 3-mer subsequences, and the RNA 4-mer subsequences and protein 4-mer subsequences, in the frequent itemset L1 can be cross-combined in pairs to obtain a variety of 3-mer subsequence pairs and 4-mer subsequence pairs, and the second candidate item set C2 is composed of the subsequence pairs obtained by the combination.
- For example, the 3-mer subsequence pairs "AUC_137" and "AUC_123" can be obtained by cross-combining RNA 3-mer subsequences with protein 3-mer subsequences, and the 4-mer subsequence pairs "AAUU_1737", "AAUU_1234", "AGUC_1737" and "AGUC_1234" can be obtained by cross-combining RNA 4-mer subsequences with protein 4-mer subsequences.
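- The cross-combination in step S530 can be sketched as follows. The contents of the frequent itemset L1 used here are an assumption inferred from the example pairs above, and the function name `cross_combine` is hypothetical.

```python
from itertools import product


def cross_combine(rna_kmers, protein_kmers):
    """Pair every RNA k-mer with every protein k-mer of the same length k."""
    return [
        f"{rna_kmer}_{protein_kmer}"
        for rna_kmer, protein_kmer in product(rna_kmers, protein_kmers)
        if len(rna_kmer) == len(protein_kmer)
    ]


# Assumed content of L1, chosen to reproduce the pairs named in the text.
l1_rna = ["AUC", "AAUU", "AGUC"]
l1_protein = ["137", "123", "1737", "1234"]

c2 = cross_combine(l1_rna, l1_protein)
print(c2)
# ['AUC_137', 'AUC_123', 'AAUU_1737', 'AAUU_1234', 'AGUC_1737', 'AGUC_1234']
```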
- Step S540 Counting the occurrence frequency of each k-mer subsequence pair in the second candidate item set in the original data set to obtain the support degree of each k-mer subsequence pair.
- the occurrence frequency of each subsequence pair in the second candidate item set C2 in the RPI1807 data set can be counted.
- the subsequence pair may be a 3-mer subsequence pair, or a 4-mer subsequence pair.
- The number of occurrences of the subsequence pair in the RPI1807 data set can be counted first. For example, the N RNA-protein pairs in the RPI1807 data set can be traversed; if the subsequence pair appears in the current RNA-protein pair, the count is increased by 1, and otherwise the count remains unchanged.
- The number of occurrences of the f-th kind of k-mer subsequence pair in the RPI1807 data set is recorded as $\mathrm{num}_f$, and the occurrence frequency of the subsequence pair in the RPI1807 data set, that is, its support $\mathrm{support}_f$, can then be calculated as: $\mathrm{support}_f = \mathrm{num}_f / N$
- the support degree of each subsequence pair in the second candidate item set C2 can be calculated.
- Step S550 Composing the original sequence feature set from the k-mer subsequence pairs whose support meets the preset condition.
- the subsequence pair satisfying the preset condition can be determined according to the support degree of each subsequence pair, and the original sequence feature set is composed of the subsequence pair satisfying the preset condition.
- a support threshold may be preset, and subsequence pairs whose support is greater than the threshold are screened out, and the screened subsequence pairs form the original sequence feature set.
- For example, if there are 370 subsequence pairs whose support is greater than the threshold, these 370 subsequence pairs are 370 frequent itemset features, and the original sequence feature set can be composed of these 370 frequent itemset features.
- All subsequence pairs in the second candidate item set C2 can also be sorted in descending order of support, and the top-ranked subsequence pairs selected to form the original sequence feature set, which is not specifically limited in this disclosure.
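- Steps S540 and S550 can be sketched as follows: the support of a subsequence pair is the fraction of RNA-protein pairs that contain both of its subsequences, and pairs above a threshold are kept. The toy data set, the threshold value and the function names are hypothetical.

```python
def pair_support(pair, dataset):
    """Fraction of RNA-protein pairs containing both k-mers of the pair."""
    rna_kmer, protein_kmer = pair.split("_")
    hits = sum(
        1
        for rna, protein in dataset
        if rna_kmer in rna and protein_kmer in protein
    )
    return hits / len(dataset)


def frequent_pairs(candidates, dataset, min_support):
    """Keep the candidate pairs whose support exceeds min_support."""
    return [
        pair for pair in candidates if pair_support(pair, dataset) > min_support
    ]


# Toy stand-in for the data set: (RNA sequence, recoded protein sequence).
toy_dataset = [
    ("AUCUGAAAU", "512261312"),
    ("AAUCUG", "122137"),
    ("GGGCCC", "777777"),
    ("CUGCUG", "461223"),
]
candidates = ["CUG_122", "AUC_137", "GGG_777"]

print(pair_support("CUG_122", toy_dataset))              # 0.75
print(frequent_pairs(candidates, toy_dataset, 0.5))      # ['CUG_122']
```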
- the RNA sequence and the protein sequence in each RNA-protein pair can be respectively converted into k-mer subsequences to obtain k-mer subsequence pairs.
- the frequency of occurrence of each k-mer subsequence pair in the original data set is counted, and the k-mer subsequence pair that meets the preset condition of the frequency of occurrence is used as the frequent itemset feature and constitutes the original sequence feature set.
- the RNA sequences and protein sequences of all positive RNA-protein pairs in the RPI1807 dataset can be converted into positive 3-mer subsequences and positive 4-mer subsequences, respectively.
- the RNA sequences and protein sequences of all negative RNA-protein pairs in this dataset are converted into negative 3-mer subsequences and negative 4-mer subsequences, respectively.
- The RNA 3-mer subsequences and protein 3-mer subsequences, and the RNA 4-mer subsequences and protein 4-mer subsequences, in the data set are then combined in pairs to obtain a variety of 3-mer subsequence pairs and 4-mer subsequence pairs.
- Exemplarily, positive RNA 3-mer subsequences and positive protein 3-mer subsequences can be cross-combined to obtain positive 3-mer subsequence pairs.
- Negative RNA 3-mer subsequences and negative protein 3-mer subsequences can be cross-combined to obtain negative 3-mer subsequence pairs.
- Positive RNA 4-mer subsequences and positive protein 4-mer subsequences can be cross-combined to obtain positive 4-mer subsequence pairs.
- Negative RNA 4-mer subsequences and negative protein 4-mer subsequences can be cross-combined to obtain negative 4-mer subsequence pairs.
- The occurrence frequency of each subsequence pair in the data set can be counted. For example, for any positive 3-mer subsequence pair, the frequency can be calculated as: $\mathrm{frequency} = \mathrm{num} / \mathrm{NUM}$
- where num is the number of occurrences of the positive 3-mer subsequence pair in the data set, and
- NUM is the total number of occurrences of all positive 3-mer subsequence pairs in the data set.
- All 3-mer subsequence pairs and 4-mer subsequence pairs can be sorted by occurrence frequency, for example in descending order, and the top-ranked k-mer subsequence pairs can be selected to form frequent itemsets. For example, all the positive 3-mer subsequence pairs are sorted in descending order, and the first m 3-mer subsequence pairs can be selected to form the frequent itemset A1. All the positive 4-mer subsequence pairs are sorted in descending order, and the first n 4-mer subsequence pairs can be selected to form the frequent itemset A2.
- All the negative 3-mer subsequence pairs are sorted in descending order, and the first p 3-mer subsequence pairs can be selected to form the frequent itemset A3. All negative 4-mer subsequence pairs are sorted in descending order, and the first q 4-mer subsequence pairs can be selected to form frequent itemsets A4. Then the original sequence feature set is composed of these four frequent itemsets A1, A2, A3 and A4.
- An occurrence frequency threshold can also be preset, the k-mer subsequence pairs whose occurrence frequency is greater than the threshold screened out, and the screened k-mer subsequence pairs used as frequent itemset features to form the original sequence feature set, which is not specifically limited in the present disclosure.
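- The class-wise frequency num / NUM and the top-m selection can be sketched as follows. Forming the subsequence pairs within each RNA-protein pair and reading k-mers with overlap are assumptions, since the text does not fix either choice; the toy data and function names are hypothetical.

```python
from collections import Counter
from itertools import product


def overlapping_kmers(sequence, k):
    """Return all overlapping k-mers of a sequence, in reading order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def pair_frequencies(pairs, k):
    """Frequency num / NUM of every k-mer subsequence pair in one class."""
    counts = Counter()
    for rna, protein in pairs:
        for rna_kmer, protein_kmer in product(
            overlapping_kmers(rna, k), overlapping_kmers(protein, k)
        ):
            counts[f"{rna_kmer}_{protein_kmer}"] += 1
    total = sum(counts.values())  # NUM: occurrences of all pairs in the class
    return {pair: num / total for pair, num in counts.items()}


def top_pairs(frequencies, m):
    """First m subsequence pairs after sorting by frequency, descending."""
    return sorted(frequencies, key=frequencies.get, reverse=True)[:m]


# Toy positive class: (RNA sequence, recoded protein sequence).
positive_pairs = [("AUCA", "1234"), ("AUCG", "1237")]
positive_3mer_freq = pair_frequencies(positive_pairs, 3)
frequent_itemset_a1 = top_pairs(positive_3mer_freq, 1)

print(positive_3mer_freq["AUC_123"])  # 0.25
print(frequent_itemset_a1)            # ['AUC_123']
```

The same two functions would be applied to the negative pairs and to k = 4 to obtain A2, A3 and A4.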
- The frequent itemset feature can combine the k-mer feature of the RNA sequence with the k-mer feature of the protein sequence, and frequent itemset features can better distinguish interacting from non-interacting RNA-protein pairs. Therefore, when the original sequence feature set is composed of frequent itemset features and feature extraction is performed on the RNA-protein pair to be predicted based on that set, the extracted sequence features can be used to determine more accurately whether the RNA-protein pair to be predicted interacts.
- Step S320 Determine the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set.
- The RNA sequence and the protein sequence in the RNA-protein pair to be predicted can each be converted into k-mer subsequences; after the original sequence feature set is obtained, each k-mer subsequence can be searched for in the original sequence feature set, and the sequence features of the RNA-protein pair to be predicted are obtained according to the search results.
- the sequence feature of the RNA-protein pair to be predicted may refer to a complete sequence feature composed of RNA sequence features and protein sequence features.
- the original sequence feature set can be composed of 560 kinds of k-mer subsequences, for example, the 560 kinds of k-mer subsequences can be [CCC, ..., AGU, CCCC, ..., CUGG, 777, ..., 373, 7774 , ..., 7571].
- the feature calculation can be performed on the RNA sequence and the protein sequence in the RNA-protein pair to be predicted based on the original sequence feature set to obtain the sequence feature of the RNA-protein pair to be predicted.
- In the original sequence feature set, the feature of each feature dimension corresponds to one kind of k-mer subsequence.
- For example, the subsequence CCC is the feature of the first feature dimension, and the subsequence 7571 is the feature of the 560th feature dimension. All 3-mer and 4-mer subsequences of the RNA-protein pair to be predicted can be searched for in the original sequence feature set, and whether the feature of each feature dimension is present is determined according to the search results.
- If the feature is present, the feature value on that feature dimension is 1; if it is not, the feature value is 0.
- For example, suppose the RNA sequence in the RNA-protein pair to be predicted is AAAACCCGGG.
- The feature CCC of the first feature dimension in the original sequence feature set is also a 3-mer subsequence of this RNA sequence. Therefore, the feature CCC on the first feature dimension is present, and the corresponding feature value can be recorded as 1.
- The feature AGU is not a subsequence of this RNA sequence, so the feature value on the corresponding feature dimension can be recorded as 0.
- By analogy, a feature-value vector [1, 0, ..., 0, 1, ...] can be calculated, and this vector is the sequence feature of the RNA-protein pair to be predicted. It can be understood that the feature values contained in the vector are in one-to-one correspondence with the feature dimensions of the original sequence feature set.
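- The lookup that turns an RNA-protein pair into a binary feature vector can be sketched as follows. The five-entry feature set is a shortened, hypothetical stand-in for the 560 features, and the recoded protein sequence 124611 is chosen only for illustration.

```python
def overlapping_kmers(sequence, k):
    """Return the set of overlapping k-mers contained in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def kmer_features(rna, protein_classes, feature_set, ks=(3, 4)):
    """One 0/1 value per feature: 1 if the k-mer occurs in the pair."""
    present = set()
    for k in ks:
        present |= overlapping_kmers(rna, k)
        present |= overlapping_kmers(protein_classes, k)
    return [1 if feature in present else 0 for feature in feature_set]


# Hypothetical, shortened feature set (the text uses 560 entries).
toy_feature_set = ["CCC", "AGU", "CCCG", "777", "7571"]

vector = kmer_features("AAAACCCGGG", "124611", toy_feature_set)
print(vector)  # [1, 0, 1, 0, 0]
```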
- When the original sequence feature set is composed of the extracted k-mer features of the RNA sequences and the k-mer features of the protein sequences, feature extraction is performed on the RNA-protein pair to be predicted based on that set to obtain the sequence features of the RNA-protein pair to be predicted.
- The k-mer feature can include the monomer component information (that is, each base contained) and the sequence order information of the RNA sequence. Therefore, the k-mer feature can better describe an RNA sequence: an RNA sequence can be determined more accurately from its k-mer features, and different RNA sequences can also be distinguished by them.
- The RNA sequence and the protein sequence in the RNA-protein pair to be predicted can each be converted into k-mer subsequences, and the RNA k-mer subsequences and protein k-mer subsequences are cross-combined to obtain a variety of RNA-protein k-mer subsequence pairs.
- For example, the RNA 3-mer subsequences and protein 3-mer subsequences, and the RNA 4-mer subsequences and protein 4-mer subsequences, can be cross-combined in pairs to obtain a variety of 3-mer subsequence pairs and 4-mer subsequence pairs.
- Each RNA-protein k-mer subsequence pair can be searched in the original sequence feature set, and the sequence features of the RNA-protein pair can be obtained according to the search results.
- the original sequence feature set may be composed of 370 frequent itemset features, for example, the 370 frequent itemset features may be [CUG_122, AAU_122, ..., CUUU_1312, UCUG_1312, ...].
- RNA 3-mer subsequences and protein 3-mer subsequences, RNA 4-mer subsequences and protein 4-mer subsequences can be paired to obtain a variety of 3-mer subsequence pairs and 4-mer subsequence pairs. Then, feature calculation can be performed on the RNA sequence and protein sequence in the RNA-protein pair to be predicted based on the original sequence feature set to obtain the sequence feature of the RNA-protein pair.
- the feature of each feature dimension corresponds to a kind of k-mer subsequence pair.
- the subsequence pair CUG_122 is a feature of the first feature dimension. All subsequence pairs of the to-be-predicted RNA-protein pair can be searched in the original sequence feature set, and it is determined whether the features on each feature dimension in the original sequence feature set exist according to the search results. If it exists, the feature value on the feature dimension is 1, and if it does not exist, it is 0.
- For example, suppose the RNA sequence in the RNA-protein pair to be predicted is AUCUGAAAU and the recoded protein sequence is 512261312. The subsequence pairs CUG_122, AAU_122 and UCUG_1312 of this RNA-protein pair are present in the original sequence feature set, so the feature values on the corresponding feature dimensions can be recorded as 1.
- By analogy, a 370-dimensional feature-value vector [1, 1, ..., 0, 1, ...] can be calculated, and this vector is the sequence feature of the RNA-protein pair to be predicted.
- The feature values contained in the vector are in one-to-one correspondence with the feature dimensions of the original sequence feature set.
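- The frequent itemset lookup can be sketched as follows, taking AUCUGAAAU as the RNA sequence and 512261312 as the recoded protein sequence. The four-entry feature set is a shortened stand-in for the 370 features, and the function name `pair_features` is hypothetical.

```python
def pair_features(rna, protein_classes, feature_set):
    """One 0/1 value per frequent itemset feature of the form RNA_PROTEIN.

    A feature is present when its RNA k-mer occurs in the RNA sequence and
    its protein k-mer occurs in the recoded protein sequence.
    """
    vector = []
    for feature in feature_set:
        rna_kmer, protein_kmer = feature.split("_")
        is_present = rna_kmer in rna and protein_kmer in protein_classes
        vector.append(1 if is_present else 0)
    return vector


# Shortened feature set using the entries named in the text.
toy_feature_set = ["CUG_122", "AAU_122", "CUUU_1312", "UCUG_1312"]

vector = pair_features("AUCUGAAAU", "512261312", toy_feature_set)
print(vector)  # [1, 1, 0, 1]
```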
- The RNA sequence and the protein sequence in the RNA-protein pair to be predicted can each be converted into k-mer subsequences, and each k-mer subsequence can be searched for in the original sequence feature set to obtain the first sequence feature. Then, the RNA k-mer subsequences and protein k-mer subsequences can be combined to obtain multiple RNA-protein k-mer subsequence pairs, and each RNA-protein k-mer subsequence pair can be searched for in the original sequence feature set to obtain the second sequence feature. Finally, the sequence feature of the RNA-protein pair to be predicted can be composed of the first sequence feature and the second sequence feature.
- For example, the original sequence feature set may include two feature subsets, which contain the 560 k-mer subsequences [CCC, ..., CCCC, ..., 777, ..., 7774, ...] and the 370 frequent itemset features [CUG_122, AAU_122, ..., CUUU_1312, UCUG_1312, ...] respectively.
- The RNA-protein pair to be predicted can be converted into RNA 3-mer subsequences, RNA 4-mer subsequences, protein 3-mer subsequences and protein 4-mer subsequences.
- The RNA 3-mer subsequences can also be paired with the protein 3-mer subsequences, and the RNA 4-mer subsequences with the protein 4-mer subsequences, to obtain various 3-mer subsequence pairs and 4-mer subsequence pairs.
- the feature calculation can be performed on the RNA sequence and the protein sequence in the RNA-protein pair to be predicted according to the original sequence feature set to obtain the sequence feature of the RNA-protein pair to be predicted.
- all subsequences and subsequence pairs of the RNA-protein pair to be predicted can be searched in the original sequence feature set, and it is determined whether the features on each feature dimension in the original sequence feature set exist according to the search results.
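- In this way, a 560-dimensional feature-value vector [1, ..., 1, ..., 1, ..., 0, ...] can be obtained for the k-mer subsequence features, together with a 370-dimensional feature-value vector for the frequent itemset features.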
- A 930-dimensional feature-value vector can be obtained by concatenating the two feature-value vectors, and this vector is the sequence feature of the RNA-protein pair to be predicted. It is also possible to input the two vectors directly into the interaction prediction model at the same time, which is not specifically limited in this disclosure.
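- The concatenation can be sketched as follows. The two input vectors are placeholders of the stated sizes rather than real features, and the function name is hypothetical.

```python
def concatenate_features(kmer_vector, pair_vector):
    """Join the k-mer feature vector and the frequent itemset feature vector."""
    return list(kmer_vector) + list(pair_vector)


# Placeholder vectors with the dimensions given in the text.
kmer_vector = [0] * 560
pair_vector = [1] * 370

sequence_feature = concatenate_features(kmer_vector, pair_vector)
print(len(sequence_feature))  # 930
```

Keeping the k-mer block first means that dimension 561 onward always refers to frequent itemset features, which matches the ordering of the 930-entry feature set described next.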
- a 930-dimensional original sequence feature set can also be composed of 560 k-mer subsequences and 370 frequent itemset features.
- the original sequence feature set is [CCC, ..., CCCC, ..., 777, ..., 7774, ..., CCA_121, ..., UCUG_1312, ..., AAU_122, ..., CUUU_1312, ...].
- All subsequences and subsequence pairs of the RNA-protein pair can be searched for in the original sequence feature set, and whether the feature of each feature dimension is present is determined according to the search results, giving a 930-dimensional feature-value vector [1, ..., 1, ..., 1, ..., 0, ..., 1, ..., 1, ..., 1, ..., 0, ...]. This vector is the sequence feature of the RNA-protein pair to be predicted.
- the frequent itemset features of each RNA-protein pair in the original data set can also be extracted, and the original sequence feature set is composed of the extracted frequent itemset features.
- The frequent itemset feature can combine the k-mer feature of the RNA sequence with the k-mer feature of the protein sequence. Therefore, using frequent itemset features can better distinguish interacting from non-interacting RNA-protein pairs. It is also possible to extract k-mer features and frequent itemset features at the same time and combine them to form the original sequence feature set. Combining the two kinds of features makes it possible to predict more accurately the interactions between RNA and proteins in unknown RNA-protein pairs.
- In step S230, the RNA-protein pair to be predicted is vectorized to obtain the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted.
- the obtained sequence features can be used as the input of the first interaction prediction model.
- The RNA-protein pair to be predicted can also be vectorized, and the obtained vectors can be used as the input of at least one second interaction prediction model.
- the RNA sequence and the protein sequence in the RNA-protein pair to be predicted can be converted into k-mer subsequences respectively.
- The RNA sequence can be divided, without overlap, into M RNA k-mer subsequences.
- The protein sequence can be divided, without overlap, into N protein k-mer subsequences.
- For example, if the RNA sequence is AUCUGAAAU, it can be divided into three RNA 3-mer subsequences, namely AUC, UGA and AAU.
- Dividing the RNA sequence and the protein sequence without overlap into multiple k-mer subsequences serves to vectorize them, that is, the bases in the RNA sequence and the amino acids in the protein sequence are vectorized in groups of k.
- Each base contained in the RNA sequence of the RNA-protein pair can also be vectorized to obtain multiple base vectors, and the base vectors can be spliced to obtain the representation vector of the RNA sequence.
- each amino acid contained in the protein sequence in the RNA-protein pair can be vectorized to obtain multiple amino acid vectors, and multiple amino acid vectors can be spliced to obtain the representation vector of the protein sequence.
- The RNA sequence can also be divided, with overlap, into P RNA k-mer subsequences, and the protein sequence divided, with overlap, into Q protein k-mer subsequences, which is not specifically limited in this disclosure. The RNA sequence representation vector and the protein sequence representation vector can then be respectively input into the second interaction prediction model.
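- The non-overlapping and overlapping divisions can be sketched as follows. Dropping a trailing remainder shorter than k in the non-overlapping case is an assumption, since the text does not say how such a remainder is handled; the function names are hypothetical.

```python
def split_non_overlapping(sequence, k):
    """Split a sequence into consecutive k-mers that share no characters.

    Assumption: a trailing remainder shorter than k is dropped.
    """
    usable_length = len(sequence) - len(sequence) % k
    return [sequence[i:i + k] for i in range(0, usable_length, k)]


def split_overlapping(sequence, k):
    """Split a sequence into k-mers that advance one character at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


rna = "AUCUGAAAU"
print(split_non_overlapping(rna, 3))   # ['AUC', 'UGA', 'AAU']
print(len(split_overlapping(rna, 3)))  # 7
```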
- each k-mer subsequence of the RNA sequence and the protein sequence can be encoded first.
- Each RNA 3-mer subsequence and protein 3-mer subsequence can be encoded in turn by Embedding (vector mapping), so that each 3-mer subsequence is represented by a low-dimensional vector and a corresponding vector is obtained for each of the multiple 3-mer subsequences.
- For example, each 3-mer subsequence can be One-Hot encoded. One-Hot encoding is also called one-bit-effective encoding.
- The method uses an N-bit status register to encode N states: each state has its own register bit, and at any time only one bit in the register is valid.
- For example, for the i-th of the 64 kinds of RNA 3-mer subsequences, a 64-dimensional One-Hot vector can be obtained by encoding, in which the i-th element is set to 1 and all other elements are set to 0, such as [0, 1, 0, 0, ..., 0].
- Similarly, for the j-th of the 343 kinds of protein 3-mer subsequences, a 343-dimensional One-Hot vector can be obtained by encoding, in which the j-th element is set to 1 and all other elements are set to 0.
- each RNA 3-mer subsequence and protein 3-mer subsequence can correspond to a 3-mer One-Hot vector.
- dense vectors can also be used to represent each 3-mer subsequence.
- the Word2vec algorithm can be used to map each 3-mer subsequence into a vector space, and each 3-mer subsequence can be represented by a subsequence vector in the vector space.
- RNA sequence data can also be obtained and used to train a BERT pre-training model. After the training is completed, an RNA sequence can be input into the trained model to obtain the high-dimensional vector of the RNA sequence, which is not specifically limited in the present disclosure.
- the RNA sequence and the protein sequence in the RNA-protein pair to be predicted can be converted into 3-mer subsequences respectively, for example to obtain M RNA 3-mer subsequences and N protein 3-mer subsequences.
- the M 3-mer One-Hot vectors corresponding to the M RNA 3-mer subsequences can be determined by lookup and spliced in sequence, for example in the row direction, to obtain a two-dimensional M×64 matrix.
- the two-dimensional matrix is the 3-mer One-Hot representation vector of the RNA sequence.
- the N 3-mer One-Hot vectors corresponding to the N protein 3-mer subsequences can also be determined by lookup and spliced in sequence in the row direction to obtain a two-dimensional N×343 matrix, which is the 3-mer One-Hot representation vector of the protein sequence. It is understandable that the M 3-mer One-Hot vectors or the N 3-mer One-Hot vectors can also be spliced in the column direction, or spliced directly (that is, end to end) to obtain a serialized 3-mer One-Hot vector, which is not specifically limited in the present disclosure.
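The construction of the M×64 One-Hot matrix for an RNA sequence can be sketched as below. The base ordering `ACGU` and the resulting index assigned to each 3-mer are assumptions made for illustration; the text does not specify which position corresponds to which subsequence. The protein case would be analogous with a 343-entry vocabulary, which is not reproduced here because the underlying alphabet is not given in this passage.

```python
from itertools import product

RNA_BASES = "ACGU"  # ordering is an assumption of this sketch
RNA_3MERS = ["".join(p) for p in product(RNA_BASES, repeat=3)]  # 4**3 = 64 entries
RNA_INDEX = {kmer: i for i, kmer in enumerate(RNA_3MERS)}


def one_hot(kmer, index):
    """Return a One-Hot list with a single 1 at the position of kmer."""
    vec = [0] * len(index)
    vec[index[kmer]] = 1
    return vec


def encode_sequence(kmers, index):
    """Stack the One-Hot vectors row by row: M k-mers give an M x len(index) matrix."""
    return [one_hot(kmer, index) for kmer in kmers]


# The three 3-mers of the example sequence AUCUGAAAU from the text.
matrix = encode_sequence(["AUC", "UGA", "AAU"], RNA_INDEX)
```

Here `matrix` has 3 rows and 64 columns, and each row contains exactly one 1. Column-wise or end-to-end splicing, also mentioned in the text, would amount to transposing or flattening this nested list.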
- the RNA sequence representation vector and the protein sequence representation vector can be used as the input of the deep learning model to further discover rare or new feature combinations and reveal the interactions between implicit features.
- in step S240, based on the sequence features of the RNA-protein pair to be predicted and the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, multiple interaction prediction models are used to respectively obtain multiple interaction prediction values of the RNA-protein pair to be predicted.
- the multiple interaction prediction models are obtained based on joint training.
- the multiple interaction prediction models may all be traditional machine learning models, or all may be deep learning models, or may include at least one traditional machine learning model and at least one deep learning model at the same time.
- Traditional machine learning models are limited in their ability to work with natural data in its raw form. For example, composing a pattern recognition or machine learning system requires expert knowledge to extract features from raw data (such as pixel values of an image) and convert them into an appropriate feature representation.
- the traditional machine learning model may include a linear regression model, a logistic regression model, a support vector machine model, a decision tree model, a K-nearest neighbor (K-Nearest Neighbor, KNN) model, a random forest model, a naive Bayesian model, etc.
- the deep learning model has the ability to automatically extract features, and can be composed of multiple processing layers to form a complex computing model, thereby automatically obtaining data representation and multiple levels of abstraction, which is a kind of learning for feature representation.
- the deep learning model may include a convolutional neural network model, a recurrent neural network model, and the like.
- the RNA-protein pair to be predicted need not be vectorized; instead, only the sequence features obtained by feature extraction of the RNA-protein pair to be predicted can be used as the input to each traditional machine learning model.
- feature extraction of the RNA-protein pair to be predicted need not be performed; instead, only the RNA sequence representation vector and the protein sequence representation vector obtained by vectorizing the RNA-protein pair to be predicted can be used as the input to each deep learning model.
- the sequence features of the RNA-protein pair can be input into at least one first interaction prediction model to obtain at least one first interaction prediction value.
- the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted can be input into at least one second interaction prediction model to obtain at least one second interaction prediction value.
- the at least one first interaction prediction model and the at least one second interaction prediction model are obtained based on joint learning, where joint learning refers to combining multiple sub-models into one model to complete the final target task.
- At least one first interaction prediction model and at least one second interaction prediction model may be combined, and a final prediction result may be obtained by fusing the outputs of each model.
- each model and the result of weighted summation can be considered at the same time, and the model parameters of each model can be optimized at the same time to obtain the best overall model, thereby improving the predictive ability of the overall model.
- the first interaction model may be a traditional machine learning model
- the second interaction model may be a deep learning model.
- the sequence features of the RNA-protein pair to be predicted can be input into at least one traditional machine learning model to obtain at least one first interaction prediction value.
- the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted can be input into at least one deep learning model to obtain at least one second interaction prediction value.
- each deep learning model can include at least two sub-deep learning models, which are respectively used to process the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted.
- the RNA sequence representation vector can be input into the first sub-deep learning model to obtain the first sequence feature.
- the protein sequence representation vector can be input into the second sub-deep learning model to obtain the second sequence feature.
- the first sequence feature and the second sequence feature can be fused through the fully connected layer of each deep learning model to obtain a second interaction prediction value based on the fused features.
- the traditional machine learning model may include at least one of an LR (Logistic Regression) model, an SVM model and a decision tree model.
- the deep learning model may include at least one of a CNN model and a recurrent neural network model, wherein the recurrent neural network model can be an LSTM model or a BiLSTM (Bi-directional LSTM, bidirectional long short-term memory network) model.
- a 930-dimensional original sequence feature set can be composed of 560 k-mer subsequences and 370 frequent itemset features. According to the original sequence feature set and the k-mer subsequences of the RNA-protein pair to be predicted, a 930-dimensional feature value vector can be calculated and used as the input of the LR model to obtain the interaction prediction value y_1.
- the RNA-protein pair to be predicted can also be vectorized to obtain the k-mer One-Hot vectors of the RNA sequence and the protein sequence respectively, which are used as the input of the CNN model and the BiLSTM model respectively to obtain the interaction prediction values y_2 and y_3.
- the LR model has a good memory ability, and can learn the correlation between sequences or features from data.
- the CNN model and the BiLSTM model have good generalization capabilities, and can discover combinations of features that appear rarely or new in the data, and then reveal the interaction between implicit features.
- the CNN model can better capture features but ignores the location information of the features, while the BiLSTM model has better memory ability, and can use the sequence information and location information of the data to make up for the defects of the CNN model in memory ability.
- the LR model is simple and has good interpretability.
- the joint learning of CNN model, BiLSTM model and LR model can enhance the interpretability of the overall model for RPI prediction.
- the present disclosure can effectively combine the characteristics of each interaction prediction model, thereby improving the prediction ability of the overall model.
- in step S250, the interaction between the RNA and the protein is determined according to the plurality of interaction prediction values.
- a weighted sum calculation can be performed on the multiple interaction prediction values, and the interaction between RNA and protein can be determined according to the calculation results. If the calculation result is greater than the preset interaction prediction value threshold, it can be determined that there is an interaction between the RNA and the protein. If the calculation result is less than or equal to the preset interaction prediction value threshold, it can be determined that there is no interaction between the RNA and the protein. By fusing the output values of each interaction prediction model to obtain the final prediction result, the interaction between RNA and protein in the RNA-protein pair to be predicted can be more accurately determined.
- when predicting the interaction of an RNA-protein pair, the interaction prediction value y_out of the RNA-protein pair to be predicted can be calculated as the weighted sum:
- y_out = α·y_1 + β·y_2 + γ·y_3
- y_1 is the output value of the logistic regression model
- y_2 is the output value of the convolutional neural network model
- y_3 is the output value of the recurrent neural network model
- α, β and γ are the weight parameters of the logistic regression model, the convolutional neural network model and the recurrent neural network model respectively, and y_out can be any value between 0 and 1.
- the interaction prediction value threshold can be preset as 0.5. When y_out > 0.5, the prediction result can be marked as 1, which means that the RNA-protein pair has an interaction. When y_out ≤ 0.5, the prediction result can be marked as 0, which means that the RNA-protein pair has no interaction.
- the weight parameters α, β and γ can be obtained based on joint learning and training.
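The fusion and thresholding step can be illustrated with a short sketch. It assumes the fused value is the plain weighted sum of the three model outputs described in the text; the function names and the numeric outputs and weights below are made up for illustration and are not trained values.

```python
def fuse_predictions(y1, y2, y3, alpha, beta, gamma):
    """Weighted sum of the three model outputs (LR, CNN, recurrent network)."""
    return alpha * y1 + beta * y2 + gamma * y3


def classify(y_out, threshold=0.5):
    """Return 1 (interaction) if y_out exceeds the threshold, otherwise 0."""
    return 1 if y_out > threshold else 0


# Illustrative values only.
y_out = fuse_predictions(0.9, 0.6, 0.7, alpha=0.2, beta=0.4, gamma=0.4)
label = classify(y_out)
```

A value exactly equal to the threshold is classified as 0, matching the "less than or equal to" rule in the text.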
- the multiple interaction prediction models and related parameters can be jointly trained in advance according to steps S610 to S650, so as to optimize all model parameters in each prediction model. The final model obtained by training is then used to make predictions for RNA-protein pairs whose interactions are unknown.
- Step S610 Obtain a training data set, which includes positive RNA-protein pairs and negative RNA-protein pairs.
- RNA-protein pairs in the original data set can be used as the training data set, and some RNA-protein pairs in the original data set can also be used as the training data set.
- for example, there are 3243 RNA-protein pairs in the original data set, including 1807 positive pairs and 1436 negative pairs. Exemplarily, 1200 positive examples and 1000 negative examples may be selected as the training data set. It can be understood that the number of RNA-protein pairs in the training data set is only illustrative, and any number of RNA-protein pairs can be obtained to train each interaction prediction model multiple times to improve the performance of each interaction prediction model.
- Positive RNA-protein pairs can be labeled, and the resulting label value is "1", which means that the RNA-protein pair has an interaction.
- Negative RNA-protein pairs can be labeled, and the resulting label value is "0", which means that the RNA-protein pair has no interaction.
- Step S620 Obtain multiple interaction prediction values for each RNA-protein pair in the training data set by using the multiple interaction prediction models.
- multiple interaction prediction models can be LR model, CNN model and BiLSTM model respectively.
- feature extraction is performed on each RNA-protein pair in the training data set, and the extracted sequence features can be sequentially input into the LR model.
- Each RNA-protein pair is vectorized, and the obtained RNA sequence representation vector and protein sequence representation vector can be input into the CNN model and the BiLSTM model.
- the training of each prediction model with the i-th RNA-protein pair is taken as an example.
- the sequence features of the RNA-protein pair can be input into the LR model, and the predicted value y_1 is output.
- the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair can be respectively input into two CNN sub-models, and the outputs of the two CNN sub-models are spliced to finally output a predicted value y_2.
- the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair can be input into the BiLSTM model, and the predicted value y_3 can be output. It can be understood that, for each RNA-protein pair in the training data set, three interaction prediction values can be obtained through the three interaction prediction models.
- Step S630 Obtain the joint prediction value of each RNA-protein pair in the training data set according to the plurality of interaction prediction values.
- the multiple interaction prediction values can be weighted and summed to obtain a joint prediction value for each RNA-protein pair. Still taking the i-th RNA-protein pair in the training data set as an example, the joint prediction value y_out can be calculated as:
- y_out = α·y_1 + β·y_2 + γ·y_3
- y_1 is the output value of the LR model
- y_2 is the output value of the CNN model
- y_3 is the output value of the BiLSTM model
- α, β and γ are the weight parameters of the LR model, the CNN model and the BiLSTM model respectively.
- Step S640 Use a loss function to calculate the corresponding loss value from the joint prediction value and the label value of each RNA-protein pair in the training data set.
- the i-th RNA-protein pair is positive data, and its corresponding label value is 1.
- the loss function can be evaluated on the joint prediction value y_out and the label value 1 of the RNA-protein pair to obtain the corresponding loss value.
- the cross-entropy loss function can be selected as the objective function.
- the cross-entropy loss function is a performance function in the prediction model, which can be used to measure the degree of inconsistency between the prediction value of the prediction model and the label value. The smaller the value of the calculated cross-entropy loss function, the better the prediction effect of the model.
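As a sketch of the loss computation, the binary form of the cross-entropy loss is assumed below, since the label values are 0 and 1; the text does not spell out the exact formula. The function name and the clamping constant `eps`, which guards against taking the logarithm of zero, are additions of this sketch.

```python
import math


def binary_cross_entropy(y_out, label, eps=1e-12):
    """Cross-entropy between a joint prediction in [0, 1] and a 0/1 label."""
    y = min(max(y_out, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(y) + (1 - label) * math.log(1.0 - y))


# For a positive pair (label 1), a prediction closer to 1 gives a smaller loss.
loss_confident = binary_cross_entropy(0.9, 1)
loss_uncertain = binary_cross_entropy(0.6, 1)
```

This matches the statement that a smaller loss value indicates a better prediction: for a positive pair, `loss_confident` is smaller than `loss_uncertain`.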
- Step S650 Adjust model parameters of the plurality of interaction prediction models according to the loss value.
- the model parameters of the multiple interaction prediction models can be iteratively updated based on the calculated loss values, and when the iteration termination condition is met, the training of the model parameters and each weight parameter of the multiple interaction prediction models is completed.
- the parameters may be updated using a stochastic gradient descent algorithm. According to the principle of backpropagation, the objective function such as cross-entropy loss function is continuously calculated, and the model parameters and weight parameters of each interaction prediction model are simultaneously updated according to the calculated loss value. When the objective function converges to the minimum value, the training of all model parameters is completed.
- the model parameters can also be updated iteratively through backpropagation, and when the preset number of iterations is reached, the training of all model parameters is completed.
- the preset number of iterations may be 20, and each interaction prediction model continuously updates its model parameters during the 20 backpropagation iterations.
- the optimized model parameters can be obtained.
- the least squares method, Adam optimization algorithm, etc. can also be used to minimize the objective function, and the model parameters and weight parameters are updated sequentially from the back to the front to optimize the parameters.
- the weight parameters α, β and γ are also part of the model parameters, and they are likewise trained and continuously optimized during the joint training of the multiple interaction prediction models.
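The idea of updating the sub-model parameters and the fusion weights at the same time from a single loss can be illustrated with a deliberately small sketch. Several simplifications are assumptions of this sketch and not part of the disclosure: each of the three sub-models is replaced by a one-input logistic unit standing in for the LR, CNN and BiLSTM models, the fusion weights are passed through a softmax so that the joint prediction stays between 0 and 1, full-batch gradient descent is used instead of stochastic gradient descent, and the toy data are invented. The 20 iterations follow the example number given in the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(u):
    m = max(u)
    e = [math.exp(v - m) for v in u]
    s = sum(e)
    return [v / s for v in e]


def forward(params, x):
    # One scalar input per stand-in sub-model (placeholders for LR, CNN, BiLSTM).
    ys = [sigmoid(w * xi + b) for (w, b), xi in zip(params["sub"], x)]
    weights = softmax(params["mix"])  # fusion weights alpha, beta, gamma
    y_out = sum(a * y for a, y in zip(weights, ys))
    return ys, weights, min(max(y_out, 1e-12), 1.0 - 1e-12)


def mean_loss(params, data):
    total = 0.0
    for x, t in data:
        y = forward(params, x)[2]
        total -= t * math.log(y) + (1 - t) * math.log(1.0 - y)
    return total / len(data)


def train_step(params, data, lr):
    """One gradient step that updates sub-model and fusion parameters together."""
    n = len(data)
    g_sub = [[0.0, 0.0] for _ in range(3)]
    g_mix = [0.0, 0.0, 0.0]
    for x, t in data:
        ys, a, y = forward(params, x)
        dy = (y - t) / (y * (1.0 - y)) / n      # d(mean loss) / d(y_out)
        g_a = [dy * yk for yk in ys]            # gradient w.r.t. each fusion weight
        dot = sum(g * ak for g, ak in zip(g_a, a))
        for k in range(3):
            g_mix[k] += a[k] * (g_a[k] - dot)   # back through the softmax
            dz = dy * a[k] * ys[k] * (1.0 - ys[k])  # back through the sigmoid
            g_sub[k][0] += dz * x[k]
            g_sub[k][1] += dz
    for k in range(3):
        w, b = params["sub"][k]
        params["sub"][k] = (w - lr * g_sub[k][0], b - lr * g_sub[k][1])
        params["mix"][k] -= lr * g_mix[k]


# Invented toy data: (inputs for the three stand-in models, label).
data = [((1.0, 0.8, 1.2), 1), ((-1.0, -0.7, -0.9), 0),
        ((0.9, 1.1, 0.6), 1), ((-1.2, -0.5, -1.1), 0)]
params = {"sub": [(0.0, 0.0) for _ in range(3)], "mix": [0.0, 0.0, 0.0]}

initial_loss = mean_loss(params, data)
for _ in range(20):  # preset number of iterations from the text's example
    train_step(params, data, lr=0.5)
final_loss = mean_loss(params, data)
```

The point of the sketch is that a single loss on the joint prediction drives the gradients of every sub-model and of the fusion weights in the same step. In a realistic setting an automatic differentiation framework would replace the hand-written gradients.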
- when multiple interaction prediction models are used to predict the interaction of unknown RNA-protein pairs, combining the logistic regression model with the deep learning models allows the logistic regression model to learn the correlation between the k-mer features and/or the frequent itemset features, while the deep learning models reveal the interaction between implicit features. By combining the characteristics of the CNN model and the BiLSTM model and optimizing the model parameters of all models at the same time during joint training, the accuracy of the overall model in predicting the interaction of unknown RNA-protein pairs is improved.
- when joint training is performed on multiple interaction prediction models in advance, the original data set may be divided into a training data set, a verification data set and a test data set in proportion.
- the case where the original data set is the RPI1807 data set is taken as an example for illustration.
- the data set may be divided into training data set, verification data set and test data set according to the ratio of 7:2:1.
- the ratio of positive and negative cases in each data set can be consistent with the distribution of the overall data set, that is, the ratio is 1807:1436, which is about 1.25:1.
- 1250 positive examples and 1000 negative examples can be selected as the training data set
- 360 positive examples and 280 negative examples can be selected as the verification data set
- 180 positive examples and 140 negative examples can be selected as the test data set.
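A proportional split that keeps the positive-to-negative ratio in each subset can be sketched as follows. The function name, the fixed random seed and the placeholder pair identifiers are assumptions of this sketch. Because it uses integer division on the full 1807 and 1436 pairs, its subset sizes differ slightly from the rounded example numbers given above.

```python
import random


def stratified_split(positives, negatives, ratios=(7, 2, 1), seed=0):
    """Split positives and negatives separately so each subset keeps the class ratio."""
    rng = random.Random(seed)
    total = sum(ratios)
    splits = {"train": [], "val": [], "test": []}
    for label, pairs in ((1, positives), (0, negatives)):
        shuffled = list(pairs)
        rng.shuffle(shuffled)
        n_train = len(shuffled) * ratios[0] // total
        n_val = len(shuffled) * ratios[1] // total
        splits["train"] += [(p, label) for p in shuffled[:n_train]]
        splits["val"] += [(p, label) for p in shuffled[n_train:n_train + n_val]]
        splits["test"] += [(p, label) for p in shuffled[n_train + n_val:]]
    return splits


# Placeholder identifiers standing in for the RNA-protein pairs.
positives = [f"pos_{i}" for i in range(1807)]
negatives = [f"neg_{i}" for i in range(1436)]
splits = stratified_split(positives, negatives)
```

Every pair ends up in exactly one subset, and the remainder left over by integer division is assigned to the test subset.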
- the training data set can be input into multiple interaction prediction models, and the corresponding model parameters can be determined by using the backpropagation algorithm to obtain the first joint model.
- the training data set may be input into each interaction prediction model, and the model parameters may be adjusted.
- model parameters can be weight parameters, bias parameters, intercept parameters, and so on.
- each model parameter may be updated using a stochastic gradient descent algorithm. According to the principle of backpropagation, the objective function is continuously calculated, and the model parameters are updated according to the objective function. When the objective function converges to the minimum value, the training of the model parameters is completed, so as to obtain the first joint model.
- the verification data set can be input into the first joint model to verify the performance of the first joint model, and the second joint model can be obtained according to the verification result.
- a set of hyperparameters may be initialized first, and multiple interaction prediction models are continuously trained using the training data set to obtain the first joint model.
- the hyperparameters can be the learning rate, the number of CNN layers, the size of the convolution kernel, etc.
- the verification data set can be input into the trained first joint model to verify the prediction accuracy of the first joint model.
- if the prediction accuracy reaches the preset accuracy threshold, the current first joint model can be used as the second joint model, that is, as the final trained model.
- the final performance of the model can be tested using the test dataset on the trained model.
- if the prediction accuracy does not reach the preset accuracy threshold, a set of hyperparameters can be reset, and the interaction prediction models can be trained and verified in turn using the training data set and the verification data set. When the prediction accuracy obtained by the trained interaction prediction models on the verification data set reaches the preset accuracy threshold, the final performance of the prediction model can be tested with a new test data set.
- the accuracy rate of the second joint model may be determined according to the test data set, and a third joint model is obtained when the accuracy rate is greater than a preset threshold, and the third joint model includes a plurality of trained interaction prediction models.
- after the second joint model is obtained, each RNA-protein pair in the test data set can be input into the second joint model to judge the accuracy of the second joint model. If the accuracy of the model is greater than the preset accuracy threshold, a third joint model is obtained.
- the third joint model includes multiple trained interaction prediction models, and then multiple interaction prediction models can be used to predict the interaction between unknown RNA-protein pairs.
- the test data set can also be used to judge the Matthews correlation coefficient of the second joint model.
- the Matthews correlation coefficient refers to the correlation coefficient between the actual classification and the predicted classification, and its value range is [-1, 1]. A larger value indicates that the predicted value is more correlated with the real value, and a value of 1 indicates that the predicted result is completely correct. If the Matthews correlation coefficient of the model is greater than a preset threshold, a third joint model is obtained.
- the test data set can also be used to judge the specificity and recall rate of the second joint model, which is not specifically limited in the present disclosure. For example, if the accuracy rate of the second joint model is not greater than the preset accuracy rate threshold, a new training data set can be obtained to train the model parameters of each interaction prediction model again, so as to continuously improve the model performance.
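The evaluation indicators mentioned above can be computed from the confusion counts of a test run, as in the following sketch. The formulas are the standard definitions of accuracy, recall, specificity and the Matthews correlation coefficient; the function names, the convention of returning 0 when a denominator is zero, and the example labels are assumptions of this sketch.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true positives, true negatives, false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def evaluate(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        # Returning 0 for an undefined MCC is a convention chosen for this sketch.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Invented labels and predictions for illustration.
scores = evaluate([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```

A model would then be accepted as the third joint model when the chosen indicator, for example `scores["accuracy"]` or `scores["mcc"]`, exceeds its preset threshold.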
- parameters can be adjusted by using a training data set and a verification data set to train an optimal prediction model, and then the generalization performance of the prediction model can be tested by using a test data set.
- the model parameters in the LR model, CNN model and BiLSTM model can be trained at the same time.
- the model parameters of the three interaction prediction models can be adjusted at the same time according to the loss value calculated from the joint prediction value and the label value of each RNA-protein pair in the training data set. Through multiple rounds of backpropagation, the model parameters of the three interaction prediction models can all tend to converge, or the training can be terminated after a certain number of iterations is reached.
- the three interaction prediction models, namely the LR model, the CNN model and the BiLSTM model, can be trained at the same time, thereby realizing joint learning of the three interaction prediction models. This not only ensures higher precision and accuracy of each interaction prediction model in predicting RNA-protein pair interactions, but also improves the training efficiency of each interaction prediction model.
- after the training of each model is completed, the interaction prediction models finally obtained can be used to predict the interaction of unknown RNA-protein pairs, and the prediction result of the RNA-protein pair interaction can be output to the terminal device for users to view.
- the same training data set, verification data set and test data set can be used to train and test the individual LR model, CNN model, BiLSTM model and the joint learning model of the three.
- the performance of each prediction model can be evaluated using two performance indicators, accuracy rate and Matthews correlation coefficient.
- the accuracy rate and Matthews correlation coefficient of the joint learning model are superior to those of the individual LR model, CNN model and BiLSTM model, that is, the joint learning model has better predictive ability than any single model.
- At least one RNA sequence can also be obtained, and a protein sequence that interacts with each input RNA sequence can be searched in the database through multiple interaction prediction models.
- multiple interaction prediction models may be jointly trained in advance by using the original data set with reference to FIG. 6 .
- all protein sequences participating in the joint training can be stored in the database.
- the database can also include other protein sequences that have not participated in the joint training, that is, the number of protein sequences in the database can be arbitrary. The database can also include any number of RNA sequences, for example including but not limited to all RNA sequences participating in the joint training, which is not specifically limited in this disclosure.
- each input RNA sequence can be combined with all protein sequences in the database to form several RNA-protein pairs.
- the jointly learned multiple interaction prediction models can be applied according to step S220 to step S250, so as to predict the interaction of each RNA-protein pair.
- feature extraction and vectorization processing can be performed on each RNA-protein pair, and the obtained sequence features, RNA sequence representation vectors and protein sequence representation vectors in the RNA-protein pair are input into multiple interaction prediction models, and Get interaction predictions for each RNA-protein pair.
- An interaction prediction value of 1 indicates that the RNA-protein pair has an interaction
- an interaction prediction value of 0 indicates that the RNA-protein pair has no interaction.
- all RNA-protein pairs with an interaction prediction value of 1 can be screened out, and the protein sequence in each of these RNA-protein pairs can be output to the terminal device so that users can view the protein sequences that interact with the input RNA sequence.
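The database screening described above can be sketched as a simple filter. The function `screen_partners` and the callable `predict` are hypothetical names; `predict` stands for the whole jointly trained pipeline (feature extraction, vectorization, fusion and thresholding) and is replaced here by a meaningless toy rule so that the example runs.

```python
def screen_partners(rna_seq, protein_db, predict):
    """Pair rna_seq with every protein in the database and keep predicted interactions."""
    return [protein for protein in protein_db if predict(rna_seq, protein) == 1]


def toy_predict(rna_seq, protein_seq):
    # Placeholder rule with no biological meaning; a real predictor would be
    # the trained joint model returning 1 (interaction) or 0 (no interaction).
    return 1 if "K" in protein_seq else 0


# Invented protein sequences for illustration.
hits = screen_partners("AUCUGAAAU", ["MKV", "GGA", "KLL"], toy_predict)
```

The reverse search, from an input protein sequence to RNA sequences in the database, follows the same pattern with the roles of the two sequences exchanged.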
- At least one protein sequence can also be obtained, and the RNA sequence that interacts with each input protein sequence can be searched in the database through multiple interaction prediction models.
- each input protein sequence can be combined with all RNA sequences in the database to form several RNA-protein pairs.
- the jointly learned multiple interaction prediction models can be applied according to step S220 to step S250, so as to predict the interaction of each RNA-protein pair.
- feature extraction and vectorization processing can be performed on each RNA-protein pair, and the obtained sequence features, RNA sequence representation vectors and protein sequence representation vectors in the RNA-protein pair are input into multiple interaction prediction models, and Get interaction predictions for each RNA-protein pair.
- An interaction prediction value of 1 indicates that the RNA-protein pair has an interaction
- an interaction prediction value of 0 indicates that the RNA-protein pair has no interaction. Then, all RNA-protein pairs with an interaction prediction value of 1 can be screened out, and the RNA sequence in each of these RNA-protein pairs is output to the terminal device so that users can view the RNA sequences that interact with the input protein sequence.
- the RNA-protein pair to be predicted is obtained; feature extraction is performed on the RNA-protein pair to be predicted to obtain the sequence features of the RNA-protein pair to be predicted; the RNA-protein pair to be predicted is vectorized to obtain the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted; based on the sequence features of the RNA-protein pair to be predicted and the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, multiple interaction prediction models are used to respectively obtain multiple interaction prediction values of the RNA-protein pair to be predicted; and the interaction between the RNA and the protein is determined according to the plurality of interaction prediction values.
- on the one hand, the information in the RNA sequences and protein sequences can be fully mined, so as to accurately predict the interaction between RNA and proteins;
- on the other hand, effectively combining the characteristics of the individual interaction prediction models can further improve the accuracy of predicting the interaction between RNA and protein.
- the RNA-protein interaction prediction device 700 may include a data acquisition module 710, a feature extraction module 720, a data vectorization module 730, an interaction prediction module 740, and an interaction determination module 750, wherein:
- a data acquisition module 710 configured to acquire RNA-protein pairs to be predicted
- Feature extraction module 720 for performing feature extraction on the RNA-protein pair to be predicted, to obtain the sequence features of the RNA-protein pair to be predicted;
- the data vectorization module 730 is used to vectorize the RNA-protein pair to be predicted, and obtain the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted;
- An interaction prediction module 740 configured to use multiple interaction prediction models, based on the sequence features of the RNA-protein pair to be predicted and the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, to respectively obtain multiple interaction prediction values of the RNA-protein pair to be predicted;
- An interaction determination module 750 configured to determine the interaction between the RNA and the protein according to the plurality of interaction prediction values.
- the feature extraction module 720 includes:
- the feature set acquisition module is used to obtain the original sequence feature set
- a feature determination module configured to determine the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set.
- the feature determination module includes:
- a sequence conversion unit for converting the RNA sequence and protein sequence in the RNA-protein pair to be predicted into k-mer subsequences respectively;
- the first sequence search unit is configured to search for each k-mer subsequence in the original sequence feature set, and obtain the sequence feature of the RNA-protein pair to be predicted according to the search result.
- the feature determination module includes:
- a sequence combination unit for combining the RNA k-mer subsequence and the protein k-mer subsequence to obtain a variety of RNA-protein k-mer subsequence pairs;
- the second sequence search unit is configured to search for each RNA-protein k-mer subsequence pair in the original sequence feature set, and obtain the sequence feature of the RNA-protein pair to be predicted according to the search result.
- the feature determination module includes:
- a sequence conversion unit for converting the RNA sequence and the protein sequence in the RNA-protein pair to be predicted into k-mer subsequences respectively, the k-mer subsequences including RNA k-mer subsequences and protein k-mer subsequences;
- a first sequence search unit configured to search for each k-mer subsequence in the original sequence feature set to obtain the first sequence feature
- a sequence combination unit for combining the RNA k-mer subsequence and the protein k-mer subsequence to obtain a variety of RNA-protein k-mer subsequence pairs;
- a second sequence search unit configured to search for each RNA-protein k-mer subsequence pair in the original sequence feature set to obtain a second sequence feature
- a feature splicing unit configured to form the sequence feature of the RNA-protein pair to be predicted from the first sequence feature and the second sequence feature.
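The lookup-and-splice flow of the feature determination module can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how k-mers are cut or how a search result is encoded, so the sketch assumes an overlapping sliding window and a 0/1 presence flag per feature. The function names, the values of `k`, and the tiny feature sets are hypothetical.

```python
from itertools import product


def kmers(seq, k):
    """Return the overlapping length-k subsequences of seq (stride 1 is an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def sequence_features(rna, protein, rna_features, protein_features,
                      pair_features, k_rna=3, k_protein=2):
    """Build the spliced feature vector for one RNA-protein pair.

    rna_features / protein_features / pair_features stand in for the
    original sequence feature set; their layout is an assumption.
    """
    rna_kmers = set(kmers(rna, k_rna))
    protein_kmers = set(kmers(protein, k_protein))

    # First sequence feature: is each single k-mer of the feature set present?
    first = [int(f in rna_kmers) for f in rna_features]
    first += [int(f in protein_kmers) for f in protein_features]

    # Second sequence feature: is each RNA-protein k-mer pair present?
    pairs = set(product(rna_kmers, protein_kmers))
    second = [int(p in pairs) for p in pair_features]

    # Feature splicing: concatenate the two parts.
    return first + second


features = sequence_features(
    "AUGCA", "MKV",
    rna_features=["AUG", "CCC"],
    protein_features=["KV", "AA"],
    pair_features=[("GCA", "MK"), ("AUG", "VV")],
)
print(features)
```

RNA and protein k-mers are kept in separate sets because the two alphabets share letters (for example A, C and G), so a merged set could report false matches.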
- The feature set acquisition module includes:
- a data set acquisition module configured to obtain the original data set;
- a feature extraction module configured to perform feature extraction on each RNA-protein pair in the original data set to obtain the original sequence feature set.
- The feature extraction module includes:
- a sequence generation unit configured to enumerate permutations and combinations of the basic units of RNA and of protein, respectively, to obtain k-mer subsequences;
- a variance calculation unit configured to count the occurrence frequency of each k-mer subsequence in the original data set and to calculate the variance of each k-mer subsequence from that frequency;
- a data set determination unit configured to determine the original sequence feature set according to the variance of each k-mer subsequence.
- The variance calculation unit includes:
- a frequency statistics subunit configured to count the number of occurrences of each k-mer subsequence in the original data set;
- a frequency calculation subunit configured to calculate the occurrence frequency of each k-mer subsequence in the original data set from the number of occurrences;
- a sequence labeling subunit configured to traverse the original data set and mark whether each k-mer subsequence appears in each RNA-protein pair;
- a variance calculation subunit configured to calculate the variance of each k-mer subsequence from its occurrence frequency in the original data set and its marker value in each RNA-protein pair.
- The variance calculation subunit is configured to calculate the variance $\mathrm{Var}_i$ of the i-th k-mer subsequence from the following quantities:
- the marker value of the i-th k-mer subsequence in the n-th RNA-protein pair;
- $\mathrm{Freq}_i$, the occurrence frequency of the i-th k-mer subsequence in the original data set;
- $N$, the total number of RNA-protein pairs in the original data set.
- The data set determination unit is configured to determine, according to the variance of each k-mer subsequence, the k-mer subsequences that satisfy a preset condition; the k-mer subsequences that satisfy the preset condition constitute the original sequence feature set.
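A minimal sketch of variance-based feature selection follows. The text names the inputs of the variance (per-pair marker values, the occurrence frequency, and the number of pairs) but not the formula or the preset condition, so several things here are assumptions: the marker is 0/1, the occurrence frequency is the fraction of pairs that contain the k-mer, the variance is the ordinary population variance of the marker around that frequency, and the condition is "variance above a threshold". All names are hypothetical.

```python
from itertools import product


def all_kmers(alphabet, k):
    """Enumerate every length-k string over the basic units (e.g. 'ACGU')."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def kmer_variances(pair_kmer_sets, candidates):
    """pair_kmer_sets: one set of observed k-mers per RNA-protein pair."""
    n = len(pair_kmer_sets)
    variances = {}
    for kmer in candidates:
        # Marker value: 1 if the k-mer appears in the pair, else 0.
        marks = [int(kmer in s) for s in pair_kmer_sets]
        freq = sum(marks) / n  # assumed definition of the occurrence frequency
        variances[kmer] = sum((m - freq) ** 2 for m in marks) / n
    return variances


def select_features(variances, threshold):
    """Keep k-mers whose variance exceeds the (assumed) preset threshold."""
    return sorted(k for k, v in variances.items() if v > threshold)


data = [{"AU", "GG"}, {"AU", "GG", "CC"}, {"GG"}, {"GG"}]
var = kmer_variances(data, ["AU", "GG", "CC"])
print(var, select_features(var, 0.2))
```

Under these assumptions a k-mer that appears in every pair, or in none, has zero variance and carries no discriminating information, which is why it would be dropped.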
- The feature extraction module also includes:
- a first itemset generation unit configured to convert the RNA sequence and the protein sequence in each RNA-protein pair into k-mer subsequences, which form a first candidate itemset; the k-mer subsequences include RNA k-mer subsequences and protein k-mer subsequences;
- a frequent itemset generation unit configured to count the occurrence frequency in the original data set of each k-mer subsequence in the first candidate itemset; the k-mer subsequences that meet a preset frequency threshold form a frequent itemset;
- a second itemset generation unit configured to cross-combine the RNA k-mer subsequences and the protein k-mer subsequences in the frequent itemset; the resulting k-mer subsequence pairs form a second candidate itemset;
- a support determination unit configured to count the occurrence frequency in the original data set of each k-mer subsequence pair in the second candidate itemset, giving the support of each k-mer subsequence pair;
- a feature set acquisition unit configured to form the original sequence feature set from the k-mer subsequence pairs whose support meets a preset condition.
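The two-stage itemset procedure above resembles frequent-itemset mining, and can be sketched as below. The sketch assumes that "occurrence frequency" and "support" both mean the fraction of RNA-protein pairs in which the item occurs, and that the preset conditions are simple lower bounds; the function name, the thresholds, and the toy data are hypothetical.

```python
from itertools import product


def frequent_pair_features(dataset, min_freq, min_support):
    """dataset: list of (rna_kmer_set, protein_kmer_set), one per RNA-protein pair."""
    n = len(dataset)

    # First candidate itemset: count in how many pairs each single k-mer occurs.
    rna_count, protein_count = {}, {}
    for rna_set, protein_set in dataset:
        for km in rna_set:
            rna_count[km] = rna_count.get(km, 0) + 1
        for km in protein_set:
            protein_count[km] = protein_count.get(km, 0) + 1

    # Frequent itemset: single k-mers that meet the frequency threshold.
    frequent_rna = sorted(km for km, c in rna_count.items() if c / n >= min_freq)
    frequent_protein = sorted(km for km, c in protein_count.items() if c / n >= min_freq)

    # Second candidate itemset: cross-combine frequent RNA and protein k-mers,
    # then keep the pairs whose support meets the threshold.
    features = []
    for r, p in product(frequent_rna, frequent_protein):
        support = sum(1 for rs, ps in dataset if r in rs and p in ps) / n
        if support >= min_support:
            features.append((r, p))
    return features


dataset = [
    ({"AU", "GC"}, {"MK"}),
    ({"AU"}, {"MK", "KV"}),
    ({"GC"}, {"KV"}),
    ({"AU"}, {"MK"}),
]
print(frequent_pair_features(dataset, min_freq=0.5, min_support=0.5))
```

Filtering single k-mers first keeps the number of cross-combined pairs small, since only k-mers that are already frequent can form a frequent pair.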
- The data vectorization module 730 includes:
- a sequence conversion unit configured to convert the RNA sequence and the protein sequence in the RNA-protein pair to be predicted into k-mer subsequences, the k-mer subsequences including M RNA k-mer subsequences and N protein k-mer subsequences;
- a first vectorization unit configured to vectorize each RNA k-mer subsequence to obtain M RNA k-mer vectors;
- a first splicing unit configured to splice the M RNA k-mer vectors to obtain the RNA sequence representation vector;
- a second vectorization unit configured to vectorize each protein k-mer subsequence to obtain N protein k-mer vectors;
- a second splicing unit configured to splice the N protein k-mer vectors to obtain the protein sequence representation vector.
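As a rough illustration of the vectorize-then-splice step, the sketch below maps each k-mer to a vector through a lookup table and concatenates the results in sequence order. How the k-mer vectors are actually produced is not stated in this passage, so the `embedding` table is purely hypothetical, as is the choice to splice by flat concatenation rather than by stacking into a matrix.

```python
def kmers(seq, k):
    """Return the overlapping length-k subsequences of seq (stride 1 is an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def representation_vector(seq, k, embedding):
    """Vectorize each k-mer with a lookup table and splice the vectors end to end."""
    vector = []
    for km in kmers(seq, k):
        vector.extend(embedding[km])  # one k-mer vector per subsequence
    return vector


# Hypothetical 2-dimensional k-mer vectors, for illustration only.
rna_embedding = {"AU": [1.0, 0.0], "UG": [0.0, 1.0]}
rna_vector = representation_vector("AUG", 2, rna_embedding)
print(rna_vector)
```

The same function would be applied to the protein sequence with its own table to obtain the protein sequence representation vector.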
- The interaction prediction module 740 includes:
- a first prediction unit configured to input the sequence features of the RNA-protein pair to be predicted into at least one first interaction prediction model to obtain at least one first interaction prediction value;
- a second prediction unit configured to input the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into at least one second interaction prediction model to obtain at least one second interaction prediction value.
- Alternatively, the interaction prediction module 740 includes:
- a third prediction unit configured to input the sequence features of the RNA-protein pair to be predicted into at least one traditional machine learning model to obtain at least one first interaction prediction value;
- a fourth prediction unit configured to input the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into at least one deep learning model to obtain at least one second interaction prediction value.
- Each deep learning model includes at least two sub-deep learning models; the fourth prediction unit includes:
- a first feature generation unit configured to input the RNA sequence representation vector of the RNA-protein pair to be predicted into the first sub-deep learning model to obtain a first sequence feature;
- a second feature generation unit configured to input the protein sequence representation vector of the RNA-protein pair to be predicted into the second sub-deep learning model to obtain a second sequence feature;
- a feature fusion unit configured to fuse the first sequence feature and the second sequence feature and to obtain the second interaction prediction value from the fused feature.
- The traditional machine learning model includes at least one of a logistic regression model, a support vector machine model, and a decision tree model;
- the deep learning model includes at least one of a convolutional neural network model and a recurrent neural network model.
- The interaction determination module 750 includes:
- a weighted calculation unit configured to calculate a weighted sum of the multiple interaction prediction values;
- an interaction determination unit configured to determine the interaction between the RNA and the protein according to the calculation result.
- The interaction determination unit includes:
- a first interaction determination subunit configured to determine that there is an interaction between the RNA and the protein if the calculation result is greater than a preset interaction prediction value threshold;
- a second interaction determination subunit configured to determine that there is no interaction between the RNA and the protein if the calculation result is less than or equal to the preset interaction prediction value threshold.
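The weighted sum and threshold rule can be written down directly. In this sketch the weights, the threshold of 0.5, and the function name are illustrative assumptions; only the comparison rule (strictly greater means interaction, otherwise no interaction) comes from the text.

```python
def determine_interaction(prediction_values, weights, threshold=0.5):
    """Return (weighted score, True if an interaction is predicted)."""
    if len(prediction_values) != len(weights):
        raise ValueError("one weight is needed per prediction value")
    score = sum(w * v for w, v in zip(weights, prediction_values))
    # Strictly greater than the threshold: interaction; otherwise: no interaction.
    return score, score > threshold


score, interacts = determine_interaction([0.9, 0.6, 0.3], [0.5, 0.25, 0.25])
print(score, interacts)
```

A score exactly equal to the threshold falls on the "no interaction" side, matching the "less than or equal to" branch of the second subunit.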
- The RNA-protein interaction prediction device 700 also includes:
- a joint training module configured to jointly train the multiple interaction prediction models.
- The joint training module includes:
- a training data acquisition unit configured to acquire a training data set that includes positive-example RNA-protein pairs and negative-example RNA-protein pairs;
- a first predicted value output unit configured to use the multiple interaction prediction models to obtain multiple interaction prediction values for each RNA-protein pair in the training data set;
- a second predicted value output unit configured to obtain a joint prediction value for each RNA-protein pair in the training data set from the multiple interaction prediction values;
- a loss value calculation unit configured to apply a loss function to the joint prediction value and the label value of each RNA-protein pair in the training data set to obtain a corresponding loss value;
- a model parameter adjustment unit configured to adjust the model parameters of the multiple interaction prediction models according to the loss value.
- The second predicted value output unit includes:
- a second predicted value output subunit configured to calculate a weighted sum of the multiple interaction prediction values of each RNA-protein pair in the training data set to obtain the joint prediction value of each RNA-protein pair.
- The second predicted value output subunit is configured to calculate the joint prediction value $y_{out}$ of each RNA-protein pair in the training data set according to $y_{out} = \alpha y_1 + \beta y_2 + \gamma y_3$, where:
- $y_1$ is the output value of the traditional machine learning model;
- $y_2$ is the output value of the convolutional neural network model;
- $y_3$ is the output value of the recurrent neural network model;
- $\alpha$, $\beta$, $\gamma$ are the weight parameters of the traditional machine learning model, the convolutional neural network model, and the recurrent neural network model, respectively.
- The model parameter adjustment unit is configured to iteratively update the model parameters of the multiple interaction prediction models based on the loss value and, when an iteration termination condition is met, to complete the training of those model parameters, so that the trained interaction prediction models can be used to predict the interaction of the RNA-protein pair to be predicted.
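The joint training loop (joint prediction value, loss, iterative update, termination condition) can be illustrated with a deliberately reduced sketch. It treats the three model outputs as fixed numbers and updates only the fusion weights of the weighted sum by gradient descent, whereas the text adjusts the parameters of the prediction models themselves. The mean squared error loss, the learning rate, and the stopping rule are all assumptions, since the text does not specify them.

```python
def joint_value(outputs, weights):
    """Weighted sum of the per-model outputs for one RNA-protein pair."""
    return sum(w * y for w, y in zip(weights, outputs))


def mse_loss(samples, weights):
    """Mean squared error between joint prediction values and labels (assumed loss)."""
    return sum((joint_value(o, weights) - t) ** 2 for o, t in samples) / len(samples)


def train_weights(samples, weights, lr=0.1, max_iters=1000, tol=1e-6):
    """Gradient descent on the fusion weights until the loss stops improving."""
    weights = list(weights)
    loss = mse_loss(samples, weights)
    for _ in range(max_iters):
        grads = [0.0] * len(weights)
        for outputs, label in samples:
            err = joint_value(outputs, weights) - label
            for j, y in enumerate(outputs):
                grads[j] += 2 * err * y / len(samples)
        weights = [w - lr * g for w, g in zip(weights, grads)]
        new_loss = mse_loss(samples, weights)
        # Assumed termination condition: negligible change in the loss.
        if abs(loss - new_loss) < tol:
            loss = new_loss
            break
        loss = new_loss
    return weights, loss


# (model outputs, label): label 1.0 is a positive example, 0.0 a negative one.
samples = [
    ([0.9, 0.8, 0.7], 1.0),
    ([0.2, 0.1, 0.3], 0.0),
    ([0.8, 0.9, 0.6], 1.0),
    ([0.1, 0.3, 0.2], 0.0),
]
initial = [1 / 3, 1 / 3, 1 / 3]
trained, final_loss = train_weights(samples, initial)
print(trained, final_loss)
```

In a full implementation the same loss would be back-propagated into the sub-models as well, so that their parameters and the fusion weights are trained together.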
- The RNA-protein interaction prediction device 700 also includes:
- a data output module configured to output the prediction result for the interaction between the RNA and the protein.
- Each module in the above device can be a general-purpose processor, such as a central processing unit or a network processor; it can also be a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Each module may also be implemented in software, firmware, or other forms. The processors in the above device may be independent processors or may be integrated together.
- Exemplary embodiments of the present disclosure also provide a computer-readable storage medium on which a program product capable of implementing the above-mentioned method of this specification is stored.
- Various aspects of the present disclosure can also be implemented in the form of a program product, which includes program code.
- When the program product is run on an electronic device, the program code causes the electronic device to execute the steps according to various exemplary embodiments of the present disclosure described in the "Exemplary Methods" section of this specification.
- The program product may take the form of a portable compact disc read-only memory (CD-ROM) that includes program code, and may run on an electronic device such as a personal computer.
- The program product of the present disclosure is not limited thereto.
- A readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
- A program product may take the form of any combination of one or more readable media.
- The readable medium may be a readable signal medium or a readable storage medium.
- The readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more conductors, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
- A computer-readable signal medium may include a data signal that carries readable program code in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
- A readable signal medium may also be any readable medium, other than a readable storage medium, that can transmit, propagate, or transport a program for use by or in conjunction with an instruction execution system, apparatus, or device.
- Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
- Program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as "C" or similar programming languages.
- The program code may execute entirely on the user's computing device, partly on the user's computing device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
- The remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).
- Exemplary embodiments of the present disclosure also provide an electronic device capable of implementing the above method.
- An electronic device 800 according to such an exemplary embodiment of the present disclosure is described below with reference to FIG. 8.
- The electronic device 800 shown in FIG. 8 is only an example and should not limit the functions or scope of use of the embodiments of the present disclosure.
- The electronic device 800 may take the form of a general-purpose computing device.
- Components of the electronic device 800 may include, but are not limited to: at least one processing unit 810, at least one storage unit 820, a bus 830 connecting different system components (including the storage unit 820 and the processing unit 810), and a display unit 840.
- The storage unit 820 stores program code that can be executed by the processing unit 810, so that the processing unit 810 executes the steps according to various exemplary embodiments of the present disclosure described in the "Exemplary Methods" section of this specification.
- The processing unit 810 may execute any one or more of the method steps in FIG. 2 to FIG. 6.
- The storage unit 820 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 821 and/or a cache storage unit 822, and may further include a read-only storage unit (ROM) 823.
- The storage unit 820 may also include a program/utility tool 824 having a set (at least one) of program modules 825, such program modules 825 including but not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
- The bus 830 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
- The electronic device 800 can also communicate with one or more external devices 900 (such as keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device 800, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 800 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 850.
- The electronic device 800 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 860. As shown, the network adapter 860 communicates with the other modules of the electronic device 800 through the bus 830.
- Other hardware and/or software modules may be used in conjunction with the electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
- The RNA-protein interaction prediction method described in this disclosure can be performed by the processing unit 810 of the electronic device.
- The RNA-protein pair to be predicted (or the RNA sequence and protein sequence to be predicted), the original data set, and the training data set used to train each interaction prediction model can be input through the input interface 850.
- For example, the RNA-protein pair to be predicted, the original data set, and the training data set used to train each interaction prediction model are input through the user interface of the electronic device.
- The prediction result for the interaction of the RNA-protein pair to be predicted can be output to the external device 900 through the output interface 850 for viewing by the user.
- The technical solutions according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and which includes several instructions that cause a computing device (such as a personal computer, a server, a terminal device, or a network device) to execute the method according to the exemplary embodiments of the present disclosure.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Medical Informatics (AREA)
- General Health & Medical Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Medicinal Chemistry (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Public Health (AREA)
- Pharmacology & Pharmacy (AREA)
- Crystallography & Structural Chemistry (AREA)
- Evolutionary Computation (AREA)
- Epidemiology (AREA)
- Bioethics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Analytical Chemistry (AREA)
- Molecular Biology (AREA)
- Genetics & Genomics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Claims (27)
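- An RNA-protein interaction prediction method, characterized by comprising: obtaining an RNA-protein pair to be predicted; performing feature extraction on the RNA-protein pair to be predicted to obtain sequence features of the RNA-protein pair to be predicted; vectorizing the RNA-protein pair to be predicted to obtain an RNA sequence representation vector and a protein sequence representation vector of the RNA-protein pair to be predicted; obtaining, with multiple interaction prediction models, multiple interaction prediction values for the RNA-protein pair to be predicted, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector of that pair; and determining the interaction between the RNA and the protein according to the multiple interaction prediction values.
- The RNA-protein interaction prediction method according to claim 1, characterized in that performing feature extraction on the RNA-protein pair to be predicted to obtain the sequence features of the RNA-protein pair to be predicted comprises: obtaining an original sequence feature set; and determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set.
- The RNA-protein interaction prediction method according to claim 2, characterized in that determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set comprises: converting the RNA sequence and the protein sequence in the RNA-protein pair to be predicted into k-mer subsequences; and searching for each k-mer subsequence in the original sequence feature set and obtaining the sequence features of the RNA-protein pair to be predicted from the search results.
- The RNA-protein interaction prediction method according to claim 2, characterized in that determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set comprises: converting the RNA sequence and the protein sequence in the RNA-protein pair to be predicted into k-mer subsequences, the k-mer subsequences including RNA k-mer subsequences and protein k-mer subsequences; combining the RNA k-mer subsequences and the protein k-mer subsequences to obtain multiple RNA-protein k-mer subsequence pairs; and searching for each RNA-protein k-mer subsequence pair in the original sequence feature set and obtaining the sequence features of the RNA-protein pair to be predicted from the search results.
- The RNA-protein interaction prediction method according to claim 2, characterized in that determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set comprises: converting the RNA sequence and the protein sequence in the RNA-protein pair to be predicted into k-mer subsequences, the k-mer subsequences including RNA k-mer subsequences and protein k-mer subsequences; searching for each k-mer subsequence in the original sequence feature set to obtain a first sequence feature; combining the RNA k-mer subsequences and the protein k-mer subsequences to obtain multiple RNA-protein k-mer subsequence pairs; searching for each RNA-protein k-mer subsequence pair in the original sequence feature set to obtain a second sequence feature; and forming the sequence features of the RNA-protein pair to be predicted from the first sequence feature and the second sequence feature.
- The RNA-protein interaction prediction method according to claim 2, characterized in that obtaining the original sequence feature set comprises: obtaining an original data set; and performing feature extraction on each RNA-protein pair in the original data set to obtain the original sequence feature set.
- The RNA-protein interaction prediction method according to claim 6, characterized in that performing feature extraction on each RNA-protein pair in the original data set to obtain the original sequence feature set comprises: enumerating permutations and combinations of the basic units of RNA and of protein, respectively, to obtain k-mer subsequences; counting the occurrence frequency of each k-mer subsequence in the original data set and calculating the variance of each k-mer subsequence from the occurrence frequency; and determining the original sequence feature set according to the variance of each k-mer subsequence.
- The RNA-protein interaction prediction method according to claim 7, characterized in that counting the occurrence frequency of each k-mer subsequence in the original data set and calculating the variance of each k-mer subsequence from the occurrence frequency comprises: counting the number of occurrences of each k-mer subsequence in the original data set; calculating the occurrence frequency of each k-mer subsequence in the original data set from the number of occurrences; traversing the original data set and marking whether each k-mer subsequence appears in each RNA-protein pair; and calculating the variance of each k-mer subsequence from its occurrence frequency in the original data set and its marker value in each RNA-protein pair.
- The RNA-protein interaction prediction method according to claim 7, characterized in that determining the original sequence feature set according to the variance of each k-mer subsequence comprises: determining, according to the variance of each k-mer subsequence, the k-mer subsequences that satisfy a preset condition, the k-mer subsequences that satisfy the preset condition forming the original sequence feature set.
- The RNA-protein interaction prediction method according to claim 6, characterized in that performing feature extraction on each RNA-protein pair in the original data set to obtain the original sequence feature set further comprises: converting the RNA sequence and the protein sequence in each RNA-protein pair into k-mer subsequences, the k-mer subsequences forming a first candidate itemset and including RNA k-mer subsequences and protein k-mer subsequences; counting the occurrence frequency in the original data set of each k-mer subsequence in the first candidate itemset, the k-mer subsequences that meet a preset occurrence frequency threshold forming a frequent itemset; cross-combining the RNA k-mer subsequences and the protein k-mer subsequences in the frequent itemset, the resulting k-mer subsequence pairs forming a second candidate itemset; counting the occurrence frequency in the original data set of each k-mer subsequence pair in the second candidate itemset to obtain the support of each k-mer subsequence pair; and forming the original sequence feature set from the k-mer subsequence pairs whose support satisfies a preset condition.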
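- The RNA-protein interaction prediction method according to claim 1, characterized in that vectorizing the RNA-protein pair to be predicted to obtain the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted comprises: converting the RNA sequence and the protein sequence in the RNA-protein pair to be predicted into k-mer subsequences, the k-mer subsequences including M RNA k-mer subsequences and N protein k-mer subsequences; vectorizing each RNA k-mer subsequence to obtain M RNA k-mer vectors; splicing the M RNA k-mer vectors to obtain the RNA sequence representation vector; vectorizing each protein k-mer subsequence to obtain N protein k-mer vectors; and splicing the N protein k-mer vectors to obtain the protein sequence representation vector.
- The RNA-protein interaction prediction method according to claim 1, characterized in that obtaining, with multiple interaction prediction models, multiple interaction prediction values for the RNA-protein pair to be predicted, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector of that pair, comprises: inputting the sequence features of the RNA-protein pair to be predicted into at least one first interaction prediction model to obtain at least one first interaction prediction value; and inputting the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into at least one second interaction prediction model to obtain at least one second interaction prediction value.
- The RNA-protein interaction prediction method according to claim 1, characterized in that obtaining, with multiple interaction prediction models, multiple interaction prediction values for the RNA-protein pair to be predicted, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector of that pair, comprises: inputting the sequence features of the RNA-protein pair to be predicted into at least one traditional machine learning model to obtain at least one first interaction prediction value; and inputting the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into at least one deep learning model to obtain at least one second interaction prediction value.
- The RNA-protein interaction prediction method according to claim 14, characterized in that each deep learning model includes at least two sub-deep learning models, and inputting the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into at least one deep learning model to obtain at least one second interaction prediction value comprises: inputting the RNA sequence representation vector of the RNA-protein pair to be predicted into a first sub-deep learning model to obtain a first sequence feature; inputting the protein sequence representation vector of the RNA-protein pair to be predicted into a second sub-deep learning model to obtain a second sequence feature; and fusing the first sequence feature and the second sequence feature and obtaining the second interaction prediction value from the fused feature.
- The RNA-protein interaction prediction method according to claim 14, characterized in that the traditional machine learning model includes at least one of a logistic regression model, a support vector machine model, and a decision tree model, and the deep learning model includes at least one of a convolutional neural network model and a recurrent neural network model.
- The RNA-protein interaction prediction method according to claim 1, characterized in that determining the interaction between the RNA and the protein according to the multiple interaction prediction values comprises: calculating a weighted sum of the multiple interaction prediction values; and determining the interaction between the RNA and the protein according to the calculation result.
- The RNA-protein interaction prediction method according to claim 17, characterized in that determining the interaction between the RNA and the protein according to the calculation result comprises: if the calculation result is greater than a preset interaction prediction value threshold, determining that there is an interaction between the RNA and the protein; and if the calculation result is less than or equal to the preset interaction prediction value threshold, determining that there is no interaction between the RNA and the protein.
- The RNA-protein interaction prediction method according to any one of claims 1-18, characterized in that the method further comprises: jointly training the multiple interaction prediction models.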
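- The RNA-protein interaction prediction method according to claim 19, characterized in that jointly training the multiple interaction prediction models comprises: acquiring a training data set that includes positive-example RNA-protein pairs and negative-example RNA-protein pairs; using the multiple interaction prediction models to obtain multiple interaction prediction values for each RNA-protein pair in the training data set; obtaining a joint prediction value for each RNA-protein pair in the training data set from the multiple interaction prediction values; applying a loss function to the joint prediction value and the label value of each RNA-protein pair in the training data set to obtain a corresponding loss value; and adjusting the model parameters of the multiple interaction prediction models according to the loss value.
- The RNA-protein interaction prediction method according to claim 20, characterized in that obtaining the joint prediction value for each RNA-protein pair in the training data set from the multiple interaction prediction values comprises: calculating a weighted sum of the multiple interaction prediction values of each RNA-protein pair in the training data set to obtain the joint prediction value of each RNA-protein pair.
- The RNA-protein interaction prediction method according to claim 21, characterized in that calculating a weighted sum of the multiple interaction prediction values of each RNA-protein pair in the training data set to obtain the joint prediction value of each RNA-protein pair comprises: calculating the joint prediction value $y_{out}$ of each RNA-protein pair in the training data set according to $y_{out} = \alpha y_1 + \beta y_2 + \gamma y_3$, where $y_1$ is the output value of the traditional machine learning model, $y_2$ is the output value of the convolutional neural network model, $y_3$ is the output value of the recurrent neural network model, and $\alpha$, $\beta$, $\gamma$ are the weight parameters of the traditional machine learning model, the convolutional neural network model, and the recurrent neural network model, respectively.
- The RNA-protein interaction prediction method according to claim 20, characterized in that adjusting the model parameters of the multiple interaction prediction models according to the loss value comprises: iteratively updating the model parameters of the multiple interaction prediction models based on the loss value and, when an iteration termination condition is met, completing the training of the model parameters of the multiple interaction prediction models, so that the trained interaction prediction models are used to predict the interaction of the RNA-protein pair to be predicted.
- The RNA-protein interaction prediction method according to claim 1, characterized in that the method further comprises: outputting the prediction result for the interaction between the RNA and the protein.
- An RNA-protein interaction prediction device, characterized by comprising: a data acquisition module configured to obtain an RNA-protein pair to be predicted; a feature extraction module configured to perform feature extraction on the RNA-protein pair to be predicted to obtain sequence features of the RNA-protein pair to be predicted; a data vectorization module configured to vectorize the RNA-protein pair to be predicted to obtain an RNA sequence representation vector and a protein sequence representation vector of the RNA-protein pair to be predicted; an interaction prediction module configured to obtain, with multiple interaction prediction models, multiple interaction prediction values for the RNA-protein pair to be predicted, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector of that pair; and an interaction determination module configured to determine the interaction between the RNA and the protein according to the multiple interaction prediction values.
- A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1-24.
- An electronic device, characterized by comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform the method according to any one of claims 1-24 by executing the executable instructions.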
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
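| PCT/CN2021/121089 WO2023044927A1 (zh) | 2021-09-27 | 2021-09-27 | Method and apparatus for predicting RNA-protein interaction, medium and electronic device |
| CN202180002692.1A CN116490926A (zh) | 2021-09-27 | 2021-09-27 | Method and apparatus for predicting RNA-protein interaction, medium and electronic device |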
| US17/915,391 US20250054571A1 (en) | 2021-09-27 | 2021-09-27 | Method and apparatus for predicting rna-protein interaction, medium and electronic device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2021/121089 WO2023044927A1 (zh) | 2021-09-27 | 2021-09-27 | Method and apparatus for predicting RNA-protein interaction, medium and electronic device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023044927A1 (zh) | 2023-03-30 |
Family
ID=85719929
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2021/121089 (Ceased), WO2023044927A1 (zh) | 2021-09-27 | 2021-09-27 | Method and apparatus for predicting RNA-protein interaction, medium and electronic device |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US20250054571A1 (zh) |
| CN (1) | CN116490926A (zh) |
| WO (1) | WO2023044927A1 (zh) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
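| CN119049566A (zh) * | 2024-07-02 | 2024-11-29 | Zhejiang University Yangtze River Delta Smart Oasis Innovation Center | Nucleic acid sequence feature mining method based on an autoregressive large model |
| CN119252333A (zh) * | 2023-07-03 | 2025-01-03 | Beijing DP Technology Co., Ltd. | Processing method and apparatus for a mean ribosome load prediction system |
| WO2025066631A1 (zh) * | 2023-09-27 | 2025-04-03 | BOE Technology Group Co., Ltd. | Training of a relationship prediction model, and method and apparatus for predicting interaction relationships |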
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20100070438A1 (en) * | 2006-10-31 | 2010-03-18 | Keio University | Method for predicting interaction between protein and chemical |
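| CN111192631A (zh) * | 2020-01-02 | 2020-05-22 | Institute of Computing Technology, Chinese Academy of Sciences | Method and system for constructing a model for predicting protein-RNA interaction binding sites |
| CN111916148A (zh) * | 2020-08-13 | 2020-11-10 | China Jiliang University | Method for predicting protein interactions |
| CN112420127A (zh) * | 2020-10-26 | 2021-02-26 | Dalian Minzu University | Method for predicting non-coding RNA-protein interactions based on secondary structure and multi-model fusion |
| CN113313167A (zh) * | 2021-05-28 | 2021-08-27 | Hunan University of Technology | Deep-learning-based method for predicting lncRNA-protein interactions with a dual neural network structure |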
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20230063188A1 (en) * | 2020-01-02 | 2023-03-02 | Oncocross Co., Ltd. | Method, apparatus, and computer program for predicting interaction of compound and protein |
2021
- 2021-09-27 WO PCT/CN2021/121089 patent/WO2023044927A1/zh not_active Ceased
- 2021-09-27 CN CN202180002692.1A patent/CN116490926A/zh active Pending
- 2021-09-27 US US17/915,391 patent/US20250054571A1/en active Pending
Non-Patent Citations (2)
| Title |
|---|
| SHUPING CHENG, JIANJUN TAN, JINGRUI MEN: "Prediction of ncRNA-protein interactions based on machine learning methods", BEIJING BIOMEDICAL ENGINEERING, vol. 38, no. 4, 1 August 2019 (2019-08-01), XP093053645 * |
| ZHU, MIN: "Prediction of Protein-protein Interactions Based on Ensemble Learning", JOURNAL OF SICHUAN UNIVERSITY (ENGINEERING SCIENCE EDITION), vol. 43, no. 3, 20 May 2011 (2011-05-20), XP093053655 * |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
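| CN119252333A (zh) * | 2023-07-03 | 2025-01-03 | Beijing DP Technology Co., Ltd. | Processing method and apparatus for a mean ribosome load prediction system |
| WO2025007616A1 (zh) * | 2023-07-03 | 2025-01-09 | Beijing DP Technology Co., Ltd. | Processing method and apparatus for a mean ribosome load prediction system |
| WO2025066631A1 (zh) * | 2023-09-27 | 2025-04-03 | BOE Technology Group Co., Ltd. | Training of a relationship prediction model, and method and apparatus for predicting interaction relationships |
| CN119049566A (zh) * | 2024-07-02 | 2024-11-29 | Zhejiang University Yangtze River Delta Smart Oasis Innovation Center | Nucleic acid sequence feature mining method based on an autoregressive large model |
| CN119049566B (zh) * | 2024-07-02 | 2025-05-16 | Zhejiang University Yangtze River Delta Smart Oasis Innovation Center | Nucleic acid sequence feature mining method based on an autoregressive large model |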
Also Published As
| Publication number | Publication date |
|---|---|
| US20250054571A1 (en) | 2025-02-13 |
| CN116490926A (zh) | 2023-07-25 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| US12367346B2 (en) | Natural language processing with k-NN | |
| CN113535984A (zh) | Attention-mechanism-based knowledge graph relation prediction method and apparatus | |
| KR102092263B1 (ko) | Method for finding k extreme values within a constant processing time | |
| WO2023044927A1 (zh) | Method and apparatus for predicting RNA-protein interaction, medium and electronic device | |
| CN116340839B (zh) | Algorithm selection method and apparatus based on the ant lion algorithm | |
| CN113537304A (zh) | Cross-modal semantic clustering method based on bidirectional CNN | |
| WO2023044931A1 (zh) | Method and apparatus for predicting RNA-protein interaction, medium and electronic device | |
| CN114417058B (zh) | Video material screening method and apparatus, computer device, and storage medium | |
| Cheng et al. | TreeNet: Learning Sentence Representations with Unconstrained Tree Structure. | |
| Meng et al. | Classifier ensemble selection based on affinity propagation clustering | |
| Ozmen et al. | Multi-relation message passing for multi-label text classification | |
| CN116529828A (zh) | Method and apparatus for predicting RNA-protein interaction, medium and electronic device | |
| US20240265993A1 (en) | Method for training vector model and generating negative sample | |
| CN116127201A (zh) | Large-scale user recommendation method based on evolutionary multitasking | |
| Liang et al. | RETRACTED ARTICLE: Incremental deep forest for multi-label data streams learning | |
| KR102608683B1 (ko) | Natural language processing using KNN | |
| WO2023050204A1 (zh) | Method and apparatus for predicting RNA-protein interaction, medium and electronic device | |
| Xiu et al. | Prediction method for lysine acetylation sites based on LSTM network | |
| Chauleva et al. | Secondary structure prediction of RNA using convolutional neural networks | |
| Liu et al. | A Drug Sales Forecasting Method Based on Multiple Models and Parameter Optimization | |
| CN118964965B (zh) | Artificial-intelligence-based method and system for optimizing a digital teaching resource library | |
| CN119557730B (zh) | Partial-label-oriented multimodal learning method, system, and storage medium | |
| US20250348681A1 (en) | Natural language processing with knn | |
| Liu | Extracting Rules from Trained Machine Learning Models with Applications in Bioinformatics | |
| CN117688242A (zh) | Session recommendation method and system enhanced with semantic information and intent |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | WWE | Wipo information: entry into national phase | Ref document number: 202180002692.1; Country of ref document: CN |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21958045; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12/07/2024) |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 21958045; Country of ref document: EP; Kind code of ref document: A1 |