CN113571125A

CN113571125A - Drug target interaction prediction method based on multilayer network and graph coding

Info

Publication number: CN113571125A
Application number: CN202110865457.9A
Authority: CN
Inventors: 刘闯; 王逸伟; 詹秀秀; 张子柯
Original assignee: Hangzhou Normal University
Current assignee: Hangzhou Normal University
Priority date: 2021-07-29
Filing date: 2021-07-29
Publication date: 2021-10-29

Abstract

The invention discloses a drug-target interaction prediction method based on a multilayer network and graph coding. The method comprises a data acquisition module, a data preprocessing module, a feature learning module, a model algorithm design module, and a result evaluation module. The data preprocessing module constructs the drug and protein networks and processes the heterogeneous graphs. The feature learning module performs self-supervised training of structural graph encoders, encodes the graphs into vectors, and processes the vectors of same-class nodes, so that the topological information of each graph is expressed in vector form. The model algorithm design module constructs the cross-validation sets and designs the prediction model. The result evaluation module verifies the prediction performance of the model with the ROC curve, derived from the confusion matrix, and the PR curve, derived from the sequence of precision and recall values. The method studies drugs and targets from the perspective of data mining and graphs, and predicts the interaction between them from the learned graph-structure information and a subsequent tree model.

Description

Drug target interaction prediction method based on multilayer network and graph coding

Technical Field

The invention belongs to the technical field of data mining, and particularly relates to a drug-target interaction prediction method based on a multilayer network and graph coding.

Background

With the rapid development of machine learning, advances in biological detection technologies such as third-generation gene sequencing, and the arrival of a big-data era driven by the rapid growth of biological data, more and more researchers and companies are turning to AI-assisted drug development. The most direct advantage of using computer algorithms to assist target screening is that candidate drugs can be screened computationally and the candidate range narrowed, which greatly shortens the new-drug discovery cycle and reduces the consumables it requires. Practical application data indicate that AI technology can reduce drug development costs by about 35%. An analysis of the net income of leading international pharmaceutical companies in recent years shows that most of them increased their net income, to varying degrees, after introducing AI-assisted drug research and development. AI technology can also perform multi-target analysis of a drug to predict its multiple targets, thereby revealing the complex mechanisms of action behind some diseases. In addition, it can improve the accuracy and safety of drug prediction and help identify the mechanisms behind drug side effects. Overall, AI technology can therefore greatly simplify the development of new drugs, save research and development costs, and help pharmaceutical companies develop new drugs quickly.

Disclosure of Invention

The invention aims to provide a method for predicting the interaction of drug targets based on a multilayer network and graph coding, which can eliminate the randomness of clinical experiments, narrow the screening range and accelerate the test period.

The invention constructs nine drug-related networks (the drug interaction network, drug-disease association network, drug-side-effect association network, drug chemical similarity network, drug therapeutic similarity network, drug target sequence similarity network, drug biological process similarity network, drug molecular function similarity network, and drug cellular component similarity network), six target-related networks (the target interaction network, target-disease association network, target sequence similarity network, target biological process similarity network, target cellular component similarity network, and target molecular function similarity network), and a drug-target interaction network that serves as the label. A structural autoencoder is trained separately on each network, the trained autoencoders encode the nodes into vectors, and the encoded vectors of a node from the different networks are concatenated to form its final feature vector. The drug-target pairs to be predicted are then fed into a trained boosted tree model (obtained by additively combining a series of decision trees built on the training set) to obtain the final evaluation score.

The method comprises a data acquisition module, a data preprocessing module, a feature learning module, a model algorithm design module and a result evaluation module.

(1) The data acquisition module comprises:

(1-1) for drugs, collecting drug-drug interaction data, drug-disease relationship data, drug-side-effect relationship data, and six types of data from which drug-pair similarity is derived: chemical fingerprint data of the drug, therapeutic data of the drug, peptide chain data of the drug's targets, biological process data of the drug, molecular function data of the drug, and cellular component data of the drug;

(1-2) for targets, namely proteins, collecting target-target interaction data, target-disease relationship data, and four types of data from which target-pair similarity is derived: peptide chain data, biological process data, cellular component data, and molecular function data of the target;

(1-3) collecting drug-target interaction data.

(2) The data preprocessing module constructs the drug- and target-related networks and generates the multilayer networks;

(2-1) constructing the drug- and target-related networks comprises:

A. for interaction data between objects of a single class, constructing homogeneous interaction networks, including the drug interaction network G_1D and the target interaction network G_1T;

B. for interaction data between objects of different classes, constructing heterogeneous interaction networks, including the drug-disease association network G_{D_DI}, the drug-side-effect association network G_{D_SE}, and the target-disease association network G_{T_DI};

C. collecting drug information of different dimensions and constructing drug similarity networks, including the drug chemical similarity network G_2D, the drug therapeutic similarity network G_3D, the drug target sequence similarity network G_4D, the drug biological process similarity network G_5D, the drug molecular function similarity network G_6D, and the drug cellular component similarity network G_7D;

D. collecting target information of different dimensions and constructing target similarity networks, including the target sequence similarity network G_2T, the target biological process similarity network G_3T, the target cellular component similarity network G_4T, and the target molecular function similarity network G_5T;

E. constructing the drug-target interaction network G_{D_T}.

(2-2) generating the multilayer networks comprises generating a drug multilayer network and a target multilayer network, with the following specific steps:

(2-2-1) first, the drug-disease association network G_{D_DI} is decomposed and converted into the drug disease similarity network G_8D＝(V_8D,E_8D), where V_8D and E_8D denote the set of drug nodes in the network and the set of edge weights given by the disease similarity between two drugs, respectively; the edge weight of drug disease similarity is the cosine similarity

$$\frac{x_{D_M} \cdot y_{D_M}}{\lVert x_{D_M} \rVert \, \lVert y_{D_M} \rVert}$$

where x_{D_M} and y_{D_M} denote the row vectors of the two drugs in the adjacency matrix of G_{D_DI}, and ‖·‖ denotes the vector norm;

the drug-side-effect association network G_{D_SE} is decomposed and converted into the drug side-effect similarity network G_9D＝(V_9D,E_9D), where V_9D and E_9D denote the set of drug nodes in the network and the set of edge weights given by the side-effect similarity between two drugs, respectively; the edge weight of drug side-effect similarity is the cosine similarity

$$\frac{x_{D_SE} \cdot y_{D_SE}}{\lVert x_{D_SE} \rVert \, \lVert y_{D_SE} \rVert}$$

where x_{D_SE} and y_{D_SE} denote the row vectors of the two drugs in the adjacency matrix of G_{D_SE};

the target-disease association network G_{T_DI} is decomposed and converted into the target disease similarity network G_6T＝(V_6T,E_6T), where V_6T and E_6T denote the set of target nodes in the network and the set of edge weights given by the disease similarity between two targets, respectively; the edge weight of target disease similarity is the cosine similarity

$$\frac{x_{T_DI} \cdot y_{T_DI}}{\lVert x_{T_DI} \rVert \, \lVert y_{T_DI} \rVert}$$

where x_{T_DI} and y_{T_DI} denote the row vectors of the two targets in the adjacency matrix of G_{T_DI};

(2-2-2) then the drug-related networks are combined into the drug multilayer network G_D＝{G_iD＝(V_iD,E_iD)}, where i is the drug network index, i∈[1,9]; the target-related networks are combined into the target multilayer network G_T＝{G_jT＝(V_jT,E_jT)}, where j is the target network index, j∈[1,6].

(3) The feature learning module comprises training structural autoencoders, encoding output, and processing same-class feature vectors;

(3-1) training the structural autoencoders: one structural autoencoder is trained for each layer of the drug multilayer network G_D and the target multilayer network G_T;

(3-2) encoding output: the encoder of each trained structural autoencoder encodes its corresponding network layer, yielding the multilayer vectors of all drugs and targets;

(3-3) processing same-class feature vectors: the multilayer vectors of a drug are concatenated to obtain the final feature vector of that drug, and the multilayer vectors of a target are concatenated to obtain the final feature vector of that target.

(4) The model algorithm design module comprises constructing training samples, training and evaluating the model, and predicting drug-target interactions;

(4-1) constructing training samples: training samples are constructed with a PairWise model; the data are randomly divided into M parts for M-fold cross-validation, that is, each time one part is selected as the validation set and the rest as the training set, and the model parameters are tuned according to the overall cross-validation performance, where M is a positive integer greater than 3;

(4-2) training and evaluating the model: a boosted tree is built with a lightweight gradient boosting decision tree that uses decision trees as weak learners, that is, decision trees T(x, θ_l) are built iteratively, where x and θ_l are the input feature vector and the learnable parameters of the l-th decision tree, respectively;

(4-3) predicting drug-target interactions: using the optimal prediction model obtained by the result evaluation module, the interaction probability of every drug-target pair is calculated, and the pairs with high probability are selected as candidate interacting drug-target pairs, which constitute the prediction result.

(5) The result evaluation module verifies the prediction performance of the model with the ROC curve and the PR curve, as follows:

(5-1) plotting the ROC curve: the false positive rate FPR is taken as the horizontal axis and the true positive rate TPR as the vertical axis; the larger the area under the ROC curve, AUROC, the better the prediction performance of the model;

the true positive rate TPR_α and the false positive rate FPR_α of the ROC curve are calculated from the confusion matrix as follows:

$$TPR_\alpha = \frac{TP_\alpha}{TP_\alpha + FN_\alpha}, \qquad FPR_\alpha = \frac{FP_\alpha}{FP_\alpha + TN_\alpha}$$

a drug-target pair with an interaction is a positive sample, and a pair without one is a negative sample; TP_α is the number of positive samples in the test set predicted as positive, FP_α is the number of negative samples in the test set predicted as positive, FN_α is the number of positive samples in the test set predicted as negative, and TN_α is the number of negative samples in the test set predicted as negative; α denotes the prediction confidence;

(5-2) plotting the PR curve: the precision precision_α and the recall recall_α at different prediction confidences α form the precision-recall sequence:

$$\mathrm{precision}_\alpha = \frac{TP_\alpha}{TP_\alpha + FP_\alpha}, \qquad \mathrm{recall}_\alpha = \frac{TP_\alpha}{TP_\alpha + FN_\alpha}$$

the precision-recall curve, namely the PR curve, is plotted with recall on the horizontal axis and precision on the vertical axis; the area under the PR curve, AUPR, reflects the overall classification performance of the classifier, and the larger the AUPR, the better the prediction performance of the model;

(5-3) evaluating the model: based on the prediction results of step (4-3), AUROC and AUPR are calculated from the plotted ROC and PR curves, and the model parameters that give the best prediction result are searched for.

The method studies drug-target interactions from the perspectives of data mining and multilayer networks. By constructing networks, it abstracts different types of data into the same data structure, and it achieves drug-target prediction by combining the decomposition of heterogeneous networks, structural autoencoders that automatically learn network topology, tree-based classifiers, and related methods. The method can therefore analyze drug and target data effectively and predict the interactions between drugs and targets, providing scientific guidance for new-drug research and development, improving its efficiency, and, to a certain extent, promoting independent innovation in medicine.

Drawings

FIG. 1 is a schematic flow diagram of the process of the present invention.

Detailed Description

The following detailed description of the embodiments of the invention is provided in connection with the accompanying drawings.

The available data cover 732 drugs and 1915 targets (proteins), together with 12904 associated side effects and 440 associated diseases. They comprise drug-drug, drug-disease, drug-side-effect, target-target, and target-disease interaction data, MACCS fingerprint data of the drugs' chemical formulas, GO annotations of the drugs and targets, protein sequence data of the targets, and half-maximal inhibitory concentration data between drugs and targets.

As shown in fig. 1, a method for predicting drug target interaction based on multilayer network and graph coding comprises a data acquisition module, a data preprocessing module, a feature learning module, a model algorithm design module, and a result evaluation module, and specifically comprises the following steps:

(1) a data acquisition module comprising:

(1-3) collecting drug-target interaction data;

the above data are downloaded from public websites.

(2) The data preprocessing module constructs the drug- and target-related networks and generates the multilayer networks, providing the data basis for drug-target prediction, specifically as follows:

(2-1) constructing the drug- and target-related networks, comprising:

(I) for drug-drug interaction data, constructing the drug interaction network G_1D＝(V_1D,E_1D), where V_1D denotes the set of drug nodes in the network and E_1D denotes the set of edges between pairs of drugs that interact;

for target-target interaction data, constructing the target interaction network G_1T＝(V_1T,E_1T), where V_1T denotes the set of target nodes in the network and E_1T denotes the set of edges between pairs of targets that interact;

(II) for drug-disease relationship data, constructing the drug-disease association network G_{D_DI}, which consists of the set of drug nodes, the set of disease nodes, and the edge set E_{D_DI} of relations between drugs and diseases in the network;

for drug-side-effect relationship data, constructing the drug-side-effect association network G_{D_SE}, which consists of the set of drug nodes, the set of side-effect nodes, and the edge set E_{D_SE} of relations between drugs and side effects in the network;

for target-disease relationship data, constructing the target-disease association network G_{T_DI}, which consists of the set of target nodes, the set of disease nodes, and the edge set E_{T_DI} of relations between targets and diseases in the network;

(III) for the chemical fingerprint data of the drugs, constructing the drug chemical similarity network G_2D＝(V_2D,E_2D), where V_2D and E_2D denote the set of drug nodes in the network and the set of edge weights given by the chemical similarity between two drugs, respectively; the edge weight of chemical similarity is computed from a₁, b₁, and c₁, where a₁ and b₁ are the numbers of bits in the MACCS fingerprints of the two drugs and c₁ is the number of bits the two drugs have in common;

for the therapeutic data of the drugs, constructing the drug therapeutic similarity network G_3D＝(V_3D,E_3D), where V_3D and E_3D denote the set of drug nodes in the network and the set of edge weights given by the therapeutic similarity between two drugs, respectively; the edge weight of therapeutic similarity is computed from a₂, b₂, and c₂, where a₂ and b₂ are the ATC codes of the two drugs and c₂ is the number of digits on which the two ATC codes agree;

for the peptide chain data of the drugs' targets, constructing the drug target sequence similarity network G_4D＝(V_4D,E_4D), where V_4D and E_4D denote the set of drug nodes in the network and the set of edge weights given by the target similarity between two drugs, respectively; the edge weight of drug target similarity is obtained by averaging T_{T_T}(a, b), where a and b denote the respective targets of the two drugs, T_{T_T}(a, b) denotes the sequence similarity of those targets, and mean(·) denotes the mean;

for the biological process data of the drugs, constructing the drug biological process similarity network G_5D＝(V_5D,E_5D), where V_5D and E_5D denote the set of drug nodes in the network and the set of edge weights given by the biological process similarity between two drugs, respectively; the edge weight of drug biological process similarity is computed from T_{T_P}(a, b), the biological process similarity of the respective targets of the two drugs;

for the molecular function data of the drugs, constructing the drug molecular function similarity network G_6D＝(V_6D,E_6D), where V_6D and E_6D denote the set of drug nodes in the network and the set of edge weights given by the molecular function similarity between two drugs, respectively; the edge weight of drug molecular function similarity is computed from T_{T_M}(a, b), the molecular function similarity of the respective targets of the two drugs;

for the cellular component data of the drugs, constructing the drug cellular component similarity network G_7D＝(V_7D,E_7D), where V_7D and E_7D denote the set of drug nodes in the network and the set of edge weights given by the cellular component similarity between two drugs, respectively; the edge weight of drug cellular component similarity is computed from T_{T_C}(a, b), the cellular component similarity of the respective targets of the two drugs;

(IV) for the peptide chain data of the targets, constructing the target sequence similarity network G_2T＝(V_2T,E_2T), where V_2T and E_2T denote the set of target nodes in the network and the set of edge weights given by the sequence similarity between two targets, respectively; the edge weight of sequence similarity is computed from a₃, b₃, and c₃, where a₃ and b₃ are the numbers of positions in the peptide chain sequences of the two targets and c₃ is the number of positions at which the two sequences are identical;

for the biological process data of the targets, constructing the target biological process similarity network G_3T＝(V_3T,E_3T), where V_3T and E_3T denote the set of target nodes in the network and the set of edge weights given by the biological process similarity between two targets, respectively; the edge weight T_{T_P}(a, b) of target biological process similarity is derived from the GO semantic annotations of the biological processes of the two targets;

for the cellular component data of the targets, constructing the target cellular component similarity network G_4T＝(V_4T,E_4T), where V_4T and E_4T denote the set of target nodes in the network and the set of edge weights given by the cellular component similarity between two targets, respectively; the edge weight T_{T_C}(a, b) of target cellular component similarity is derived from the GO semantic annotations of the cellular components of the two targets;

for the molecular function data of the targets, constructing the target molecular function similarity network G_5T＝(V_5T,E_5T), where V_5T and E_5T denote the set of target nodes in the network and the set of edge weights given by the molecular function similarity between two targets, respectively; the edge weight T_{T_M}(a, b) of target molecular function similarity is derived from the GO semantic annotations of the molecular functions of the two targets;

(V) for drug-target interaction data, constructing the drug-target interaction network G_{D_T}, which consists of the set of drug nodes, the set of target nodes, and the edge set E_{D_T} of relations between drugs and targets in the network.
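As a minimal sketch of two of the edge weights described in step (III), the Python below makes two assumptions that are not confirmed by the text: that the chemical similarity built from a₁, b₁, and c₁ is a Tanimoto coefficient c₁ / (a₁ + b₁ − c₁) over the set bits of the MACCS fingerprints, and that the drug-level target similarity is the plain mean of T_{T_T}(a, b) over all pairs of the two drugs' targets. The function names `tanimoto` and `mean_target_similarity`, the lookup helper, and the toy data are all hypothetical.

```python
from itertools import product
from statistics import mean


def tanimoto(bits_a, bits_b):
    """Assumed Tanimoto coefficient c / (a + b - c) on sets of 'on' bit positions."""
    a, b = len(bits_a), len(bits_b)
    c = len(bits_a & bits_b)          # bits shared by both fingerprints
    denom = a + b - c
    return c / denom if denom else 0.0


def mean_target_similarity(targets_a, targets_b, target_sim):
    """Average a target-level similarity over all pairs (a, b) of two drugs' targets."""
    pairs = list(product(targets_a, targets_b))
    if not pairs:
        return 0.0
    return mean(target_sim(a, b) for a, b in pairs)


# Toy fingerprints (hypothetical), stored as sets of set-bit indices.
fp_x = {1, 4, 7, 9}
fp_y = {4, 7, 8}
chem_weight = tanimoto(fp_x, fp_y)    # a=4, b=3, c=2

# Toy symmetric target-similarity table (hypothetical values).
sim_table = {("P1", "P2"): 0.5, ("P1", "P3"): 0.25}


def lookup(a, b):
    """Symmetric lookup; identical targets have similarity 1."""
    if a == b:
        return 1.0
    return sim_table.get((a, b), sim_table.get((b, a), 0.0))


target_weight = mean_target_similarity(["P1"], ["P2", "P3"], lookup)
```

Other fingerprint similarities (for example Dice) would fit the same a, b, c definitions, so the choice of Tanimoto here is only illustrative.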

(2-2) generating a multilayer network, including generating a drug multilayer network and generating a target multilayer network:

(2-2-1) the drug-disease association network G_{D_DI} is decomposed and converted into the drug disease similarity network G_8D＝(V_8D,E_8D), where V_8D and E_8D denote the set of drug nodes in the network and the set of edge weights given by the disease similarity between two drugs, respectively; the edge weight of drug disease similarity is the cosine similarity

$$\frac{x_{D_M} \cdot y_{D_M}}{\lVert x_{D_M} \rVert \, \lVert y_{D_M} \rVert}$$

where x_{D_M} and y_{D_M} denote the row vectors of the two drugs in the adjacency matrix of G_{D_DI}, and ‖·‖ denotes the vector norm;

the drug-side-effect association network G_{D_SE} is decomposed and converted into the drug side-effect similarity network G_9D＝(V_9D,E_9D), where V_9D and E_9D denote the set of drug nodes in the network and the set of edge weights given by the side-effect similarity between two drugs, respectively; the edge weight of drug side-effect similarity is the cosine similarity

$$\frac{x_{D_SE} \cdot y_{D_SE}}{\lVert x_{D_SE} \rVert \, \lVert y_{D_SE} \rVert}$$

where x_{D_SE} and y_{D_SE} denote the row vectors of the two drugs in the adjacency matrix of G_{D_SE};

the target-disease association network G_{T_DI} is decomposed and converted into the target disease similarity network G_6T＝(V_6T,E_6T), where V_6T and E_6T denote the set of target nodes in the network and the set of edge weights given by the disease similarity between two targets, respectively; the edge weight of target disease similarity is the cosine similarity

$$\frac{x_{T_DI} \cdot y_{T_DI}}{\lVert x_{T_DI} \rVert \, \lVert y_{T_DI} \rVert}$$

where x_{T_DI} and y_{T_DI} denote the row vectors of the two targets in the adjacency matrix of G_{T_DI};

(2-2-2) the drug interaction network, the drug disease similarity network, the drug side-effect similarity network, the drug chemical similarity network, the drug therapeutic similarity network, the drug target sequence similarity network, the drug biological process similarity network, the drug molecular function similarity network, and the drug cellular component similarity network are combined into the drug multilayer network G_D＝{G_iD＝(V_iD,E_iD)}, where i is the drug network index, i∈[1,9];

the target interaction network, the target disease similarity network, the target sequence similarity network, the target biological process similarity network, the target cellular component similarity network, and the target molecular function similarity network are combined into the target multilayer network G_T＝{G_jT＝(V_jT,E_jT)}, where j is the target network index, j∈[1,6].
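The decomposition in step (2-2-1) can be sketched as follows, assuming the edge weight is the cosine similarity of two rows of the bipartite adjacency matrix (the text mentions row vectors and the vector norm). The helper names, the toy matrix, and the choice to leave the diagonal at zero (no self-loops) are assumptions for illustration only.

```python
import math


def cosine(x, y):
    """Cosine similarity of two row vectors; 0.0 if either row is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def project_to_similarity(adjacency):
    """Convert a bipartite adjacency matrix (one row per drug or target)
    into a square, symmetric similarity matrix between the row entities."""
    n = len(adjacency)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = cosine(adjacency[i], adjacency[j])
    return sim


# Toy drug-disease adjacency matrix: 3 drugs x 4 diseases (hypothetical values).
drug_disease = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
G_8D = project_to_similarity(drug_disease)
```

Drugs 0 and 1 share one disease and therefore receive a nonzero weight, while drugs 0 and 2 share none and stay unconnected.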

(3) A feature learning module:

In machine learning problems, the data and features determine the upper limit of the prediction result, and models and algorithms can only approach that limit. The feature learning module of the invention addresses the first half of this statement, namely feature selection, so that the model algorithm can learn the features better and reach the most accurate prediction result possible. Based on the drug multilayer network G_D and the target multilayer network G_T, the module uses structural autoencoders to encode the network structure automatically, thereby ensuring the completeness of feature extraction.

(3-1) training the structural autoencoders: one structural autoencoder is trained for each layer of the drug multilayer network G_D and the target multilayer network G_T, and the training process is as follows:

a. using the adjacency matrix of the single-layer network as the input of the encoder;

b. after encoding, the output of the encoder is obtained and is used as the input of the decoder;

c. decoding to obtain the output of a decoder, and calculating a loss function by using the adjacency matrix, the output of the encoder and the output of the decoder;

d. calculating the gradient of each parameter of the encoder and the decoder by using a loss function, updating the parameters, wherein the updating step length is a multiple of the negative gradient;

e. repeating steps b through d until the loss function converges.

The loss function L_m consists of two parts:

First-order similarity loss

$$L_{1st} = \sum_{p=1}^{N} \sum_{g=1}^{N} T_{pg} \, \lVert z_p - z_g \rVert_2^2$$

where N is the number of nodes, z_p and z_g denote the encoder output vectors of node p and node g, respectively, and T_pg denotes the weight of the edge between them; for an interaction network, T_pg can only take the values 0 and 1, meaning no edge and an edge, respectively; for a similarity network, T_pg can take any value between 0 and 1, inclusive. This loss is defined so that drugs or targets with high similarity are encoded into feature vectors that are as similar as possible.

Second-order similarity loss

$$L_{2nd} = \sum_{n=1}^{N} \lVert \hat{b}_n - b_n \rVert_2^2$$

where b_n and \hat{b}_n denote the encoder input vector and the decoder output vector of node n, respectively. This loss is defined so that the decoder reconstructs the original input vector from the encoded vector as closely as possible, so that the encoded vector retains as much of the information in the original vector as possible.

Total loss function L_m＝L_2nd+λL_1st, where λ is a penalty coefficient, 0 < λ < 1.
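The sketch below evaluates a loss of this shape on toy data. It assumes squared Euclidean distances in both terms, that is, a first-order term summing T_pg·‖z_p − z_g‖² over all node pairs and a second-order term summing ‖b̂_n − b_n‖² over nodes, which matches the stated purpose of each term but is not confirmed as the exact formula of the source. The function names and numbers are hypothetical, and the encoder input rows are taken to be the adjacency rows, as in training step a.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def first_order_loss(T, Z):
    """Penalize distant codes z_p, z_g in proportion to the edge weight T[p][g]."""
    n = len(Z)
    return sum(T[p][g] * sq_dist(Z[p], Z[g]) for p in range(n) for g in range(n))


def second_order_loss(B, B_hat):
    """Reconstruction error between encoder inputs B and decoder outputs B_hat."""
    return sum(sq_dist(b, b_hat) for b, b_hat in zip(B, B_hat))


def total_loss(T, Z, B_hat, lam=0.5):
    """L_m = L_2nd + lam * L_1st, with 0 < lam < 1."""
    return second_order_loss(T, B_hat) + lam * first_order_loss(T, Z)


# Toy two-node interaction network (hypothetical values).
T = [[0, 1], [1, 0]]                  # adjacency matrix, also the encoder input
Z = [[0.0, 0.0], [1.0, 0.0]]          # encoder outputs (codes) of the two nodes
B_hat = [[0, 1], [1, 0.5]]            # decoder reconstructions of the rows of T
loss = total_loss(T, Z, B_hat, lam=0.5)
```

In practice the codes Z and reconstructions B_hat would come from a trainable encoder and decoder, and this scalar would drive the gradient updates of steps d and e.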

(3-2) encoding output: the encoder of each trained structural autoencoder encodes its corresponding network layer, yielding the multilayer vectors of all drugs and targets.

(3-3) processing same-class feature vectors:

the multilayer vectors of a drug are concatenated to obtain the final feature vector of that drug;

the multilayer vectors of a target are concatenated to obtain the final feature vector of that target.
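A minimal illustration of the concatenation step, assuming each layer's encoder output is stored as a dictionary from node name to vector; the storage layout, the function name `concat_layers`, and the toy values are hypothetical.

```python
def concat_layers(layer_embeddings, node):
    """Concatenate one node's encoded vector from every layer, in layer order."""
    feature = []
    for layer in layer_embeddings:
        feature.extend(layer[node])
    return feature


# Toy encoder outputs for two drug layers (hypothetical values and dimensions).
drug_layers = [
    {"D1": [0.1, 0.2], "D2": [0.3, 0.4]},   # e.g. codes from layer G_1D
    {"D1": [0.5], "D2": [0.6]},             # e.g. codes from layer G_2D
]
feature_D1 = concat_layers(drug_layers, "D1")
feature_D2 = concat_layers(drug_layers, "D2")
```

The final dimension of a drug's feature vector is the sum of the code dimensions of its nine layers; the same holds for targets with their six layers.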

(4) A model algorithm design module comprising:

(4-1) constructing training samples: drug-target pairs comprise verified pairs and unverified pairs, and the unverified pairs include pairs that do interact but have not yet been discovered. The invention aims to find such undiscovered interacting pairs among the unverified ones. It can therefore be assumed that the probability that an unverified drug-target pair interacts is no greater than that of a verified interacting pair. Based on this assumption, training samples are constructed with a PairWise model: a positive sample is drawn from the verified interacting drug-target pairs, a negative sample is drawn from the unverified drug-target pairs, and training samples are built from the corresponding positive and negative samples, yielding paired positive and negative training sets of equal size. The data are then randomly divided into M parts for M-fold cross-validation, that is, each time one part is selected as the validation set and the rest as the training set, and the model parameters are tuned according to the overall cross-validation performance, where M is a positive integer greater than 3.
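The sampling and splitting described in step (4-1) might look like the sketch below. It assumes that negatives are drawn uniformly at random, without replacement, from the unverified pairs, and that there are at least as many unverified pairs as verified ones; the function names, the fixed seed, and the toy pairs are hypothetical.

```python
import random


def build_pairwise_samples(verified, unverified, seed=0):
    """Pair every verified (positive) pair with one randomly drawn unverified
    (negative) pair, giving equally many positive and negative samples."""
    rng = random.Random(seed)
    negatives = rng.sample(unverified, len(verified))
    samples = [(pair, 1) for pair in verified] + [(pair, 0) for pair in negatives]
    rng.shuffle(samples)
    return samples


def m_fold_splits(samples, m):
    """Yield (training, validation) lists for M-fold cross-validation."""
    folds = [samples[i::m] for i in range(m)]
    for k in range(m):
        training = [s for i, fold in enumerate(folds) if i != k for s in fold]
        yield training, folds[k]


# Toy data (hypothetical): 4 verified pairs out of 12 possible drug-target pairs.
verified = [("D1", "T1"), ("D2", "T3"), ("D3", "T2"), ("D4", "T1")]
unverified = [(d, t) for d in ("D1", "D2", "D3", "D4") for t in ("T1", "T2", "T3")
              if (d, t) not in verified]
samples = build_pairwise_samples(verified, unverified)
splits = list(m_fold_splits(samples, m=4))   # M must be greater than 3
```

Each sample here is only a labeled (drug, target) pair; the feature vector of a pair would be built from the concatenated drug and target vectors of step (3-3).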

(4-2) training and evaluating the model: a boosted tree is built with a lightweight gradient boosting decision tree that uses decision trees as weak learners, that is, decision trees T(x, θ_l) are built iteratively, where x and θ_l are the input feature vector and the learnable parameters of the l-th decision tree, respectively. The specific process is as follows:

(4-2-1) before each round of decision tree construction, the samples are screened with the gradient-based one-side sampling (GOSS) algorithm, that is, the small fraction of samples with large gradients is retained and part of the samples with small gradients is randomly selected to compute the total variance gain, which reduces the number of samples;

(4-2-2) before each round of decision tree construction, mutually exclusive features are merged with the exclusive feature bundling (EFB) algorithm, which reduces the feature dimension;

(4-2-3) based on the screened samples, constructing the fitting target of the l-th decision tree for a sample with input feature vector x and corresponding label y: when l = 1, the fitting target is the label of the sample, where the label of a positive sample is 1 and the label of a negative sample is 0; when l ≥ 2, the fitting target is the negative gradient

$$-\frac{\partial L(y, H_{l-1}(x))}{\partial H_{l-1}(x)}$$

where $H_{l-1}(x) = \sum_{k=1}^{l-1} T(x, \theta_k)$ is the boosted tree obtained after the (l−1)-th iteration and L is the loss function, which in the binary classification task is defined for a single sample (x, y) in terms of its predicted value.

(4-2-4) based on the screened samples and the fitting targets, constructing a binary decision tree, in which a leaf node is split as follows: a histogram is built for each retained feature over its value range, the variance gain of each division point is calculated from the histogram, the feature and division point with the largest variance gain are selected as the splitting feature and optimal division point of the current node, and the data in the leaf node are divided into two parts at the optimal division point; the recursion continues until the maximum depth of the tree is reached. The variance gain of feature f at division point d on dataset D is computed from the following quantities:

x_l, x_l,f, and g_l denote the l-th sample vector, the f-th feature of the l-th sample vector, and its negative gradient, respectively, and the two sample counts are the numbers of samples in dataset D whose feature f is smaller than and larger than the division point d, respectively.

(4-2-5) performing K rounds of iteration to generate K decision trees;

(4-2-6) adding the K decision trees to generate the final lightweight gradient boosting decision tree H(x).

For the input feature vector x of a sample, the output H(x)∈[0,1] can be interpreted as the probability that the input sample is a positive sample;
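The sample screening of step (4-2-1) and the split search of step (4-2-4) can be sketched as below. The sketch follows the published LightGBM formulation as an assumption: GOSS keeps the top fraction `top_rate` of samples by absolute gradient, draws a fraction `other_rate` of the rest and re-weights them by (1 − top_rate) / other_rate, and the variance gain of a split is [(Σ g over left)² / n_left + (Σ g over right)² / n_right] / n. Histogram binning and the use of the GOSS weights inside the gain are omitted for brevity, and all function and parameter names are hypothetical.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights): every large-gradient sample plus a random
    subset of small-gradient samples, the latter amplified to stay unbiased."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top, n_other = int(top_rate * n), int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    picked = random.Random(seed).sample(rest, n_other)
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in picked})
    return top + picked, weights


def variance_gain(values, gradients, d):
    """Variance gain of splitting one feature at division point d."""
    left = [g for v, g in zip(values, gradients) if v <= d]
    right = [g for v, g in zip(values, gradients) if v > d]
    if not left or not right:
        return 0.0
    return (sum(left) ** 2 / len(left) + sum(right) ** 2 / len(right)) / len(values)


def best_split(values, gradients):
    """Division point with the largest variance gain (None if no split exists)."""
    candidates = sorted(set(values))[:-1]
    return max(candidates, key=lambda d: variance_gain(values, gradients, d),
               default=None)


# Toy feature column and negative gradients (hypothetical values).
feature_values = [1, 2, 3, 4]
neg_gradients = [1.0, 1.0, -1.0, -1.0]
split_point = best_split(feature_values, neg_gradients)

kept, kept_weights = goss_sample([0.1 * i for i in range(10)])
```

On the toy column the best division point separates the two positive gradients from the two negative ones, which is the split that makes each child most homogeneous.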

(5-1) plotting ROC curves: plotting the ROC curve requires generating a confusion matrix, which is also an index for evaluating the model results, is part of the model evaluation, and is represented in the form of a square matrix, displaying the accuracy of the prediction results in a confusion matrix, each column representing the prediction category, the total number of each column representing the number of data predicted as the category, each row representing the true attribution category of data, and the total number of each row representing the number of data instances of the category.

The ROC curve is a new classification model performance evaluation method introduced from the field of medical analysis and is suited to binary classification problems. When the ROC curve is drawn, the false positive rate FPR is taken as the horizontal axis and the true positive rate TPR as the vertical axis; the larger the area covered by the ROC curve, AUROC, that is, the closer it is to 1, the better the prediction effect of the model.

in the context of drug target prediction, a drug-target pair with an interaction is a positive sample and a pair without an interaction is a negative sample. TP_α denotes the number of positive samples in the test set predicted as positive samples, FP_α denotes the number of negative samples in the test set predicted as positive samples, FN_α denotes the number of positive samples predicted as negative samples, and TN_α denotes the number of negative samples in the test set predicted as negative samples; α denotes the prediction confidence;
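As a non-limiting illustration of step (5-1), the following Python sketch derives the confusion-matrix counts at each confidence α and from them the points of the ROC curve and its area. It uses the standard definitions TPR_α = TP_α / (TP_α + FN_α) and FPR_α = FP_α / (FP_α + TN_α), treats a pair as predicted positive when its score is at least α, and integrates with the trapezoidal rule; the function names and the toy labels and scores are hypothetical, and the test set is assumed to contain both positive and negative samples.

```python
import numpy as np

def confusion_counts(y_true, scores, alpha):
    """Return (TP, FP, FN, TN) when pairs with score >= alpha are predicted positive."""
    pred = scores >= alpha
    pos = y_true == 1
    tp = int(np.sum(pred & pos))
    fp = int(np.sum(pred & ~pos))
    fn = int(np.sum(~pred & pos))
    tn = int(np.sum(~pred & ~pos))
    return tp, fp, fn, tn

def roc_points(y_true, scores):
    """FPR and TPR for every confidence threshold, from strictest to loosest."""
    alphas = np.r_[np.inf, np.sort(np.unique(scores))[::-1]]
    fpr, tpr = [], []
    for alpha in alphas:
        tp, fp, fn, tn = confusion_counts(y_true, scores, alpha)
        tpr.append(tp / (tp + fn))
        fpr.append(fp / (fp + tn))
    return np.array(fpr), np.array(tpr)

def auroc(y_true, scores):
    """Area under the ROC curve by the trapezoidal rule."""
    fpr, tpr = roc_points(y_true, scores)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# Toy test set (hypothetical): 1 = interacting pair, 0 = non-interacting pair.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
area = auroc(y_true, scores)
```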

(5-2) drawing the PR curve: drawing the PR curve requires generating a precision-recall sequence, which consists of the precision precision_α and the recall recall_α at different prediction confidences α, calculated as follows:

the precision describes the proportion of samples predicted as positive at confidence α that are actually positive, and the recall describes the proportion of all positive samples that are correctly classified at confidence α; the two show opposite trends as α changes. Therefore, the sequence of precision-recall pairs generated by different values of α is used to draw a precision-recall curve, namely the PR curve, with recall as the horizontal axis and precision as the vertical axis. The area under the PR curve, AUPR, reflects the overall classification effect of the classifier: the larger the AUPR, that is, the closer it is to 1, the better the prediction effect of the model;
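As a non-limiting illustration of step (5-2), the following Python sketch generates the precision-recall sequence over all confidences α and computes the area under the resulting curve. It uses the standard definitions precision_α = TP_α / (TP_α + FP_α) and recall_α = TP_α / (TP_α + FN_α). Anchoring the curve at recall 0 with the first observed precision and integrating with the trapezoidal rule are assumptions of this sketch, as are the function names and the toy data.

```python
import numpy as np

def pr_points(y_true, scores):
    """Recall and precision for every confidence threshold, strictest first."""
    n_pos = int(np.sum(y_true == 1))
    recall, precision = [], []
    for alpha in np.sort(np.unique(scores))[::-1]:
        pred = scores >= alpha
        tp = int(np.sum(pred & (y_true == 1)))
        fp = int(np.sum(pred & (y_true == 0)))
        precision.append(tp / (tp + fp))   # tp + fp >= 1 because alpha is an observed score
        recall.append(tp / n_pos)
    return np.array(recall), np.array(precision)

def aupr(y_true, scores):
    """Area under the PR curve by the trapezoidal rule (assumed integration scheme)."""
    recall, precision = pr_points(y_true, scores)
    recall = np.r_[0.0, recall]                    # anchor the curve at recall 0
    precision = np.r_[precision[0], precision]
    return float(np.sum(np.diff(recall) * (precision[1:] + precision[:-1]) / 2))

# Toy test set (hypothetical): 1 = interacting pair, 0 = non-interacting pair.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
area = aupr(y_true, scores)
```

Other integration schemes, such as step-wise average precision, give slightly different AUPR values for the same sequence, so the scheme should be held fixed when comparing models.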

Screening candidate drugs is a main means by which AI assists the development of new drugs, and the two most critical steps are the computer modeling of drugs and targets (i.e., which data structure is adopted to represent them) and the selection of the prediction model. The method adopts two different computer models, namely network nodes and feature vectors, for drugs and targets at different stages. The two data models are described below, using drugs as an example.

Drug networks reflect the relationships between drugs well, and a multilayer network formed from different types of drug networks reflects those relationships from different angles, thereby providing a new idea for drug screening. Specifically, a drug network represents each drug as a node, and the interactions between drugs are defined as the edges between nodes. The definition of an edge differs between types of drug network, so that each type expresses the relationship between drug pairs from a different perspective. Taking the chemical similarity network of drugs as an example, the edge weight between a node pair represents the chemical structure similarity between the corresponding drug pair, and the absence of an edge represents a similarity of 0. In the process of constructing a drug network, the edge weights are usually normalized so that the weight values range from 0 to 1.
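As a non-limiting illustration of a single network layer, the following Python sketch builds the weighted adjacency matrix of a chemical similarity network from binary fingerprints. It assumes that the similarity is the Tanimoto coefficient c / (a + b - c), where a and b are the numbers of set bits in the two fingerprints and c is the number of bits set in both; this choice, the function name, and the four-bit toy fingerprints are assumptions of the sketch. The Tanimoto coefficient already lies between 0 and 1, so no further normalization is needed in this case.

```python
import numpy as np

def tanimoto_network(fingerprints):
    """Weighted adjacency matrix of a drug similarity network (assumed Tanimoto weights)."""
    fp = np.asarray(fingerprints, dtype=float)
    common = fp @ fp.T                               # c: bits set in both drugs
    bits = fp.sum(axis=1)                            # a, b: bits set in each drug
    union = bits[:, None] + bits[None, :] - common   # a + b - c
    weights = np.divide(common, union, out=np.zeros_like(common), where=union > 0)
    np.fill_diagonal(weights, 0.0)                   # no self-loops
    return weights

# Toy fingerprints (hypothetical): one row per drug, one column per fingerprint bit.
fingerprints = [[1, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 0, 0, 1]]
W = tanimoto_network(fingerprints)
```

Here W[0, 1] is 1/3 because the first two drugs share one of three distinct set bits, and the third drug has weight 0 to both, which corresponds to the absence of an edge.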

A feature vector is an array of real numbers, each of which represents a feature value that carries specific information in the application. In the method, the drug feature vector is obtained by encoding the drug network with a structural autoencoder, and the feature values contain the topological information of the network. The autoencoder is a self-supervised representation learning method that converts nodes into feature vectors using only the input (here, a drug network), and the dimensionality of the feature vectors is far smaller than the number of nodes. Compared with traditional one-hot encoding, this greatly reduces the complexity and sparsity of the data. The structural autoencoder adopted by the method considers both the first-order and the second-order proximity of the network and therefore captures the overall structure of the network more comprehensively.
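As a non-limiting illustration of how first-order and second-order proximity can enter the training objective of a structural autoencoder, the following Python sketch computes the two loss terms for one network layer. It assumes the forms commonly used in structural deep network embedding: the first-order loss is the edge-weighted squared distance between the encoded vectors of connected nodes, the second-order loss is the squared reconstruction error of each node's adjacency row, and the total is the second-order loss plus λ times the first-order loss. The single-layer sigmoid encoder and decoder, the function names, λ = 0.5, and the toy network are hypothetical simplifications, and no parameter update is shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(A, W_enc):
    """Map each node's adjacency row to a low-dimensional feature vector."""
    return sigmoid(A @ W_enc)

def decode(Z, W_dec):
    """Reconstruct each node's adjacency row from its feature vector."""
    return sigmoid(Z @ W_dec)

def first_order_loss(A, Z):
    """Sum over node pairs of edge weight times squared distance of their encodings."""
    diff = Z[:, None, :] - Z[None, :, :]
    return float(np.sum(A * np.sum(diff ** 2, axis=2)))

def second_order_loss(A, A_hat):
    """Squared error between encoder input rows and decoder output rows."""
    return float(np.sum((A_hat - A) ** 2))

def total_loss(A, W_enc, W_dec, lam=0.5):
    """Second-order loss plus lam times first-order loss, with 0 < lam < 1."""
    Z = encode(A, W_enc)
    return second_order_loss(A, decode(Z, W_dec)) + lam * first_order_loss(A, Z)

# Toy 4-node weighted network (hypothetical) encoded into 2 dimensions.
A = np.array([[0.0, 0.8, 0.0, 0.0],
              [0.8, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.9],
              [0.0, 0.0, 0.9, 0.0]])
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(4, 2))
W_dec = rng.normal(size=(2, 4))
loss = total_loss(A, W_enc, W_dec)
```

The first-order term pulls strongly connected nodes toward each other in the feature space, while the second-order term makes nodes with similar neighborhoods receive similar feature vectors.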

Network representation, vector encoding, and prediction model training for drugs and targets are the core content of a drug target prediction algorithm. The algorithm model avoids the blindness of manual screening and greatly saves time and capital costs. By integrating information on different aspects of drugs and targets and representing it in a uniform data form, and by using several relatively independent and clearly defined modules, it provides a feasible paradigm for future drug target prediction, thereby improving prediction accuracy while ensuring the efficiency, flexibility, and extensibility of the algorithm.

Claims

1. A drug target interaction prediction method based on multi-layer network and graph coding, comprising a data acquisition module, a data preprocessing module, a feature learning module, a model algorithm design module, and a result evaluation module, characterized in that:

(1) the data acquisition module includes:

(1-1) For drugs, collect drug-drug interaction data, drug-disease relationship data, drug-side effect relationship data, and six different types of drug-drug similarity relationship data, namely: drug chemical fingerprint data, drug therapeutic data, peptide chain data of drug action targets, biological process data of drugs, molecular function data of drugs, and cellular component data of drug action;

(1-2) For targets, that is, proteins, collect target-target interaction data, target-disease relationship data, and four different types of target-target similarity relationship data, namely: target peptide chain data, biological process data of targets, data on the cellular components where targets are located, and molecular function data of targets;

(1-3) Collect data on the interaction between drugs and targets;

(2) the data preprocessing module, including constructing drug- and target-related networks and generating multi-layer networks;

(2-1) Constructing the drug- and target-related networks includes:

A. For interaction relationship data between objects of a single type, construct homogeneous interaction networks, including the drug interaction network G_1D and the target interaction network G_1T;

B. For interaction relationship data between objects of different types, construct heterogeneous interaction networks, including the drug-disease-related network G_{D_DI}, the drug-side effect-related network G_{D_SE}, and the target-disease-related network G_{T_DI};

C. Collect drug information of different dimensions and construct drug similarity networks, including the drug chemical similarity network G_2D, the drug therapeutic similarity network G_3D, the drug action target sequence similarity network G_4D, the drug biological process similarity network G_5D, the drug molecular function similarity network G_6D, and the drug action cellular component similarity network G_7D;

D. Collect target information of different dimensions and construct target similarity networks, including the target sequence similarity network G_2T, the target biological process similarity network G_3T, the target cellular component similarity network G_4T, and the target molecular function similarity network G_5T;

E. Construct the drug-target interaction network G_{D_T};

(2-2) Generating the multi-layer networks includes generating the drug multi-layer network and generating the target multi-layer network, specifically:

(2-2-1) First, decompose the drug-disease-related network G_{D_DI} and convert it into a drug disease similarity network G_8D = (V_8D, E_8D), where V_8D and E_8D respectively represent the set of drug nodes in the network and the set of edge weights of disease similarity between pairs of drugs; the edge weight of drug disease similarity is

where x_{D_M} and y_{D_M} represent the row vectors corresponding to the two drugs in the adjacency matrix of G_{D_DI}, and ||·|| denotes the modulus of a vector;

Decompose the drug-side effect-related network G_{D_SE} and convert it into a drug side effect similarity network G_9D = (V_9D, E_9D), where V_9D and E_9D respectively represent the set of drug nodes in the network and the set of edge weights of side effect similarity between pairs of drugs; the edge weight of drug side effect similarity is

where x_{D_SE} and y_{D_SE} represent the row vectors corresponding to the two drugs in the adjacency matrix of G_{D_SE};

Decompose the target-disease-related network G_{T_DI} and convert it into a target disease similarity network G_6T = (V_6T, E_6T), where V_6T and E_6T respectively represent the set of target nodes in the network and the set of edge weights of disease similarity between pairs of targets; the edge weight of target disease similarity is

where x_{T_DI} and y_{T_DI} represent the row vectors corresponding to the two targets in the adjacency matrix of G_{T_DI};

(2-2-2) Then combine the drug-related networks into a multi-layer drug network G_D = {G_iD = (V_iD, E_iD)}, where i is the drug network number, i∈[1,9]; combine the target-related networks into a target multi-layer network G_T = {G_jT = (V_jT, E_jT)}, where j is the target network number, j∈[1,6];

(3) the feature learning module, including training structural autoencoders, encoding output, and processing similar feature vectors;

(3-1) Training structural autoencoders: one structural autoencoder is trained for each layer of the multi-layer drug network G_D and the target multi-layer network G_T;

(3-2) Encoding output: use the encoding end of the trained structural autoencoder to encode the corresponding network layers respectively to obtain multi-layer vectors of all drugs and targets;

(3-3) Similar feature vector processing: splicing the multi-layer vectors of a drug to obtain the final feature vector representation of the drug; splicing the multi-layer vectors of a target to obtain the final feature vector representation of the target;

(4) the model algorithm design module includes:

(4-1) Constructing training samples: the PairWise model is used to construct training samples; the data is randomly divided into M parts and M-fold cross-validation is performed, that is, one part is selected as the validation set each time and the rest serve as the training set, and the model parameters are adjusted according to the overall performance; M is a positive integer greater than 3;

(4-2) Training and evaluating the model: a lightweight gradient boosting decision tree is adopted, with decision trees used as weak learners to build a boosted tree, that is, decision trees T(x, θ_l) are constructed iteratively, where x and θ_l are respectively the input feature vector and the learnable parameters of the l-th decision tree;

(4-3) Predicting drug-target interactions: according to the optimal prediction model obtained by the result evaluation module, the probability of interaction is calculated for all drug-target pairs, and the drug-target pairs with high probability are screened out as candidate interacting drug-target pairs, which serve as the prediction result;

(5) the result evaluation module uses the ROC curve and the PR curve to verify the prediction effect of the model; specifically:

(5-1) Draw the ROC curve: the false positive rate FPR is taken as the horizontal axis and the true positive rate TPR as the vertical axis. The larger the area covered by the ROC curve, AUROC, the better the prediction effect of the model;

The true positive rate TPR_α and the false positive rate FPR_α of the ROC curve are calculated from the confusion matrix as follows:

Drug-target pairs with an interaction are positive samples, and pairs without an interaction are negative samples; TP_α represents the number of positive samples in the test set predicted as positive samples, FP_α represents the number of negative samples in the test set predicted as positive samples, FN_α represents the number of positive samples predicted as negative samples, and TN_α represents the number of negative samples in the test set predicted as negative samples; α represents the prediction confidence;

(5-2) Draw the PR curve: the precision precision_α and the recall recall_α at different prediction confidences α form the precision-recall sequence:

Draw a precision-recall curve, namely the PR curve, with the recall as the horizontal axis and the precision as the vertical axis. The area under the PR curve, AUPR, reflects the overall classification effect of the classifier. The larger the area under the PR curve, AUPR, the better the prediction effect of the model;

(5-3) Model evaluation: According to the prediction result of (4-3), use the drawn ROC curve and PR curve and calculate AUROC and AUPR to find the model parameters under the optimal prediction result.

2. The drug target interaction prediction method based on multi-layer network and graph coding as claimed in claim 1, characterized in that: in (2-1), A is specifically:

For drug-drug interaction relationship data, construct a drug interaction network G_1D = (V_1D, E_1D), where V_1D represents the set of drug nodes in the network and E_1D represents the set of edges between pairs of interacting drugs in the network;

For target-target interaction relationship data, construct a target interaction network G_1T = (V_1T, E_1T), where V_1T represents the set of target nodes in the network and E_1T represents the set of edges between pairs of interacting targets in the network.

3. The drug target interaction prediction method based on multi-layer network and graph coding as claimed in claim 1, characterized in that: in (2-1), B is specifically:

For drug-disease relationship data, construct a drug-disease-related network

where

E_{D_DI} respectively represent the set of drug nodes, the set of disease nodes, and the set of edges of drug-disease relationships in the network;

For drug-side effect relationship data, construct a drug-side effect-related network

where

E_{D_SE} respectively represent the set of drug nodes, the set of side effect nodes, and the set of edges of drug-side effect relationships in the network;

For target-disease relationship data, construct a target-disease-related network

where

E_{T_DI} respectively represent the set of target nodes, the set of disease nodes, and the set of edges of target-disease relationships in the network.

4. The drug target interaction prediction method based on multi-layer network and graph coding as claimed in claim 1, characterized in that: in (2-1), C is specifically:

For drug chemical fingerprint data, construct a drug chemical similarity network G_2D = (V_2D, E_2D), where V_2D and E_2D respectively represent the set of drug nodes in the network and the set of edge weights of chemical similarity between pairs of drugs; the edge weight of chemical similarity is

where a₁ and b₁ are the numbers of bits of the respective MACCS fingerprints of the two drugs, and c₁ is the number of bits that the two drugs have in common;

For drug therapeutic data, construct a drug therapeutic similarity network G_3D = (V_3D, E_3D), where V_3D and E_3D respectively represent the set of drug nodes in the network and the set of edge weights of therapeutic similarity between pairs of drugs; the edge weight of therapeutic similarity is

where a₂ and b₂ are the respective ATC codes of the two drugs, and c₂ is the number of digits that the ATC codes of the two drugs have in common;

For the peptide chain data of drug action targets, construct a drug action target sequence similarity network G_4D = (V_4D, E_4D), where V_4D and E_4D respectively represent the set of drug nodes in the network and the set of edge weights of action target similarity between pairs of drugs; the edge weight of drug action target similarity is

where a and b represent the respective targets of the two drugs, T_{T_T}(a, b) represents the sequence similarity of the respective targets of the two drugs, and mean(·) denotes the average value;

For the biological process data of drugs, construct a drug biological process similarity network G_5D = (V_5D, E_5D), where V_5D and E_5D respectively represent the set of drug nodes in the network and the set of edge weights of biological process similarity between pairs of drugs; the edge weight of drug biological process similarity is

where T_{T_P}(a, b) represents the biological process similarity of the respective targets of the two drugs;

For the molecular function data of drugs, construct a drug molecular function similarity network G_6D = (V_6D, E_6D), where V_6D and E_6D respectively represent the set of drug nodes in the network and the set of edge weights of molecular function similarity between pairs of drugs; the edge weight of drug molecular function similarity is

where T_{T_M}(a, b) represents the molecular function similarity of the respective targets of the two drugs;

For the cellular component data of drug action, construct a drug action cellular component similarity network G_7D = (V_7D, E_7D), where V_7D and E_7D respectively represent the set of drug nodes in the network and the set of edge weights of action cellular component similarity between pairs of drugs; the edge weight of drug action cellular component similarity is

where T_{T_C}(a, b) represents the cellular component similarity of the respective targets of the two drugs.

5. The drug target interaction prediction method based on multi-layer network and graph encoding as claimed in claim 1, characterized in that: in (2-1), D is specifically:

For the peptide chain data of targets, construct a target sequence similarity network G_2T = (V_2T, E_2T), where V_2T and E_2T respectively represent the set of target nodes in the network and the set of edge weights of sequence similarity between pairs of targets; the edge weight of sequence similarity is

where a₃ and b₃ are the numbers of digits of the respective peptide chain sequences of the two targets, and c₃ is the number of digits that the peptide chain sequences of the two targets have in common;

For the biological process data of targets, construct a target biological process similarity network G_3T = (V_3T, E_3T), where V_3T and E_3T respectively represent the set of target nodes in the network and the set of edge weights of biological process similarity between pairs of targets; the edge weight T_{T_P}(a, b) of target biological process similarity is obtained from the GO semantic annotations of the biological processes of the two targets;

For the data on the cellular components where targets are located, construct a target cellular component similarity network G_4T = (V_4T, E_4T), where V_4T and E_4T respectively represent the set of target nodes in the network and the set of edge weights of cellular component similarity between pairs of targets; the edge weight T_{T_C}(a, b) of target cellular component similarity is obtained from the GO semantic annotations of the cellular components where the two targets are located;

For the molecular function data of targets, construct a target molecular function similarity network G_5T = (V_5T, E_5T), where V_5T and E_5T respectively represent the set of target nodes in the network and the set of edge weights of molecular function similarity between pairs of targets; the edge weight T_{T_M}(a, b) of target molecular function similarity is obtained from the GO semantic annotations of the molecular functions of the two targets.

6. The drug target interaction prediction method based on multi-layer network and graph encoding as claimed in claim 1, characterized in that: in (2-1), E is specifically:

For drug-target interaction relationship data, construct a drug-target interaction network

where

E_{D_T} respectively represent the set of drug nodes, the set of target nodes, and the set of edges of drug-target relationships in the network.

7. The drug target interaction prediction method based on multi-layer network and graph coding as claimed in claim 1, characterized in that the training process in (3-1) is:

a. Use the adjacency matrix corresponding to the single-layer network as the input of the encoder;

b. After encoding, the output of the encoder is obtained, and it is used as the input of the decoder;

c. After decoding, the output of the decoder is obtained, and the loss function is calculated by using the adjacency matrix, the output of the encoder, and the output of the decoder;

d. Use the loss function to calculate the gradient of each parameter of the encoder and the decoder, update the parameters, and the update step size is a multiple of the negative gradient;

e. Repeat steps b to d until the loss function converges;

The calculation of the loss function L_m includes two parts:

the first-order similarity loss

where N is the number of nodes, z_p and z_g respectively represent the encoder's output vectors for node p and node g, and T_pg represents the weight of the connecting edge;

the second-order similarity loss

where b_n and

respectively represent the encoder input vector and the decoder output vector of node n;

The total loss function is L_m = L_2nd + λL_1st, where λ is a penalty term, 0<λ<1.

8. The drug target interaction prediction method based on multi-layer network and graph coding as claimed in claim 1, characterized in that the specific process of (4-2) is as follows:

(4-2-1) Before constructing the decision tree in each round, use the gradient-based one-side sampling (GOSS) algorithm to filter out small-gradient samples, that is, keep the small portion of samples with large gradients and randomly select some of the small-gradient samples to calculate the overall variance gain;

(4-2-2) Before constructing the decision tree in each round, use the exclusive feature bundling (EFB) algorithm to merge mutually exclusive features;

(4-2-3) Based on the filtered samples, given the input feature vector x and the corresponding label y of a sample, a fitting target is constructed for the l-th decision tree to be generated: if l=1, the fitting target is the label of the sample, where the positive sample label is 1 and the negative sample label is 0; when l≥2, the fitting target is

The boosted tree obtained after l-1 rounds of iteration is

L is the loss function. In the binary classification task, when the predicted value of a single sample (x, y) is

the loss function is defined as:

(4-2-4) Based on the filtered samples and the fitting target, construct a binary decision tree. A leaf node of the binary decision tree is split as follows: for each filtered feature, construct a histogram according to the value range of the feature; use the histogram to calculate the variance gain of each division point; select the feature and division point with the largest variance gain as the splitting feature and optimal division point of the current node; and divide the data of the leaf node into two batches at the optimal division point. The recursion is repeated until the maximum depth of the tree is reached. The variance gain of feature f on dataset D at division point d is expressed as:

where x_l, x_{l,f}, and g_l respectively represent the l-th sample vector, the f-th feature of the l-th sample vector, and its negative gradient,

and

are, respectively, the number of samples in dataset D whose feature f is smaller than the division point d and the number whose feature f is larger than the division point d;

(4-2-5) K rounds of iterations are performed to generate K decision trees;

(4-2-6) Add K decision trees to generate the final lightweight gradient boosting decision tree