CN113571125A - Drug target interaction prediction method based on multilayer network and graph coding - Google Patents

Info

Publication number
CN113571125A
Authority
CN
China
Prior art keywords
drug
network
target
similarity
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110865457.9A
Other languages
Chinese (zh)
Inventor
刘闯
王逸伟
詹秀秀
张子柯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Normal University
Original Assignee
Hangzhou Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Normal University
Priority to CN202110865457.9A
Publication of CN113571125A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30: Drug targeting using structural data; Docking or binding prediction
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a drug-target interaction prediction method based on a multilayer network and graph encoding. The method comprises a data acquisition module, a data preprocessing module, a feature learning module, a model algorithm design module and a result evaluation module. The data preprocessing module constructs the drug and protein networks and processes the heterogeneous graphs. The feature learning module performs self-supervised training of structural graph encoders, encodes each graph into vectors, and processes the vectors of same-type nodes, so that the topological information of the graphs is expressed in vector form. The model algorithm design module constructs the cross-validation sets and designs the prediction model. The result evaluation module verifies the predictive performance of the model with the ROC curve, derived from the confusion matrix, and the PR curve, derived from the sequence of precision and recall values. The method studies drugs and targets from the perspective of data mining and graphs, and predicts the interaction between them from the generated graph-structure information with a subsequent tree model.

Figure 202110865457

Description

Drug target interaction prediction method based on multilayer network and graph coding
Technical Field
The invention belongs to the technical field of data mining, and in particular relates to a drug-target interaction prediction method based on a multilayer network and graph encoding.
Background
With the rapid development of machine learning and of biological detection technologies such as third-generation gene sequencing, the volume of biological data has grown rapidly and the field has entered the era of big data. More and more researchers and companies are therefore turning to AI-assisted drug development. The most direct advantage of using computer algorithms to help screen targets is that candidate drugs can be screened computationally and the candidate range narrowed, which greatly shortens the new-drug discovery cycle and reduces the research materials consumed. Data from practical applications indicate that AI technology can reduce drug development costs by about 35%. An analysis of the recent net-income trends of leading international pharmaceutical companies shows that most of them increased their net income to varying degrees after introducing AI-assisted drug research and development. AI technology can also analyse a drug against multiple specific targets to predict its multiple targets, thereby revealing the complex mechanisms of action of some diseases. In addition, it can improve the accuracy and safety of drug prediction and help explore the mechanisms of drug side effects. Overall, AI technology can greatly simplify the new-drug research and development process, save research and development costs, and help pharmaceutical companies develop new drugs quickly.
Disclosure of Invention
The invention aims to provide a method for predicting drug-target interactions based on a multilayer network and graph encoding, which can eliminate the randomness of clinical experiments, narrow the screening range and shorten the test cycle.
The invention constructs nine drug-related networks (drug interaction network, drug-disease network, drug-side-effect network, drug chemical similarity network, drug therapeutic similarity network, drug action-target sequence similarity network, drug biological-process similarity network, drug molecular-function similarity network, and drug action cellular-component similarity network), six target-related networks (target interaction network, target-disease network, target sequence similarity network, target biological-process similarity network, target cellular-component similarity network, and target molecular-function similarity network) and a drug-target interaction network used as the label. A structural autoencoder is trained separately on each of these networks, the trained autoencoders encode the nodes into vectors, and the encoded vectors of each node in the different networks are concatenated to form its final feature vector. The drug-target pairs to be predicted are fed into a trained boosting tree model (a model obtained by linearly adding a series of decision trees built from the training set) to obtain the final evaluation score.
The method comprises a data acquisition module, a data preprocessing module, a feature learning module, a model algorithm design module and a result evaluation module.
(1) The data acquisition module comprises:
(1-1) for drugs, collecting drug-drug interaction data, drug-disease relation data, drug-side-effect relation data, and six different types of drug-pair similarity data, namely: chemical fingerprint data of the drugs, therapeutic data of the drugs, peptide chain data of the drugs' action targets, biological process data of the drugs, molecular function data of the drugs, and action cellular-component data of the drugs;
(1-2) for targets, i.e. proteins, collecting target-target interaction data, target-disease relation data, and four different types of target-pair similarity data, namely: peptide chain data of the targets, biological process data of the targets, cellular-component data of the targets, and molecular function data of the targets;
(1-3) collecting drug-target interaction data.
(2) The data preprocessing module comprises constructing the drug- and target-related networks and generating the multilayer networks;
(2-1) constructing the drug- and target-related networks comprises:
A. For interaction data between objects of a single class, constructing homogeneous interaction networks, including the drug interaction network $G_{1D}$ and the target interaction network $G_{1T}$;
B. For interaction data between objects of different classes, constructing heterogeneous interaction networks, including the drug-disease network $G_{D\_DI}$, the drug-side-effect network $G_{D\_SE}$ and the target-disease network $G_{T\_DI}$;
C. Collecting drug information of different dimensions and constructing drug similarity networks, including the drug chemical similarity network $G_{2D}$, the drug therapeutic similarity network $G_{3D}$, the drug action-target sequence similarity network $G_{4D}$, the drug biological-process similarity network $G_{5D}$, the drug molecular-function similarity network $G_{6D}$ and the drug action cellular-component similarity network $G_{7D}$;
D. Collecting target information of different dimensions and constructing target similarity networks, including the target sequence similarity network $G_{2T}$, the target biological-process similarity network $G_{3T}$, the target cellular-component similarity network $G_{4T}$ and the target molecular-function similarity network $G_{5T}$;
E. Constructing the drug-target interaction network $G_{D\_T}$.
(2-2) generating the multilayer networks comprises generating a drug multilayer network and a target multilayer network, with the following specific steps:
(2-2-1) first, the drug-disease network $G_{D\_DI}$ is decomposed and converted into the drug disease-similarity network $G_{8D}=(V_{8D},E_{8D})$, where $V_{8D}$ and $E_{8D}$ denote, respectively, the set of drug nodes in the network and the set of edge weights giving the disease similarity between two drugs; the edge weight of the disease similarity of drugs is
Figure BDA0003187370820000031
where $x_{D\_M}$ and $y_{D\_M}$ denote the row vectors of the two drugs in the adjacency matrix of $G_{D\_DI}$, and $\|\cdot\|$ denotes the vector norm (modulus);
the drug-side-effect network $G_{D\_SE}$ is decomposed and converted into the drug side-effect similarity network $G_{9D}=(V_{9D},E_{9D})$, where $V_{9D}$ and $E_{9D}$ denote, respectively, the set of drug nodes in the network and the set of edge weights giving the side-effect similarity between two drugs; the edge weight of the side-effect similarity of drugs is
Figure BDA0003187370820000032
where $x_{D\_SE}$ and $y_{D\_SE}$ denote the row vectors of the two drugs in the adjacency matrix of $G_{D\_SE}$;
the target-disease network $G_{T\_DI}$ is decomposed and converted into the target disease-similarity network $G_{6T}=(V_{6T},E_{6T})$, where $V_{6T}$ and $E_{6T}$ denote, respectively, the set of target nodes in the network and the set of edge weights giving the disease similarity between two targets; the edge weight of the disease similarity of targets is
Figure BDA0003187370820000033
where $x_{T\_DI}$ and $y_{T\_DI}$ denote the row vectors of the two targets in the adjacency matrix of $G_{T\_DI}$;
(2-2-2) then the drug-related networks are combined into the drug multilayer network $G_D=\{G_{iD}=(V_{iD},E_{iD})\}$, where $i$ is the drug network index, $i\in[1,9]$; and the target-related networks are combined into the target multilayer network $G_T=\{G_{jT}=(V_{jT},E_{jT})\}$, where $j$ is the target network index, $j\in[1,6]$.
(3) The feature learning module comprises training the structural autoencoders, encoding output, and processing the feature vectors of same-type nodes;
(3-1) training the structural autoencoders: one structural autoencoder is trained for each layer of the drug multilayer network $G_D$ and of the target multilayer network $G_T$;
(3-2) encoding output: the encoder part of each trained structural autoencoder encodes its corresponding network layer, yielding multilayer vectors for all drugs and targets;
(3-3) processing the feature vectors of same-type nodes: the multilayer vectors of a drug are concatenated to obtain the final feature vector of that drug; the multilayer vectors of a target are concatenated to obtain the final feature vector of that target.
(4) The model algorithm design module comprises constructing training samples, training and evaluating the model, and predicting drug-target interactions;
(4-1) constructing training samples: training samples are constructed with a PairWise model, the data are randomly divided into M parts, and M-fold cross-validation is performed, i.e. each time one part is selected as the validation set and the rest as the training set; the model parameters are adjusted according to the overall cross-validation performance, where M is a positive integer greater than 3;
(4-2) training and evaluating the model: a boosting tree is built with a lightweight gradient boosting decision tree that uses decision trees as weak learners, i.e. decision trees $T(x,\theta_l)$ are built iteratively, where $x$ is the input feature vector and $\theta_l$ denotes the learnable parameters of the $l$-th decision tree;
(4-3) predicting drug-target interactions: using the optimal prediction model obtained by the result evaluation module, the interaction probability of every drug-target pair is calculated, and the pairs with high probability are selected as candidate interacting drug-target pairs, which form the prediction result.
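The additive form of the boosting tree in (4-2), a sum of weak decision trees each fitted to what the previous trees left unexplained, can be illustrated with a small hand-rolled sketch. This is not the lightweight gradient boosting implementation the method refers to: depth-1 trees (stumps), a squared-error loss, and the names `fit_stump`, `fit_boosted_trees`, `n_trees` and `lr` are all assumptions made for illustration, and the toy feature matrix stands in for concatenated drug and target vectors.

```python
import numpy as np

def fit_stump(X, residual):
    """Least-squares depth-1 tree: returns (feature, threshold, left value, right value)."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:      # skip the maximum so both sides are non-empty
            left = X[:, j] <= thr
            lv, rv = residual[left].mean(), residual[~left].mean()
            err = ((residual - np.where(left, lv, rv)) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, thr, lv, rv)
    return best[1:]

def predict_stump(stump, X):
    j, thr, lv, rv = stump
    return np.where(X[:, j] <= thr, lv, rv)

def fit_boosted_trees(X, y, n_trees=20, lr=0.3):
    """Iteratively add stumps T(x, theta_l), each fitted to the current residual."""
    base = y.mean()
    score = np.full(len(y), base)
    stumps = []
    for _ in range(n_trees):
        stump = fit_stump(X, y - score)          # residual = negative gradient of squared loss
        score = score + lr * predict_stump(stump, X)
        stumps.append(stump)
    return base, lr, stumps

def predict_score(model, X):
    """Final score: base value plus the (scaled) sum of all trees."""
    base, lr, stumps = model
    return base + lr * sum(predict_stump(s, X) for s in stumps)

# Toy data (assumption): 40 drug-target pairs with 4-dimensional pair features.
rng = np.random.default_rng(0)
X = rng.random((40, 4))
y = (X[:, 0] > 0.5).astype(float)                # 1 = interacting pair, 0 = not
model = fit_boosted_trees(X, y)
scores = predict_score(model, X)
```

Pairs can then be ranked by `scores`, and the highest-scoring pairs kept as candidates, in the spirit of step (4-3).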
(5) The result evaluation module verifies the predictive performance of the model with the ROC curve and the PR curve, as follows:
(5-1) plotting the ROC curve: the false positive rate FPR is the horizontal axis and the true positive rate TPR is the vertical axis; the larger the area under the ROC curve (AUROC), the better the predictive performance of the model;
the true positive rate $TPR_\alpha$ and the false positive rate $FPR_\alpha$ of the ROC curve are calculated from the confusion matrix as follows:
$$TPR_\alpha=\frac{TP_\alpha}{TP_\alpha+FN_\alpha},\qquad FPR_\alpha=\frac{FP_\alpha}{FP_\alpha+TN_\alpha}$$
a drug-target pair is a positive sample if an interaction exists and a negative sample otherwise; $TP_\alpha$ is the number of positive samples in the test set predicted as positive, $FP_\alpha$ is the number of negative samples in the test set predicted as positive, $FN_\alpha$ is the number of positive samples predicted as negative, and $TN_\alpha$ is the number of negative samples in the test set predicted as negative; $\alpha$ denotes the prediction confidence;
(5-2) plotting the PR curve: the precision $precision_\alpha$ and the recall $recall_\alpha$ at different prediction confidences $\alpha$ form a precision-recall sequence:
$$precision_\alpha=\frac{TP_\alpha}{TP_\alpha+FP_\alpha},\qquad recall_\alpha=\frac{TP_\alpha}{TP_\alpha+FN_\alpha}$$
the precision-recall curve, i.e. the PR curve, is drawn with recall on the horizontal axis and precision on the vertical axis; the area under the PR curve (AUPR) reflects the overall classification performance of the classifier, and the larger the AUPR, the better the predictive performance of the model;
(5-3) evaluating the model: from the prediction results of step (4-3), the ROC and PR curves are drawn, AUROC and AUPR are calculated, and the model parameters giving the best prediction result are selected.
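The evaluation in (5-1) to (5-3) can be sketched as follows: sweep the confidence threshold α over the predicted scores, read the four confusion-matrix counts at each α, and integrate the resulting ROC and PR points with the trapezoidal rule. The function names and the six-pair toy data are assumptions for illustration, and the trapezoidal area is only one common way to approximate AUROC and AUPR.

```python
import numpy as np

def confusion_at(scores, labels, alpha):
    """Confusion-matrix counts when pairs with score >= alpha are predicted positive."""
    pred = scores >= alpha
    pos = labels == 1
    tp = int(np.sum(pred & pos))
    fp = int(np.sum(pred & ~pos))
    fn = int(np.sum(~pred & pos))
    tn = int(np.sum(~pred & ~pos))
    return tp, fp, fn, tn

def roc_pr_points(scores, labels):
    """ROC points (FPR, TPR) and PR points (recall, precision), one per threshold."""
    roc, pr = [(0.0, 0.0)], []
    for alpha in np.sort(np.unique(scores))[::-1]:   # from strictest to loosest threshold
        tp, fp, fn, tn = confusion_at(scores, labels, alpha)
        tpr, fpr = tp / (tp + fn), fp / (fp + tn)
        roc.append((fpr, tpr))
        pr.append((tpr, tp / (tp + fp)))             # recall equals TPR
    return roc, pr

def area_under(points):
    """Trapezoidal area under a curve given as a list of (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy predictions (assumption): 3 interacting and 3 non-interacting pairs.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
labels = np.array([1, 1, 0, 1, 0, 0])
roc, pr = roc_pr_points(scores, labels)
auroc, aupr = area_under(roc), area_under(pr)
```

In this toy case eight of the nine positive-negative pairs are ranked correctly, so `auroc` equals 8/9.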
The method studies the interaction of drug-target pairs from the perspectives of data mining and multilayer networks. By constructing networks, it abstracts different types of data into the same data structure, and it achieves drug-target prediction by combining the decomposition of heterogeneous networks, automatic learning of network topology by structural autoencoders, and tree-based classifiers. The method can therefore analyse drug-target data effectively and predict the interactions between drugs and targets, thereby providing scientific guidance for new-drug research and development, improving its efficiency and, to a certain extent, promoting independent innovation in medicine.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
Detailed Description
The following detailed description of the embodiments of the invention is provided in connection with the accompanying drawings.
The data used cover 732 drugs, 1915 targets (proteins), 12904 side effects and 440 diseases. They comprise interaction data between drug pairs, between drugs and diseases, between drugs and side effects, between targets, and between targets and diseases, together with MACCS fingerprint data of the drugs' chemical formulas, GO annotations of drugs and targets, protein sequence data of the targets, and half-maximal inhibitory concentration data between drugs and targets.
As shown in fig. 1, a method for predicting drug target interaction based on multilayer network and graph coding comprises a data acquisition module, a data preprocessing module, a feature learning module, a model algorithm design module, and a result evaluation module, and specifically comprises the following steps:
(1) a data acquisition module comprising:
(1-1) for drugs, collecting drug-drug interaction data, drug-disease relation data, drug-side-effect relation data, and six different types of drug-pair similarity data, namely: chemical fingerprint data of the drugs, therapeutic data of the drugs, peptide chain data of the drugs' action targets, biological process data of the drugs, molecular function data of the drugs, and action cellular-component data of the drugs;
(1-2) for targets, i.e. proteins, collecting target-target interaction data, target-disease relation data, and four different types of target-pair similarity data, namely: peptide chain data of the targets, biological process data of the targets, cellular-component data of the targets, and molecular function data of the targets;
(1-3) collecting drug-target interaction data;
the above data are downloaded from public websites.
(2) The data preprocessing module comprises constructing the drug- and target-related networks and generating the multilayer networks, which provides the data basis for drug-target prediction, specifically:
(2-1) constructing the drug- and target-related networks, comprising:
(I) for the drug-drug interaction data, constructing the drug interaction network $G_{1D}=(V_{1D},E_{1D})$, where $V_{1D}$ denotes the set of drug nodes in the network and $E_{1D}$ denotes the set of edges between pairs of drugs that interact;
for the target-target interaction data, constructing the target interaction network $G_{1T}=(V_{1T},E_{1T})$, where $V_{1T}$ denotes the set of target nodes in the network and $E_{1T}$ denotes the set of edges between pairs of targets that interact;
(II) for the drug-disease relation data, constructing the drug-disease network
Figure BDA0003187370820000061
where
Figure BDA0003187370820000062
and $E_{D\_DI}$ denote, respectively, the set of drug nodes, the set of disease nodes and the set of edges between drugs and diseases in the network;
for the drug-side-effect relation data, constructing the drug-side-effect network
Figure BDA0003187370820000063
where
Figure BDA0003187370820000064
and $E_{D\_SE}$ denote, respectively, the set of drug nodes, the set of side-effect nodes and the set of edges between drugs and side effects in the network;
for the target-disease relation data, constructing the target-disease network
Figure BDA0003187370820000065
where
Figure BDA0003187370820000066
and $E_{T\_DI}$ denote, respectively, the set of target nodes, the set of disease nodes and the set of edges between targets and diseases in the network;
(III) for the chemical fingerprint data of the drugs, constructing the drug chemical similarity network $G_{2D}=(V_{2D},E_{2D})$, where $V_{2D}$ and $E_{2D}$ denote, respectively, the set of drug nodes in the network and the set of edge weights giving the chemical similarity between two drugs; the edge weight of chemical similarity is
Figure BDA0003187370820000071
where $a_1$ and $b_1$ are the numbers of bits in the MACCS fingerprints of the two drugs, and $c_1$ is the number of bits the two drugs have in common;
for the therapeutic data of the drugs, constructing the drug therapeutic similarity network $G_{3D}=(V_{3D},E_{3D})$, where $V_{3D}$ and $E_{3D}$ denote, respectively, the set of drug nodes in the network and the set of edge weights giving the therapeutic similarity between two drugs; the edge weight of therapeutic similarity is
Figure BDA0003187370820000072
where $a_2$ and $b_2$ are the ATC codes of the two drugs, and $c_2$ is the number of digits that the two ATC codes have in common;
for the peptide chain data of the drugs' action targets, constructing the drug action-target sequence similarity network $G_{4D}=(V_{4D},E_{4D})$, where $V_{4D}$ and $E_{4D}$ denote, respectively, the set of drug nodes in the network and the set of edge weights giving the action-target similarity between two drugs; the edge weight of drug action-target similarity is
Figure BDA0003187370820000073
where $a$ and $b$ denote the respective targets of the two drugs, $T_{T\_T}(a,b)$ denotes the sequence similarity of those targets, and mean(·) denotes the mean;
for the biological process data of the drugs, constructing the drug biological-process similarity network $G_{5D}=(V_{5D},E_{5D})$, where $V_{5D}$ and $E_{5D}$ denote, respectively, the set of drug nodes in the network and the set of edge weights giving the biological-process similarity between two drugs; the edge weight of drug biological-process similarity is
Figure BDA0003187370820000074
where $T_{T\_P}(a,b)$ denotes the biological-process similarity of the respective targets of the two drugs;
for the molecular function data of the drugs, constructing the drug molecular-function similarity network $G_{6D}=(V_{6D},E_{6D})$, where $V_{6D}$ and $E_{6D}$ denote, respectively, the set of drug nodes in the network and the set of edge weights giving the molecular-function similarity between two drugs; the edge weight of drug molecular-function similarity is
Figure BDA0003187370820000075
where $T_{T\_M}(a,b)$ denotes the molecular-function similarity of the respective targets of the two drugs;
for the action cellular-component data of the drugs, constructing the drug action cellular-component similarity network $G_{7D}=(V_{7D},E_{7D})$, where $V_{7D}$ and $E_{7D}$ denote, respectively, the set of drug nodes in the network and the set of edge weights giving the action cellular-component similarity between two drugs; the edge weight of drug action cellular-component similarity is
Figure BDA0003187370820000081
where $T_{T\_C}(a,b)$ denotes the cellular-component similarity of the respective targets of the two drugs;
(IV) for the peptide chain data of the targets, constructing the target sequence similarity network $G_{2T}=(V_{2T},E_{2T})$, where $V_{2T}$ and $E_{2T}$ denote, respectively, the set of target nodes in the network and the set of edge weights giving the sequence similarity between two targets; the edge weight of sequence similarity is
Figure BDA0003187370820000082
where $a_3$ and $b_3$ are the numbers of positions in the peptide chain sequences of the two targets, and $c_3$ is the number of positions at which the two peptide chain sequences are identical;
for the biological process data of the targets, constructing the target biological-process similarity network $G_{3T}=(V_{3T},E_{3T})$, where $V_{3T}$ and $E_{3T}$ denote, respectively, the set of target nodes in the network and the set of edge weights giving the biological-process similarity between two targets; the edge weight $T_{T\_P}(a,b)$ of target biological-process similarity is obtained from the GO semantic annotations of the biological processes of the two targets;
for the cellular-component data of the targets, constructing the target cellular-component similarity network $G_{4T}=(V_{4T},E_{4T})$, where $V_{4T}$ and $E_{4T}$ denote, respectively, the set of target nodes in the network and the set of edge weights giving the cellular-component similarity between two targets; the edge weight $T_{T\_C}(a,b)$ of target cellular-component similarity is obtained from the GO semantic annotations of the cellular components of the two targets;
for the molecular function data of the targets, constructing the target molecular-function similarity network $G_{5T}=(V_{5T},E_{5T})$, where $V_{5T}$ and $E_{5T}$ denote, respectively, the set of target nodes in the network and the set of edge weights giving the molecular-function similarity between two targets; the edge weight $T_{T\_M}(a,b)$ of target molecular-function similarity is obtained from the GO semantic annotations of the molecular functions of the two targets;
(V) for the drug-target interaction data, constructing the drug-target interaction network
Figure BDA0003187370820000083
where
Figure BDA0003187370820000084
and $E_{D\_T}$ denote, respectively, the set of drug nodes, the set of target nodes and the set of edges between drugs and targets in the network.
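Two of the edge weights described in (III) can be sketched in code. The formula images are not reproduced in this text, so the sketch rests on assumptions: the chemical similarity is taken to be the Tanimoto coefficient c/(a+b-c) over the set bits of two binary fingerprints, which matches the definitions of $a_1$, $b_1$ and $c_1$ but is not confirmed by the source, and the target-derived drug similarity is taken to be the mean of a target-target similarity matrix over all pairs of the two drugs' targets. The function names and the toy data are hypothetical.

```python
import numpy as np

def tanimoto(fp_a, fp_b):
    """Assumed chemical similarity: c / (a + b - c) over set bits of boolean fingerprints."""
    a = int(fp_a.sum())               # bits set in drug A's fingerprint
    b = int(fp_b.sum())               # bits set in drug B's fingerprint
    c = int((fp_a & fp_b).sum())      # bits set in both
    return c / (a + b - c) if (a + b - c) else 0.0

def drug_similarity_from_targets(targets_a, targets_b, target_sim):
    """Assumed form of mean(.): average target similarity over all cross pairs."""
    return float(np.mean([target_sim[i, j] for i in targets_a for j in targets_b]))

# Toy fingerprints (assumption): 5 bits instead of the 166-bit MACCS keys.
fp_a = np.array([1, 1, 0, 1, 0], dtype=bool)
fp_b = np.array([1, 0, 0, 1, 1], dtype=bool)
chem_sim = tanimoto(fp_a, fp_b)

# Toy target-target sequence similarity matrix for 3 targets (assumption).
target_sim = np.array([[1.0, 0.2, 0.4],
                       [0.2, 1.0, 0.6],
                       [0.4, 0.6, 1.0]])
seq_sim = drug_similarity_from_targets([0, 1], [1, 2], target_sim)
```

The same averaging helper could be reused for the biological-process, molecular-function and cellular-component layers by passing the corresponding target similarity matrix.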
(2-2) generating the multilayer networks, including generating a drug multilayer network and a target multilayer network:
(2-2-1) the drug-disease network $G_{D\_DI}$ is decomposed and converted into the drug disease-similarity network $G_{8D}=(V_{8D},E_{8D})$, where $V_{8D}$ and $E_{8D}$ denote, respectively, the set of drug nodes in the network and the set of edge weights giving the disease similarity between two drugs; the edge weight of the disease similarity of drugs is
Figure BDA0003187370820000085
where $x_{D\_M}$ and $y_{D\_M}$ denote the row vectors of the two drugs in the adjacency matrix of $G_{D\_DI}$, and $\|\cdot\|$ denotes the vector norm (modulus);
the drug-side-effect network $G_{D\_SE}$ is decomposed and converted into the drug side-effect similarity network $G_{9D}=(V_{9D},E_{9D})$, where $V_{9D}$ and $E_{9D}$ denote, respectively, the set of drug nodes in the network and the set of edge weights giving the side-effect similarity between two drugs; the edge weight of the side-effect similarity of drugs is
Figure BDA0003187370820000091
where $x_{D\_SE}$ and $y_{D\_SE}$ denote the row vectors of the two drugs in the adjacency matrix of $G_{D\_SE}$;
the target-disease network $G_{T\_DI}$ is decomposed and converted into the target disease-similarity network $G_{6T}=(V_{6T},E_{6T})$, where $V_{6T}$ and $E_{6T}$ denote, respectively, the set of target nodes in the network and the set of edge weights giving the disease similarity between two targets; the edge weight of the disease similarity of targets is
Figure BDA0003187370820000092
where $x_{T\_DI}$ and $y_{T\_DI}$ denote the row vectors of the two targets in the adjacency matrix of $G_{T\_DI}$;
(2-2-2) the drug interaction network, the drug disease-similarity network, the drug side-effect similarity network, the drug chemical similarity network, the drug therapeutic similarity network, the drug action-target sequence similarity network, the drug biological-process similarity network, the drug molecular-function similarity network and the drug action cellular-component similarity network are combined into the drug multilayer network $G_D=\{G_{iD}=(V_{iD},E_{iD})\}$, where $i$ is the drug network index, $i\in[1,9]$;
the target interaction network, the target disease-similarity network, the target sequence similarity network, the target biological-process similarity network, the target cellular-component similarity network and the target molecular-function similarity network are combined into the target multilayer network $G_T=\{G_{jT}=(V_{jT},E_{jT})\}$, where $j$ is the target network index, $j\in[1,6]$.
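The decomposition in (2-2-1) turns a heterogeneous (bipartite) network into a similarity layer over one node type. The formula images are not available in this text; because the description refers to the row vectors of the adjacency matrix and to the vector norm, the sketch below assumes the edge weight is the cosine similarity of the two rows. The helper names, the toy edge list and the dictionary layout of the multilayer network are likewise assumptions.

```python
import numpy as np

def bipartite_adjacency(edges, n_rows, n_cols):
    """Adjacency matrix of a heterogeneous network, e.g. drugs (rows) x diseases (columns)."""
    B = np.zeros((n_rows, n_cols))
    for r, c in edges:
        B[r, c] = 1.0
    return B

def cosine_similarity_layer(B):
    """Assumed edge weight: x.y / (|x| |y|) for every pair of row vectors of B."""
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # rows without any link get similarity 0
    U = B / norms
    return U @ U.T

# Toy drug-disease links (assumption): drug 0 treats diseases 0 and 1, and so on.
edges = [(0, 0), (0, 1), (1, 0), (2, 2)]
B = bipartite_adjacency(edges, n_rows=3, n_cols=3)
G8D = cosine_similarity_layer(B)     # drug disease-similarity layer

# A multilayer network can be held as a mapping from layer index to weight matrix.
drug_multilayer = {8: G8D}
```

The side-effect layer and the target disease-similarity layer would be built the same way from their own bipartite matrices, and the remaining layers added to the mapping under their indices.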
(3) A feature learning module:
In research on machine learning problems, data and features determine the upper limit of the prediction result, while models and algorithms merely approach that limit. The feature encoding module of the invention addresses the first half of this statement, i.e. feature selection, so that the model algorithm learns the gene features better and achieves the most accurate prediction result. Based on the drug multilayer network $G_D$ and the target multilayer network $G_T$, the module uses structural autoencoders to encode the network structure automatically, thereby ensuring the completeness of feature extraction.
(3-1) training the structural autoencoders: one structural autoencoder is trained for each layer of the drug multilayer network $G_D$ and of the target multilayer network $G_T$; the training process is as follows:
a. the adjacency matrix of the single-layer network is used as the input of the encoder;
b. after encoding, the output of the encoder is obtained and used as the input of the decoder;
c. after decoding, the output of the decoder is obtained, and the loss function is calculated from the adjacency matrix, the encoder output and the decoder output;
d. the gradient of the loss function with respect to each parameter of the encoder and the decoder is calculated and the parameters are updated, each update step being a multiple of the negative gradient;
e. steps b to d are repeated until the loss function converges.
The loss function L_m consists of two parts:
First-order similarity loss L_1st:
Figure BDA0003187370820000101
N is the number of nodes, z_p and z_g denote the encoder's output vectors for node p and node g, respectively, and T_pg denotes the edge weight. In an interaction network, T_pg can only take the values 0 and 1, meaning no edge and an edge, respectively; in a similarity network, T_pg can take any value from 0 to 1 inclusive. This loss is defined so that drugs or targets with high similarity are encoded into feature vectors that are as similar as possible.
Second-order similarity loss L_2nd:
Figure BDA0003187370820000102
b_n and
Figure BDA0003187370820000103
denote the encoder input vector and the decoder output vector of node n, respectively. This loss is defined so that the decoder can reconstruct the original input vector from the encoded vector as closely as possible, which forces the encoded vector to retain as much information from the original vector as possible.
The total loss function is L_m = L_2nd + λ·L_1st, where λ is a penalty coefficient, 0 < λ < 1.
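The two loss terms can be illustrated with a minimal sketch. The equation images are not reproduced in this text, so the forms below are assumptions chosen to match the symbol definitions above: L_1st is taken as the sum over node pairs of T_pg times the squared distance between z_p and z_g, and L_2nd as the sum over nodes of the squared reconstruction error. The function names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def first_order_loss(T, Z):
    """Assumed form: sum over node pairs (p, g) of T[p][g] * ||z_p - z_g||^2."""
    n = len(T)
    return sum(
        T[p][g] * squared_distance(Z[p], Z[g])
        for p in range(n)
        for g in range(n)
    )


def second_order_loss(B, B_hat):
    """Assumed form: sum over nodes n of ||b_hat_n - b_n||^2."""
    return sum(squared_distance(b_hat, b) for b, b_hat in zip(B, B_hat))


def total_loss(T, Z, B_hat, lam=0.5):
    """L_m = L_2nd + lam * L_1st, with 0 < lam < 1 as stated in the text.

    The encoder input of node n is row n of the adjacency matrix T.
    """
    if not 0 < lam < 1:
        raise ValueError("lam must satisfy 0 < lam < 1")
    return second_order_loss(T, B_hat) + lam * first_order_loss(T, Z)
```

For a two-node network with a single edge, `total_loss([[0, 1], [1, 0]], Z, B_hat)` penalizes both a large distance between the two codes in `Z` and any mismatch between `B_hat` and the adjacency rows.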
(3-2) Encoding output: the encoder part of each trained structural autoencoder encodes its corresponding network layer, yielding per-layer vectors for all drugs and targets.
(3-3) Processing feature vectors of the same class:
the per-layer vectors of a drug are concatenated to obtain the final feature vector of that drug;
the per-layer vectors of a target are concatenated to obtain the final feature vector of that target.
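Step (3-3) amounts to concatenating one vector per network layer. A minimal sketch, in which the function name and the per-layer dimensions are illustrative assumptions:

```python
def final_feature_vector(layer_vectors):
    """Concatenate the per-layer encoded vectors of one drug (or one target).

    layer_vectors holds one vector per network layer, in a fixed layer order.
    """
    merged = []
    for vec in layer_vectors:
        merged.extend(vec)
    return merged
```

The layer order must be the same for every drug (and for every target), so that each position in the final vector always refers to the same network layer.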
(4) A model algorithm design module comprising:
(4-1) Constructing training samples: drug-target pairs comprise verified pairs and unverified pairs, and the unverified pairs include pairs that do interact but have not yet been discovered. The invention aims to find these undiscovered interacting pairs among the unverified pairs. It can therefore be assumed that the probability that an unverified drug-target pair interacts is no greater than that of a verified interacting pair. Based on this assumption, a PairWise model is used to construct training samples: a positive sample is drawn from the verified interacting drug-target pairs, a negative sample is drawn from the unverified pairs, and the two are paired, giving equal numbers of positive and negative training samples. The data are then randomly divided into M parts for M-fold cross-validation: each time, one part serves as the validation set and the rest as the training set, and the model parameters are tuned according to the overall cross-validation performance, where M is a positive integer greater than 3.
(4-2) Training and evaluating the model: a lightweight gradient boosting decision tree is used, with decision trees as weak learners, to build a boosted tree; that is, decision trees T(x, θ_l) are constructed iteratively, where x and θ_l are the input feature vector and the learnable parameters of the l-th decision tree, respectively. The specific process is as follows:
(4-2-1) before each round of decision tree construction, samples are filtered with the Gradient-based One-Side Sampling (GOSS) algorithm: the small fraction of samples with large gradients is retained and a random subset of the small-gradient samples is selected to compute the overall variance gain, thereby reducing the number of samples;
(4-2-2) before each round of decision tree construction, mutually exclusive features are merged with the Exclusive Feature Bundling (EFB) algorithm, thereby reducing the feature dimension;
(4-2-3) based on the filtered samples, a fitting target is constructed for the l-th decision tree to be generated, given a sample's input feature vector x and its label y: if l = 1, the fitting target is the sample's label, where positive samples are labeled 1 and negative samples 0; if l ≥ 2, the fitting target is
Figure BDA0003187370820000111
where the boosted tree obtained after the (l−1)-th iteration is
Figure BDA0003187370820000112
L is the loss function; in the binary classification task, when a single sample (x, y) has the predicted value
Figure BDA0003187370820000113
the loss function is defined as:
Figure BDA0003187370820000114
(4-2-4) based on the filtered samples and the fitting target, a binary decision tree is constructed. A leaf node of the tree is split as follows: for each retained feature, a histogram is built over the feature's value range and used to compute the variance gain at each split point; the feature and split point with the largest variance gain are chosen as the splitting feature and optimal split point of the current node, and the data in that leaf node are divided into two groups at the optimal split point. The recursion continues until the maximum tree depth is reached. The variance gain of feature f at split point d on dataset D is expressed as:
Figure BDA0003187370820000115
where x_l, x_{l,f}, and g_l denote the l-th sample vector, the f-th feature of the l-th sample vector, and its negative gradient, respectively, and
Figure BDA0003187370820000121
and
Figure BDA0003187370820000122
correspond to the samples in dataset D whose feature f is smaller than and larger than the split point d, respectively.
(4-2-5) performing K rounds of iteration to generate K decision trees;
(4-2-6) the K decision trees are summed to produce the final lightweight gradient boosting decision tree
Figure BDA0003187370820000123
For a sample's input feature vector x, the output H(x) ∈ [0, 1] can be interpreted as the probability that the input sample is a positive sample;
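Steps (4-2-1) and (4-2-4) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation. The sampling ratios `a` and `b`, the re-weighting factor (1 − a)/b for sampled small-gradient instances, and the gain expression (the squared gradient sums of the two sides, each divided by its sample count, then divided by the total count) follow the commonly published description of GOSS and are not given in the text above. Equal-width bin edges stand in for histogram bins, samples equal to the split point go to the left side, and all function names are hypothetical.

```python
import random


def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Return (index, weight) pairs: the top a-fraction of samples by absolute
    gradient, plus a random b-fraction drawn from the remaining samples."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_k = int(round(a * n))
    rand_k = int(round(b * n))
    rng = random.Random(seed)
    small = rng.sample(order[top_k:], rand_k)
    weight = (1 - a) / b  # assumed compensation for under-sampled small gradients
    return [(i, 1.0) for i in order[:top_k]] + [(i, weight) for i in small]


def variance_gain(values, gradients, d):
    """Assumed gain of splitting one feature at point d."""
    left = [g for v, g in zip(values, gradients) if v <= d]
    right = [g for v, g in zip(values, gradients) if v > d]
    if not left or not right:
        return 0.0
    total = sum(left) ** 2 / len(left) + sum(right) ** 2 / len(right)
    return total / len(values)


def best_split(values, gradients, n_bins=4):
    """Scan equal-width bin edges and return the edge with the largest gain."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    candidates = [lo + width * k for k in range(1, n_bins)]
    return max(candidates, key=lambda d: variance_gain(values, gradients, d))
```

For simplicity the sketch does not apply the GOSS weights inside `variance_gain`; a fuller version would multiply each sampled small-gradient contribution by its weight before summing.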
(4-3) Predicting drug-target interactions: using the optimal prediction model obtained through the result evaluation module, the interaction probability of every drug-target pair is computed, and the pairs with high probability are selected as candidate interacting drug-target pairs, which constitute the prediction result.
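The sample construction and M-fold split of step (4-1) can be sketched as below. The function names, the use of uniform random sampling for negatives, and the fixed seeds are assumptions made for illustration; the sketch also assumes there are at least as many unverified pairs as verified ones.

```python
import random


def build_pairwise_samples(verified, unverified, seed=0):
    """Pair each verified (positive) drug-target pair with one randomly drawn
    unverified (negative) pair, giving equal numbers of positives and negatives."""
    rng = random.Random(seed)
    negatives = rng.sample(unverified, len(verified))
    return list(zip(verified, negatives))


def m_fold_splits(samples, m, seed=0):
    """Randomly divide the samples into m parts and return a list of
    (training set, validation set) tuples, one per fold."""
    if m <= 3:
        raise ValueError("the text requires M to be an integer greater than 3")
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    folds = [shuffled[i::m] for i in range(m)]
    splits = []
    for k in range(m):
        train = [s for i, fold in enumerate(folds) if i != k for s in fold]
        splits.append((train, folds[k]))
    return splits
```

Each element of the result holds the training and validation samples for one fold; model parameters would then be tuned on the performance averaged over all M folds.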
(5) The result evaluation module verifies the prediction performance of the model using the ROC curve and the PR curve, as follows:
(5-1) Plotting the ROC curve: plotting the ROC curve requires a confusion matrix. The confusion matrix is itself a model evaluation metric: it is a square matrix that displays the accuracy of the predictions. Each column represents a predicted class, and the column total is the number of instances predicted as that class; each row represents the true class, and the row total is the number of instances that belong to that class.
The ROC curve is a method for evaluating classifier performance that was introduced from the field of medical analysis and suits binary classification problems. When the curve is plotted, the false positive rate (FPR) is the horizontal axis and the true positive rate (TPR) is the vertical axis. The larger the area under the ROC curve (AUROC), that is, the closer it is to 1, the better the model's prediction performance.
The true positive rate TPR_α and the false positive rate FPR_α of the ROC curve are computed from the confusion matrix as follows:
$$\mathrm{TPR}_\alpha = \frac{TP_\alpha}{TP_\alpha + FN_\alpha}, \qquad \mathrm{FPR}_\alpha = \frac{FP_\alpha}{FP_\alpha + TN_\alpha}$$
In drug-target prediction, a drug-target pair that interacts is a positive sample and a pair that does not is a negative sample. TP_α is the number of positive samples in the test set predicted as positive, FP_α is the number of negative samples in the test set predicted as positive, FN_α is the number of positive samples predicted as negative, and TN_α is the number of negative samples in the test set predicted as negative; α is the prediction confidence;
(5-2) Plotting the PR curve: plotting the PR curve requires a precision-recall sequence, formed by the precision precision_α and the recall recall_α at different prediction confidences α, computed as follows:
$$\mathrm{precision}_\alpha = \frac{TP_\alpha}{TP_\alpha + FP_\alpha}, \qquad \mathrm{recall}_\alpha = \frac{TP_\alpha}{TP_\alpha + FN_\alpha}$$
Precision is the fraction of samples predicted positive at confidence α that are truly positive, and recall is the fraction of all positive samples that are correctly classified at confidence α; the two tend to move in opposite directions as α changes. The sequence of precision-recall pairs generated by different α is therefore used to plot a precision-recall curve, i.e., the PR curve, with recall on the horizontal axis and precision on the vertical axis. The area under the PR curve (AUPR) reflects the overall classification performance of the classifier: the larger the AUPR, that is, the closer it is to 1, the better the model's prediction performance;
(5-3) Model evaluation: based on the prediction results of step (4-3), AUROC and AUPR are computed from the plotted ROC and PR curves, and the model parameters that give the best prediction results are sought.
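The quantities used in steps (5-1) and (5-2) can be computed with a short sketch. It assumes that a pair is predicted positive when its score is at least α, that both classes are present in the test set, and that precision is reported as 1.0 when nothing is predicted positive; these conventions and all function names are illustrative assumptions. The area is computed with the trapezoid rule over the points in threshold order.

```python
def confusion_counts(scores, labels, alpha):
    """Count TP, FP, FN, TN when a sample is predicted positive if score >= alpha."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= alpha
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def roc_point(scores, labels, alpha):
    """Return (FPR, TPR) at confidence alpha."""
    tp, fp, fn, tn = confusion_counts(scores, labels, alpha)
    return fp / (fp + tn), tp / (tp + fn)


def pr_point(scores, labels, alpha):
    """Return (recall, precision) at confidence alpha."""
    tp, fp, fn, _ = confusion_counts(scores, labels, alpha)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    return tp / (tp + fn), precision


def area_under(points):
    """Trapezoid-rule area under a curve given as (x, y) points in plotting order."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def auroc(scores, labels):
    """Area under the ROC curve, sweeping alpha over the distinct scores."""
    thresholds = sorted(set(scores), reverse=True)
    points = [(0.0, 0.0)] + [roc_point(scores, labels, a) for a in thresholds]
    return area_under(points)
```

The same `area_under` helper can be applied to a list of `pr_point` results to estimate AUPR, although conventions differ on how the curve is started at recall 0.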
Screening candidate drugs is a main way in which AI assists new drug development. In this task, the computational representation of drugs and targets (i.e., which data structure represents them) and the choice of prediction model are the two most critical steps. The method represents drugs and targets in two different ways at different stages: as network nodes and as feature vectors. Both representations are described below, using drugs as the example.
Drug networks reflect the relationships between drugs well, and a multilayer network built from different types of drug networks reflects those relationships from different angles, providing a new approach to drug screening. Specifically, a drug network represents each drug as a node and defines the relationships between drugs as the edges between nodes. Edges are defined differently in different types of drug networks, so each network expresses the relationship between drug pairs from a different perspective. In the chemical similarity network of drugs, for example, the edge weight between a pair of nodes is the chemical structure similarity of the corresponding drug pair, and the absence of an edge means the similarity is 0. When a drug network is constructed, the edge weights are usually normalized so that they range from 0 to 1.
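A single-layer drug similarity network can be held as a weighted adjacency matrix. The sketch below is illustrative only: the text does not specify how weights are normalized, so dividing by the largest weight is an assumption, as are the function name and the (drug, drug, weight) edge format.

```python
def build_similarity_network(drugs, weighted_edges):
    """Build a symmetric adjacency matrix with edge weights scaled into [0, 1].

    drugs: list of drug identifiers (the node set).
    weighted_edges: list of (drug_u, drug_v, weight) with non-negative weights.
    A missing edge keeps the value 0.0, i.e., similarity 0.
    """
    index = {drug: i for i, drug in enumerate(drugs)}
    n = len(drugs)
    adjacency = [[0.0] * n for _ in range(n)]
    max_weight = max((w for _, _, w in weighted_edges), default=0.0)
    for u, v, w in weighted_edges:
        scaled = w / max_weight if max_weight > 0 else 0.0  # assumed normalization
        adjacency[index[u]][index[v]] = scaled
        adjacency[index[v]][index[u]] = scaled
    return adjacency
```

Each row of the resulting matrix is the input vector of one node for the structural autoencoder of that layer.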
A feature vector is an array of real numbers, each of which is a feature value that carries specific information in the application. In the method, drug feature vectors are obtained by structural autoencoders that encode the drug networks, so the feature values contain the topological information of the networks. An autoencoder is a self-supervised representation learning method: from the input alone (here, a drug network) it converts nodes into feature vectors whose dimension is far smaller than the number of nodes. Compared with conventional one-hot encoding, this greatly reduces the complexity and sparsity of the data. The structural autoencoder used in the method takes both the first-order and the second-order proximity of the network into account and therefore captures the overall network structure more comprehensively.
Network representation, vector encoding, and prediction model training for drugs and targets are the core aspects on which drug-target prediction algorithms are compared. The algorithmic model avoids the blindness of manual screening and greatly reduces time and capital costs. By integrating information on different aspects of drugs and targets into a uniform data form, and by using several relatively independent and clearly defined modules, it provides a feasible paradigm for future drug-target prediction, improving prediction accuracy while keeping the algorithm efficient, flexible, and extensible.

Claims (8)

1. A drug target interaction prediction method based on multilayer networks and graph encoding, comprising a data acquisition module, a data preprocessing module, a feature learning module, a model algorithm design module, and a result evaluation module, characterized in that:
(1) the data acquisition module comprises:
(1-1) for drugs, collecting drug-drug interaction data, drug-disease relationship data, drug-side effect relationship data, and six different types of drug-pair similarity data, namely: drug chemical fingerprint data, drug therapeutic data, peptide chain data of drug action targets, drug biological process data, drug molecular function data, and drug action cellular component data;
(1-2) for targets, i.e., proteins, collecting target-target interaction data, target-disease relationship data, and four different types of target-pair similarity data, namely: target peptide chain data, target biological process data, data on the cellular component where the target is located, and target molecular function data;
(1-3) collecting drug-target interaction data;
(2) the data preprocessing module comprises constructing drug- and target-related networks and generating multilayer networks;
(2-1) constructing the drug- and target-related networks comprises:
A. for interaction data between objects of a single class, constructing homogeneous interaction networks, including the drug interaction network G_1D and the target interaction network G_1T;
B. for interaction data between objects of different classes, constructing heterogeneous interaction networks, including the drug-disease association network G_{D_DI}, the drug-side effect association network G_{D_SE}, and the target-disease association network G_{T_DI};
C. collecting drug information along different dimensions and constructing drug similarity networks, including the chemical similarity network G_2D, the therapeutic similarity network G_3D, the action-target sequence similarity network G_4D, the biological process similarity network G_5D, the molecular function similarity network G_6D, and the action cellular component similarity network G_7D of drugs;
D. collecting target information along different dimensions and constructing target similarity networks, including the target sequence similarity network G_2T, the target biological process similarity network G_3T, the similarity network of the cellular component where the target is located G_4T, and the target molecular function similarity network G_5T;
E. constructing the drug-target interaction network G_{D_T};
(2-2) generating the multilayer networks comprises generating a drug multilayer network and a target multilayer network, specifically:
(2-2-1) first, the drug-disease association network G_{D_DI} is decomposed and converted into the disease similarity network of drugs G_8D = (V_8D, E_8D), where V_8D and E_8D denote the set of drug nodes in the network and the set of edge weights of disease similarity between two drugs, respectively; the edge weight of disease similarity of drugs is
Figure FDA0003187370810000021
where x_{D_M} and y_{D_M} denote the row vectors of the two drugs in the adjacency matrix of G_{D_DI}, and ||·|| denotes the vector norm;
the drug-side effect association network G_{D_SE} is decomposed and converted into the side-effect similarity network of drugs G_9D = (V_9D, E_9D), where V_9D and E_9D denote the set of drug nodes in the network and the set of edge weights of side-effect similarity between two drugs, respectively; the edge weight of side-effect similarity of drugs is
Figure FDA0003187370810000022
where x_{D_SE} and y_{D_SE} denote the row vectors of the two drugs in the adjacency matrix of G_{D_SE};
the target-disease association network G_{T_DI} is decomposed and converted into the disease similarity network of targets G_6T = (V_6T, E_6T), where V_6T and E_6T denote the set of target nodes in the network and the set of edge weights of disease similarity between two targets, respectively; the edge weight of disease similarity of targets is
Figure FDA0003187370810000023
where x_{T_DI} and y_{T_DI} denote the row vectors of the two targets in the adjacency matrix of G_{T_DI};
(2-2-2) then, the drug-related networks are combined into the drug multilayer network G_D = {G_iD = (V_iD, E_iD)}, where i is the drug network index, i ∈ [1, 9]; the target-related networks are combined into the target multilayer network G_T = {G_jT = (V_jT, E_jT)}, where j is the target network index, j ∈ [1, 6];
(3) the feature learning module comprises training structural autoencoders, encoding output, and processing feature vectors of the same class;
(3-1) training structural autoencoders: one structural autoencoder is trained for each layer of the drug multilayer network G_D and of the target multilayer network G_T;
(3-2) encoding output: the encoder part of each trained structural autoencoder encodes its corresponding network layer, yielding per-layer vectors for all drugs and targets;
(3-3) processing feature vectors of the same class: the per-layer vectors of a drug are concatenated to obtain the final feature vector of that drug; the per-layer vectors of a target are concatenated to obtain the final feature vector of that target;
(4) the model algorithm design module comprises:
(4-1) constructing training samples: a PairWise model is used to construct training samples; the data are randomly divided into M parts and M-fold cross-validation is performed, that is, each time one part serves as the validation set and the rest as the training set, and the model parameters are tuned according to the overall cross-validation performance, where M is a positive integer greater than 3;
(4-2) training and evaluating the model: a lightweight gradient boosting decision tree is used, with decision trees as weak learners, to build a boosted tree, that is, decision trees T(x, θ_l) are constructed iteratively, where x and θ_l are the input feature vector and the learnable parameters of the l-th decision tree, respectively;
(4-3) predicting drug-target interactions: using the optimal prediction model obtained through the result evaluation module, the interaction probability of every drug-target pair is computed, and the pairs with high probability are selected as candidate interacting drug-target pairs, which constitute the prediction result;
(5) the result evaluation module verifies the prediction performance of the model using the ROC curve and the PR curve, specifically:
(5-1) plotting the ROC curve: the false positive rate FPR is the horizontal axis and the true positive rate TPR is the vertical axis; the larger the area under the ROC curve (AUROC), the better the model's prediction performance; the true positive rate TPR_α and the false positive rate FPR_α of the ROC curve are computed from the confusion matrix as follows:
$$\mathrm{TPR}_\alpha = \frac{TP_\alpha}{TP_\alpha + FN_\alpha}, \qquad \mathrm{FPR}_\alpha = \frac{FP_\alpha}{FP_\alpha + TN_\alpha}$$
a drug-target pair that interacts is a positive sample and a pair that does not is a negative sample; TP_α is the number of positive samples in the test set predicted as positive, FP_α is the number of negative samples in the test set predicted as positive, FN_α is the number of positive samples predicted as negative, and TN_α is the number of negative samples in the test set predicted as negative; α is the prediction confidence;
(5-2) plotting the PR curve: the precision precision_α and the recall recall_α at different prediction confidences α form the precision-recall sequence:
$$\mathrm{precision}_\alpha = \frac{TP_\alpha}{TP_\alpha + FP_\alpha}, \qquad \mathrm{recall}_\alpha = \frac{TP_\alpha}{TP_\alpha + FN_\alpha}$$
a precision-recall curve, i.e., the PR curve, is plotted with recall on the horizontal axis and precision on the vertical axis; the area under the PR curve (AUPR) reflects the overall classification performance of the classifier, and the larger the AUPR, the better the model's prediction performance;
(5-3) model evaluation: based on the prediction results of (4-3), AUROC and AUPR are computed from the plotted ROC and PR curves, and the model parameters that give the best prediction results are sought.
2. The drug target interaction prediction method based on multilayer networks and graph encoding according to claim 1, characterized in that A in (2-1) is specifically:
for drug-drug interaction data, constructing the drug interaction network G_1D = (V_1D, E_1D), where V_1D denotes the set of drug nodes in the network and E_1D denotes the set of edges between drugs that interact;
for target-target interaction data, constructing the target interaction network G_1T = (V_1T, E_1T), where V_1T denotes the set of target nodes in the network and E_1T denotes the set of edges between targets that interact.
3. The drug target interaction prediction method based on multilayer networks and graph encoding according to claim 1, characterized in that B in (2-1) is specifically:
for drug-disease relationship data, constructing the drug-disease association network
Figure FDA0003187370810000041
where
Figure FDA0003187370810000042
and E_{D_DI} denote the set of drug nodes, the set of disease nodes, and the set of drug-disease edges in the network, respectively;
for drug-side effect relationship data, constructing the drug-side effect association network
Figure FDA0003187370810000043
where
Figure FDA0003187370810000044
and E_{D_SE} denote the set of drug nodes, the set of side-effect nodes, and the set of drug-side effect edges in the network, respectively;
for target-disease relationship data, constructing the target-disease association network
Figure FDA0003187370810000045
where
Figure FDA0003187370810000046
and E_{T_DI} denote the set of target nodes, the set of disease nodes, and the set of target-disease edges in the network, respectively.
4. The drug target interaction prediction method based on multilayer networks and graph encoding according to claim 1, characterized in that C in (2-1) is specifically:
for drug chemical fingerprint data, constructing the chemical similarity network of drugs G_2D = (V_2D, E_2D), where V_2D and E_2D denote the set of drug nodes in the network and the set of edge weights of chemical similarity between two drugs, respectively; the edge weight of chemical similarity is
Figure FDA0003187370810000047
where a_1 and b_1 are the numbers of bits in the MACCS fingerprints of the two drugs, and c_1 is the number of bits that are identical in the two drugs;
for drug therapeutic data, constructing the therapeutic similarity network of drugs G_3D = (V_3D, E_3D), where V_3D and E_3D denote the set of drug nodes in the network and the set of edge weights of therapeutic similarity between two drugs, respectively; the edge weight of therapeutic similarity is
Figure FDA0003187370810000051
where a_2 and b_2 are the ATC codes of the two drugs, and c_2 is the number of digits that the two ATC codes have in common;
for peptide chain data of drug action targets, constructing the action-target sequence similarity network of drugs G_4D = (V_4D, E_4D), where V_4D and E_4D denote the set of drug nodes in the network and the set of edge weights of action-target similarity between two drugs, respectively; the edge weight of action-target similarity of drugs is
Figure FDA0003187370810000052
where a and b denote the respective targets of the two drugs, T_{T_T}(a, b) denotes the sequence similarity of those targets, and mean(·) denotes the average;
for drug biological process data, constructing the biological process similarity network of drugs G_5D = (V_5D, E_5D), where V_5D and E_5D denote the set of drug nodes in the network and the set of edge weights of biological process similarity between two drugs, respectively; the edge weight of biological process similarity of drugs is
Figure FDA0003187370810000053
where T_{T_P}(a, b) denotes the biological process similarity of the respective targets of the two drugs;
for drug molecular function data, constructing the molecular function similarity network of drugs G_6D = (V_6D, E_6D), where V_6D and E_6D denote the set of drug nodes in the network and the set of edge weights of molecular function similarity between two drugs, respectively; the edge weight of molecular function similarity of drugs is
Figure FDA0003187370810000054
where T_{T_M}(a, b) denotes the molecular function similarity of the respective targets of the two drugs;
for drug action cellular component data, constructing the action cellular component similarity network of drugs G_7D = (V_7D, E_7D), where V_7D and E_7D denote the set of drug nodes in the network and the set of edge weights of action cellular component similarity between two drugs, respectively; the edge weight of action cellular component similarity of drugs is
Figure FDA0003187370810000055
where T_{T_C}(a, b) denotes the cellular component similarity of the respective targets of the two drugs.
5. The drug target interaction prediction method based on multilayer networks and graph encoding according to claim 1, characterized in that D in (2-1) is specifically:
for target peptide chain data, constructing the target sequence similarity network G_2T = (V_2T, E_2T), where V_2T and E_2T denote the set of target nodes in the network and the set of edge weights of sequence similarity between two targets, respectively; the edge weight of sequence similarity is
Figure FDA0003187370810000061
where a_3 and b_3 are the numbers of positions in the peptide chain sequences of the two targets, and c_3 is the number of positions at which the two sequences are identical;
for target biological process data, constructing the target biological process similarity network G_3T = (V_3T, E_3T), where V_3T and E_3T denote the set of target nodes in the network and the set of edge weights of biological process similarity between two targets, respectively; the edge weight of biological process similarity T_{T_P}(a, b) is obtained from the GO semantic annotations of the biological processes of the two targets;
for data on the cellular component where the target is located, constructing the similarity network of the cellular component where the target is located G_4T = (V_4T, E_4T), where V_4T and E_4T denote the set of target nodes in the network and the set of edge weights of cellular component similarity between two targets, respectively; the edge weight of cellular component similarity T_{T_C}(a, b) is obtained from the GO semantic annotations of the cellular components where the two targets are located;
for target molecular function data, constructing the target molecular function similarity network G_5T = (V_5T, E_5T), where V_5T and E_5T denote the set of target nodes in the network and the set of edge weights of molecular function similarity between two targets, respectively; the edge weight of molecular function similarity T_{T_M}(a, b) is obtained from the GO semantic annotations of the molecular functions of the two targets.
6. The drug target interaction prediction method based on multilayer network and graph coding according to claim 1, characterized in that E in (2-1) is specifically:
for the drug-target interaction data, construct the drug-target interaction network
Figure FDA0003187370810000062
where
Figure FDA0003187370810000063
and ED_T denote, respectively, the set of drug nodes, the set of target nodes, and the set of drug-target edges in the network.
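As an illustration only, a drug-target interaction network of this kind can be stored as a bipartite 0/1 matrix with one row per drug node and one column per target node. The identifiers `D1`, `T1` and the helper name `build_interaction_matrix` are hypothetical and do not come from the patent.

```python
def build_interaction_matrix(drugs, targets, interactions):
    """Return a len(drugs) x len(targets) 0/1 matrix for a bipartite network."""
    drug_index = {d: i for i, d in enumerate(drugs)}
    target_index = {t: j for j, t in enumerate(targets)}
    matrix = [[0] * len(targets) for _ in drugs]
    for drug, target in interactions:
        # An edge in the drug-target edge set becomes a 1 in the matrix.
        matrix[drug_index[drug]][target_index[target]] = 1
    return matrix


# Hypothetical toy data: three drugs, two targets, three known interactions.
drugs = ["D1", "D2", "D3"]
targets = ["T1", "T2"]
edges = [("D1", "T1"), ("D2", "T1"), ("D3", "T2")]
interaction_matrix = build_interaction_matrix(drugs, targets, edges)
```

Entries equal to 1 mark known drug-target pairs; all other entries are 0.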
7. The drug target interaction prediction method based on multilayer network and graph coding according to claim 1, characterized in that the training process in (3-1) is:
a. take the adjacency matrix of the single-layer network as the input of the encoder;
b. encode it to obtain the encoder output, and use that output as the input of the decoder;
c. decode it to obtain the decoder output, and compute the loss function from the adjacency matrix, the encoder output, and the decoder output;
d. use the loss function to compute the gradient of each parameter of the encoder and the decoder, and update the parameters with a step that is a multiple of the negative gradient;
e. repeat steps b to d until the loss function converges;
the loss function L_m consists of two parts:
the first-order similarity loss
Figure FDA0003187370810000071
where N is the number of nodes, z_p and z_g denote the encoder output vectors for node p and node g, respectively, and T_pg denotes the weight of the edge between them;
the second-order similarity loss
Figure FDA0003187370810000072
where b_n and
Figure FDA0003187370810000073
denote the encoder input vector and the decoder output vector of node n, respectively;
the total loss function is L_m = L_2nd + λL_1st, where λ is a penalty coefficient with 0<λ<1.
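The two loss terms appear in the source only as formula images, so the sketch below assumes the forms used in structural deep network embedding, which are consistent with the symbol definitions in the claim: a first-order term summing T_pg·||z_p − z_g||² over node pairs and a second-order term summing ||b̂_n − b_n||² over nodes. Any per-element reweighting of the reconstruction error is omitted, and all function names are hypothetical.

```python
def first_order_loss(T, Z):
    """Assumed form: sum over node pairs of T[p][g] * ||z_p - z_g||^2."""
    n = len(T)
    total = 0.0
    for p in range(n):
        for g in range(n):
            dist2 = sum((a - b) ** 2 for a, b in zip(Z[p], Z[g]))
            total += T[p][g] * dist2
    return total


def second_order_loss(B, B_hat):
    """Assumed form: sum over nodes of ||b_hat_n - b_n||^2."""
    return sum(
        sum((bh - b) ** 2 for bh, b in zip(row_hat, row))
        for row_hat, row in zip(B_hat, B)
    )


def total_loss(T, Z, B_hat, lam):
    """L_m = L_2nd + lam * L_1st.

    The encoder input b_n is taken to be row n of the adjacency matrix T,
    as in step a of the training process.
    """
    return second_order_loss(T, B_hat) + lam * first_order_loss(T, Z)


# Toy two-node network with one edge of weight 1.
T = [[0.0, 1.0], [1.0, 0.0]]
Z = [[0.0, 0.0], [1.0, 0.0]]        # encoder outputs z_p
B_hat = [[0.0, 0.5], [1.0, 0.0]]    # decoder outputs b_hat_n
loss = total_loss(T, Z, B_hat, lam=0.5)
```

In a real training loop these values would be differentiated with respect to the encoder and decoder parameters; the sketch only evaluates the loss for fixed outputs.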
8. The drug target interaction prediction method based on multilayer network and graph coding according to claim 1, characterized in that the specific process of (4-2) is as follows:
(4-2-1) before constructing the decision tree in each round, use the gradient-based one-side sampling (GOSS) algorithm to filter out small-gradient samples, that is, keep the small fraction of samples with large gradients and randomly select a portion of the small-gradient samples, and use them to compute the overall variance gain;
(4-2-2) before constructing the decision tree in each round, use the exclusive feature bundling (EFB) algorithm to merge mutually exclusive features;
(4-2-3) based on the filtered samples, given the input feature vector x of a sample and its label y, construct the fitting target for the l-th decision tree to be generated: if l=1, the fitting target is the sample label, where a positive sample is labeled 1 and a negative sample is labeled 0; if l≥2, the fitting target is
Figure FDA0003187370810000074
where the boosted tree obtained after round l-1 is
Figure FDA0003187370810000075
and L is the loss function; in the binary classification task, the loss of a single sample (x,y) whose predicted value is
Figure FDA0003187370810000076
is defined as:
Figure FDA0003187370810000077
(4-2-4) based on the filtered samples and the fitting target, build a binary decision tree; a leaf node of the tree is split as follows: for each retained feature, build a histogram over the value range of the feature, use the histogram to compute the variance gain at each candidate split point, select the feature and split point with the largest variance gain as the splitting feature and optimal split point of the current node, and split the data of that leaf node into two parts; recurse until the maximum depth of the tree is reached; the variance gain of feature f at split point d on dataset D is expressed as:
Figure FDA0003187370810000081
where x_l, x_l,f and g_l denote the l-th sample vector, the f-th feature of the l-th sample vector, and its negative gradient, respectively, and
Figure FDA0003187370810000082
and
Figure FDA0003187370810000083
are the numbers of samples in dataset D whose feature f is smaller than the split point d and larger than the split point d, respectively;
(4-2-5) perform K rounds of iteration to generate K decision trees;
(4-2-6) sum the K decision trees to produce the final light gradient boosting decision tree
Figure FDA0003187370810000084
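Steps (4-2-1) and (4-2-4) can be sketched as follows. The variance-gain formula is an image in the source, so the form below is assumed to follow the formulation published for LightGBM: (Σ_left g)²/n_left + (Σ_right g)²/n_right, divided by the number of samples. For brevity the sketch scans midpoints between distinct feature values instead of histogram bins, and treats values equal to the split point as belonging to the left side. The function names and the rates `top_rate` and `other_rate` are hypothetical.

```python
import random


def goss_sample(gradients, top_rate, other_rate, rng):
    """Keep the largest-|gradient| samples plus a random subset of the rest.

    Returns (indices, weights); sampled small-gradient items are
    up-weighted by (1 - top_rate) / other_rate.
    """
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = int(top_rate * n)
    rand_n = int(other_rate * n)
    top, rest = order[:top_n], order[top_n:]
    sampled = rng.sample(rest, min(rand_n, len(rest)))
    amplify = (1.0 - top_rate) / other_rate
    return top + sampled, [1.0] * len(top) + [amplify] * len(sampled)


def variance_gain(values, gradients, split):
    """Assumed variance gain of splitting one feature at `split`."""
    left = [g for v, g in zip(values, gradients) if v <= split]
    right = [g for v, g in zip(values, gradients) if v > split]
    if not left or not right:
        return 0.0
    total = sum(left) ** 2 / len(left) + sum(right) ** 2 / len(right)
    return total / len(values)


def best_split(values, gradients):
    """Return (split point, gain) with the largest variance gain."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    scored = ((d, variance_gain(values, gradients, d)) for d in candidates)
    return max(scored, key=lambda pair: pair[1], default=(None, 0.0))


# Toy data: eight samples, keep the top 25% and sample 50% of all samples
# from the remainder.
grads = [0.1, -0.9, 0.2, 0.8, -0.05, 0.3, 0.02, -0.4]
kept, weights = goss_sample(grads, 0.25, 0.5, random.Random(0))

# Toy single-feature split search.
split, gain = best_split([1.0, 2.0, 3.0, 4.0], [-1.0, -1.0, 1.0, 1.0])
```

In the toy split search, the two negative-gradient samples fall on one side of the chosen point and the two positive-gradient samples on the other, which is the separation the variance gain rewards.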

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110865457.9A CN113571125A (en) 2021-07-29 2021-07-29 Drug target interaction prediction method based on multilayer network and graph coding

Publications (1)

Publication Number Publication Date
CN113571125A true CN113571125A (en) 2021-10-29

Family

ID=78169065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110865457.9A Pending CN113571125A (en) 2021-07-29 2021-07-29 Drug target interaction prediction method based on multilayer network and graph coding

Country Status (1)

Country Link
CN (1) CN113571125A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114023464A (en) * 2021-11-08 2022-02-08 东北林业大学 A Drug-Target Interaction Prediction Method Based on Contrastive Learning of Supervised Synergy Graphs
CN114038499A (en) * 2021-11-12 2022-02-11 东南大学 Traditional Chinese medicine prescription active ingredient group prediction method based on heterogeneous network embedding
CN114067905A (en) * 2021-11-08 2022-02-18 大连大学 A drug-target interaction prediction method incorporating multilayer drug structure information
CN114171114A (en) * 2021-12-06 2022-03-11 中山大学 Method and device for constructing drug target prediction model, storage medium and electronic equipment
CN114334038A (en) * 2021-12-31 2022-04-12 杭州师范大学 Disease drug prediction method based on heterogeneous network embedded model
CN114944191A (en) * 2022-06-21 2022-08-26 湖南中医药大学 A component-target interaction prediction method based on web crawler and multimodal features
CN114974408A (en) * 2022-05-26 2022-08-30 浙江大学 Construction method, prediction method and device of drug interaction prediction model
WO2023123168A1 (en) * 2021-12-30 2023-07-06 Boe Technology Group Co., Ltd. Method of generating negative sample set for predicting macromolecule-macromolecule interaction, method of predicting macromolecule-macromolecule interaction, method of training model
CN119476258A (en) * 2024-11-21 2025-02-18 浪潮云信息技术股份公司 Medical patent information analysis method, device, equipment and medium based on big model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111785320A (en) * 2020-06-28 2020-10-16 西安电子科技大学 Drug-target interaction prediction method based on multi-layer network representation learning
CN112382411A (en) * 2020-11-13 2021-02-19 大连理工大学 Drug-protein targeting effect prediction method based on heterogeneous graph
US20210142173A1 (en) * 2019-11-12 2021-05-13 The Cleveland Clinic Foundation Network-based deep learning technology for target identification and drug repurposing
CN112926627A (en) * 2021-01-28 2021-06-08 电子科技大学 Equipment defect time prediction method based on capacitive equipment defect data
CN113066526A (en) * 2021-04-08 2021-07-02 北京大学 Hypergraph-based drug-target-disease interaction prediction method
CN116206775A (en) * 2023-01-13 2023-06-02 大连大学 Multi-dimensional characteristic fusion medicine-target interaction prediction method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Chang Sun et al.: "Graph Convolutional Autoencoder and Generative Adversarial Network-Based Method for Predicting Drug-Target Interactions", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 19, no. 1, 1 June 2020 (2020-06-01), pages 455-464 *
Chuang Liu et al.: "Computational network biology: Data, models, and applications", Physics Reports, vol. 846, 30 December 2019 (2019-12-30), pages 1-66, XP086088178, DOI: 10.1016/j.physrep.2019.12.004 *
Fangping Wan et al.: "NeoDTI: neural integration of neighbor information from a heterogeneous network for discovering new drug–target interactions", Bioinformatics, vol. 35, no. 1, 2 July 2018 (2018-07-02), pages 104-111 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114023464A (en) * 2021-11-08 2022-02-08 东北林业大学 A Drug-Target Interaction Prediction Method Based on Contrastive Learning of Supervised Synergy Graphs
CN114067905A (en) * 2021-11-08 2022-02-18 大连大学 A drug-target interaction prediction method incorporating multilayer drug structure information
CN114023464B (en) * 2021-11-08 2022-08-09 东北林业大学 Drug-target interaction prediction method based on supervised synergy map contrast learning
CN114038499A (en) * 2021-11-12 2022-02-11 东南大学 Traditional Chinese medicine prescription active ingredient group prediction method based on heterogeneous network embedding
CN114171114A (en) * 2021-12-06 2022-03-11 中山大学 Method and device for constructing drug target prediction model, storage medium and electronic equipment
WO2023123168A1 (en) * 2021-12-30 2023-07-06 Boe Technology Group Co., Ltd. Method of generating negative sample set for predicting macromolecule-macromolecule interaction, method of predicting macromolecule-macromolecule interaction, method of training model
CN114334038A (en) * 2021-12-31 2022-04-12 杭州师范大学 Disease drug prediction method based on heterogeneous network embedded model
CN114334038B (en) * 2021-12-31 2024-05-14 杭州师范大学 Disease medicine prediction method based on heterogeneous network embedded model
CN114974408A (en) * 2022-05-26 2022-08-30 浙江大学 Construction method, prediction method and device of drug interaction prediction model
CN114944191A (en) * 2022-06-21 2022-08-26 湖南中医药大学 A component-target interaction prediction method based on web crawler and multimodal features
CN119476258A (en) * 2024-11-21 2025-02-18 浪潮云信息技术股份公司 Medical patent information analysis method, device, equipment and medium based on big model
CN119476258B (en) * 2024-11-21 2025-09-19 浪潮云信息技术股份公司 Medicine patent information analysis method, device, equipment and medium based on large model

Similar Documents

Publication Publication Date Title
CN113571125A (en) Drug target interaction prediction method based on multilayer network and graph coding
CN111312329B (en) Transcription factor binding site prediction method based on deep convolution automatic encoder
CN113707235B (en) Drug micromolecule property prediction method, device and equipment based on self-supervision learning
CN114724623B (en) A method for predicting drug-target affinity based on multi-source protein feature fusion
CN113393911B (en) Ligand compound rapid pre-screening method based on deep learning
CN113936735A (en) A method for predicting the binding affinity of drug molecules to target proteins
CN116386729A (en) scRNA-seq data dimension reduction method based on graph neural network
CN118212974A (en) Drug target interaction prediction method based on multisource characteristic interaction
CN118430844B (en) Deep learning heterogeneous network-based adverse drug reaction prediction method and system
CN118588196A (en) A drug design method based on autoregressive model
CN119763653A (en) A method for predicting drug-target affinity based on parallel fully connected networks
CN116864031B (en) A drug-drug interaction prediction method based on RGDA-DDI
Geethu et al. Protein secondary structure prediction using cascaded feature learning model
CN118155746A (en) A dual-channel contrast model for predicting molecular properties
CN117611974A (en) Image recognition method and system based on searching of multiple group alternative evolutionary neural structures
Song et al. Drug sensitivity prediction based on multi-stage multi-modal drug representation learning
CN114898815B (en) Method and device for predicting homogeneous interaction based on spatial structure in field of drug discovery
CN116631512A (en) Prediction method of piRNA-disease association based on deep decomposition machine
CN115312125B (en) Deep learning method for predicting drug-target interaction based on biological substructure
CN116978464A (en) Data processing methods, devices, equipment and media
CN119068972B (en) Drug target interaction relation prediction method and system
CN119580825B (en) Drug target prediction model and method based on graph neural network
CN110534153B (en) Target prediction system and method based on deep learning
CN119400294A (en) Drug molecule optimization design method based on deep reinforcement learning and skeleton constraints
CN119580820A (en) Drug-target affinity prediction method based on multi-scale convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination