CN116052762A - Method and server for matching drug molecules and target proteins - Google Patents
- Publication number
- CN116052762A (CN202310094426.7A)
- Authority
- CN
- China
- Prior art keywords
- information
- characterization
- target
- protein
- target protein
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/30—Drug targeting using structural data; Docking or binding prediction
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Abstract
Description
Technical Field
This application relates to computer technology, and in particular to a method and a server for matching drug molecules with target proteins.
Background
A drug target is a specific site on a biomolecule at which a drug interacts with a biological macromolecule in the body to produce a pharmacological effect and thereby prevent or treat disease. It is the basis on which a drug acts and is highly significant in the screening of new drugs. A method for matching drug molecules with target proteins can be used to predict drug targets, which is not only indispensable for evaluating the druggability of a drug molecule at an early stage but also important in fields such as finding new uses for existing drugs once a drug is mature. Such a method can also be used to predict the drugs applicable to a target, which is important in scenarios such as combination drug therapy.
At present, most methods for matching drug molecules with target proteins match them one pair at a time to predict whether a drug molecule and a target protein interact. In practice, however, it is often necessary to screen a large number of targets (or drugs) for those that may interact with a given drug (or target), so matching one pair at a time is time-consuming and inefficient.
Summary of the Invention
The present application provides a method and a server for matching drug molecules with target proteins, to solve the problem that existing methods for matching drug molecules with target proteins are inefficient.
In a first aspect, the present application provides a method for matching a drug molecule with a target protein, including: obtaining structural information of the drug molecule to be matched, and structural information, protein family group information, and protein function description information of the target protein; determining characterization information of the drug molecule and characterization information of the target protein, where the characterization information of the drug molecule includes a structural characterization of the drug molecule, and the characterization information of the target protein includes a structural characterization, a protein family group characterization, and a protein function description characterization of the target protein; predicting correlation information between the target protein and the drug molecule according to the structural characterization of the drug molecule and the structural characterization and protein family group characterization of the target protein, and selecting candidate objects from the objects to be screened according to the correlation information, where the objects to be screened are target proteins or drug molecules; and ranking the candidate objects according to the characterization information of the drug molecule and the characterization information of the target protein, and outputting a matching result between the drug molecule and the target protein according to the ranking result.
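The two stages of the first aspect (coarse screening by correlation, then ranking of the shortlist) can be sketched as follows. This is only an illustrative sketch: the names `Candidate`, `coarse_screen`, `fine_rank`, and `match` are hypothetical, the characterizations are reduced to plain vectors, and the inner product is an assumed stand-in for the correlation prediction and the ranking model, neither of which is specified in this part of the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors (assumed scoring stand-in)."""
    return sum(x * y for x, y in zip(a, b))


@dataclass
class Candidate:
    name: str
    coarse_vec: Vector  # stand-in for structure + protein family group characterization
    full_vec: Vector    # stand-in for all characterizations, incl. function description


def coarse_screen(query: Vector, pool: List[Candidate], top_k: int) -> List[Candidate]:
    """Stage 1: keep the top_k objects with the highest correlation score."""
    ordered = sorted(pool, key=lambda c: dot(query, c.coarse_vec), reverse=True)
    return ordered[:top_k]


def fine_rank(query: Vector, shortlist: List[Candidate],
              score: Callable[[Vector, Vector], float]) -> List[Tuple[str, float]]:
    """Stage 2: rank only the shortlisted candidates using the fuller characterization."""
    ranked = [(c.name, score(query, c.full_vec)) for c in shortlist]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


def match(query_coarse: Vector, query_full: Vector,
          pool: List[Candidate], top_k: int = 2) -> List[Tuple[str, float]]:
    """Coarse screening followed by fine ranking."""
    return fine_rank(query_full, coarse_screen(query_coarse, pool, top_k), dot)


# Toy data: three objects to be screened, one given object.
pool = [
    Candidate("P1", [1.0, 0.0], [0.2, 0.9]),
    Candidate("P2", [0.9, 0.1], [0.8, 0.1]),
    Candidate("P3", [0.0, 1.0], [1.0, 1.0]),
]
result = match([1.0, 0.0], [1.0, 0.0], pool, top_k=2)
```

In the toy data, P3 would score highest under the fuller characterization but never reaches the second stage, which illustrates that the quality of the final ranking depends on the coarse screening keeping the right candidates.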
In a second aspect, the present application provides a method for matching a drug molecule with target proteins, including: in response to a drug multi-target protein prediction instruction, obtaining structural information of a given drug molecule, and structural information, protein family group information, and protein function description information of the target proteins to be screened; determining characterization information of the drug molecule and characterization information of the target proteins, where the characterization information of the drug molecule includes a structural characterization of the drug molecule, and the characterization information of a target protein includes a structural characterization, a protein family group characterization, and a protein function description characterization of the target protein; predicting correlation information between the target proteins and the drug molecule according to the structural characterization of the drug molecule and the structural characterization and protein family group characterization of the target proteins, and selecting candidate target proteins from the target proteins to be screened according to the correlation information; and ranking the candidate target proteins according to the characterization information of the drug molecule and the characterization information of the target proteins, and outputting, according to the ranking result, a sequence of multiple target proteins that match the drug molecule.
In a third aspect, the present application provides a method for matching drug molecules with a target protein, including: in response to an instruction to match drug molecules applicable to a target protein, obtaining structural information, protein family group information, and protein function description information of a given target protein, and structural information of the drug molecules to be screened; determining characterization information of the target protein and characterization information of the drug molecules, where the characterization information of the target protein includes a structural characterization, a protein family group characterization, and a protein function description characterization of the target protein, and the characterization information of a drug molecule includes a structural characterization of the drug molecule; predicting correlation information between the target protein and the drug molecules according to the structural characterization of the drug molecules and the structural characterization and protein family group characterization of the target protein, and selecting candidate drug molecules from the drug molecules to be screened according to the correlation information; and ranking the candidate drug molecules according to the characterization information of the drug molecules and the characterization information of the target protein, and outputting, according to the ranking result, a sequence of multiple drug molecules that match the target protein.
In a fourth aspect, the present application provides a method for matching a drug molecule with a target protein, applied to an end-side device, including:
in response to a request to match a drug molecule with a target protein, obtaining information on a given object to be matched and on objects to be screened, where one of the given object and the objects to be screened is a drug molecule and the other is a target protein; sending the information on the given object and the objects to be screened to a server; receiving a matching result between the drug molecule and the target protein sent by the server, where the matching result is determined by the server, according to the information on the given object and the objects to be screened, by the method of any of the above aspects; and outputting the matching result between the drug molecule and the target protein.
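The exchange between the end-side device and the server can be sketched as below. The request fields, the `send` callable, and `fake_server` are all hypothetical: the text does not define a wire format or transport, and the placeholder server here only sorts identifiers so that the example runs without the actual matching method.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MatchRequest:
    given_kind: str       # "drug" or "target"; the objects to be screened are the other kind
    given: str            # identifier of the given object (hypothetical field)
    to_screen: List[str]  # identifiers of the objects to be screened (hypothetical field)


def validate(req: MatchRequest) -> None:
    """Reject requests the device should not forward to the server."""
    if req.given_kind not in ("drug", "target"):
        raise ValueError("given_kind must be 'drug' or 'target'")
    if not req.to_screen:
        raise ValueError("to_screen must not be empty")


def client_match(req: MatchRequest, send: Callable[[Dict], Dict]) -> List[str]:
    """Send the request, receive the matching result, and return it for output."""
    validate(req)
    payload = {"given_kind": req.given_kind, "given": req.given, "to_screen": req.to_screen}
    response = send(payload)
    return response["matches"]


def fake_server(payload: Dict) -> Dict:
    """Placeholder only: a real server would run the matching method on the payload."""
    return {"matches": sorted(payload["to_screen"])}


matches = client_match(MatchRequest("drug", "D1", ["T2", "T1"]), fake_server)
```

Passing the transport in as `send` keeps the device-side logic independent of how the server is reached, which is a design choice of this sketch rather than something the text prescribes.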
In a fifth aspect, the present application provides a server, including: a processor, and a memory communicatively connected to the processor; the memory stores computer-executable instructions; and the processor executes the computer-executable instructions stored in the memory to implement the method described above.
With the method and server for matching drug molecules with target proteins provided in this application, structural information of the drug molecule to be matched and structural information, protein family group information, and protein function description information of the target protein are obtained; characterization information of the drug molecule and of the target protein is determined, where the former includes a structural characterization of the drug molecule and the latter includes a structural characterization, a protein family group characterization, and a protein function description characterization of the target protein; correlation information between the target protein and the drug molecule is predicted according to the structural characterization of the drug molecule and the structural characterization and protein family group characterization of the target protein, and candidate objects are selected from the objects to be screened according to the correlation information, so that multiple candidate objects highly correlated with the given object can be quickly coarse-screened from a large number of objects to be screened. Further, the candidate objects are ranked according to the characterization information of the drug molecule and of the target protein, and a matching result between the drug molecule and the target protein is output according to the ranking result. By using the characterizations of multimodal feature information such as molecular structure, protein family group, and protein function description to fine-rank the candidates obtained by coarse screening, the objects to be screened that match the given object can be identified automatically and quickly, which achieves the matching of drug molecules with target proteins and improves its efficiency.
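The efficiency argument can be made concrete with a simple cost model. The symbols and numbers below are illustrative assumptions, not values from the application: $N$ is the number of objects to be screened, $k$ the number of candidates kept by coarse screening, $c_s$ the cost of scoring one object in coarse screening, and $c_r$ the cost of scoring one object in fine ranking. The model also assumes that one-by-one matching costs about $c_r$ per pair.

```latex
C_{\text{one-by-one}} = N\,c_r,
\qquad
C_{\text{two-stage}} = N\,c_s + k\,c_r .

% The two-stage scheme is cheaper whenever
N\,c_s + k\,c_r < N\,c_r
\;\Longleftrightarrow\;
\frac{k}{N} < 1 - \frac{c_s}{c_r} .

% Illustrative numbers: N = 10\,000,\; k = 100,\; c_s = 0.01\,c_r
C_{\text{two-stage}} = 10\,000 \cdot 0.01\,c_r + 100\,c_r = 200\,c_r,
\qquad
\frac{C_{\text{one-by-one}}}{C_{\text{two-stage}}} = \frac{10\,000\,c_r}{200\,c_r} = 50 .
```

Under these assumed numbers the two-stage scheme does about one fiftieth of the work; the actual gain depends on how cheap the coarse score is relative to the ranking model and on how small a shortlist can be kept without losing true matches.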
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application.
Fig. 1 is a diagram of an exemplary system architecture to which the present application is applicable;
Fig. 2 is a flowchart of a method for matching a drug molecule with a target protein provided by an exemplary embodiment of the present application;
Fig. 3 is a framework diagram of the process of matching a drug molecule with a target protein provided by an exemplary embodiment of the present application;
Fig. 4 is a flowchart of a method for matching a drug molecule with a target protein provided by another exemplary embodiment of the present application;
Fig. 5 is a flowchart of a method for training a molecule-target relationship prediction model provided by an exemplary embodiment of the present application;
Fig. 6 is a flowchart of a method for training a sequence model provided by an exemplary embodiment of the present application;
Fig. 7 is a flowchart of a method for matching a drug molecule with a target protein provided by another exemplary embodiment of the present application;
Fig. 8 is a flowchart of a method for matching a drug molecule with a target protein provided by another exemplary embodiment of the present application;
Fig. 9 is a flowchart of a method for matching a drug molecule with a target protein provided by another exemplary embodiment of the present application;
Fig. 10 is a schematic structural diagram of a drug multi-target protein prediction apparatus provided by an exemplary embodiment of the present application;
Fig. 11 is a schematic structural diagram of a server provided by an exemplary embodiment of the present application.
The above drawings show specific embodiments of the present application, which are described in more detail below. The drawings and the written description are not intended to limit the scope of the concept of the application in any way, but to explain the concept of the application to those skilled in the art by reference to specific embodiments.
Detailed Description
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the application as detailed in the appended claims.
First, the terms used in this application are explained:
Target protein: a drug target, also called a target protein or protein target. It generally refers to the proteins to which drug molecules bind directly, such as enzymes, ion channels, and receptors, or to other biomolecules (such as deoxyribonucleic acid (DNA), ribonucleic acid (RNA), heparin, and peptides). Most drug targets are proteins; chemical substances used to treat or diagnose disease interact selectively with the target protein, altering its biological pathway or function.
Drug molecule: usually refers to a small-molecule drug.
Multimodality: collaborative reasoning over data of multiple heterogeneous modalities. In the field of artificial intelligence, multimodal information usually refers to perceptual information in heterogeneous modalities such as images, text, and speech. In the embodiments of this application, the multimodal information includes the structural information of drug molecules and target proteins (graph data), the protein function description information of target proteins (text), the protein family group information of target proteins (for example, a coding sequence), and triples composed of a drug molecule, a target protein, and protein family group information, among other heterogeneous modalities.
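The modalities listed in this definition can be pictured as plain data containers. The sketch below is an assumption for illustration only: the class and field names are hypothetical, the graph is reduced to node labels and index pairs, and the toy values are not real molecules or proteins.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StructureGraph:
    nodes: Tuple[str, ...]              # node labels, e.g. atoms or residues
    edges: Tuple[Tuple[int, int], ...]  # undirected edges as pairs of node indices


@dataclass(frozen=True)
class DrugMolecule:
    name: str
    structure: StructureGraph           # structural information (graph data)


@dataclass(frozen=True)
class TargetProtein:
    name: str
    structure: StructureGraph           # structural information (graph data)
    function_description: str           # protein function description (text)
    family_group: str                   # protein family group information, e.g. a code


# (drug molecule, target protein, protein family group)
Triple = Tuple[str, str, str]


def make_triple(drug: DrugMolecule, target: TargetProtein) -> Triple:
    """Build the triple named in the definition above."""
    return (drug.name, target.name, target.family_group)


# Toy values, purely illustrative.
drug = DrugMolecule("drug-A", StructureGraph(("C", "O"), ((0, 1),)))
target = TargetProtein(
    "protein-B",
    StructureGraph(("Ala", "Gly", "Ser"), ((0, 1), (1, 2))),
    "hypothetical description of the protein's function",
    "family-01",
)
triple = make_triple(drug, target)
```

Keeping each modality in its own field makes it explicit which inputs a given stage reads: structure and family group for coarse screening, and all fields for ranking.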
To address the problem that methods for matching drug molecules with target proteins are time-consuming and inefficient, this application provides a method for matching drug molecules with target proteins. Structural information of the drug molecule to be matched and structural information, protein family group information, and protein function description information of the target protein are obtained; characterization information of the drug molecule and of the target protein is determined, where the former includes a structural characterization of the drug molecule and the latter includes a structural characterization, a protein family group characterization, and a protein function description characterization of the target protein; correlation information between the target protein and the drug molecule is predicted according to the structural characterization of the drug molecule and the structural characterization and protein family group characterization of the target protein, and candidate objects are selected from the objects to be screened according to the correlation information, so that multiple candidate objects highly correlated with the given object can be quickly coarse-screened from a large number of objects to be screened. Further, the candidate objects are ranked according to the characterization information of the drug molecule and of the target protein, and a matching result between the drug molecule and the target protein is output according to the ranking result. By using the characterizations of multimodal feature information such as molecular structure, protein family group, and protein function description to fine-rank the candidates obtained by coarse screening, the objects to be screened that match the given object can be identified automatically and quickly, which achieves the matching of drug molecules with target proteins and greatly improves the efficiency of drug target protein screening.
In one possible application scenario, the method is used to predict a sequence of multiple target proteins for a given drug. In this case, the given object is a drug molecule, and the objects to be screened are known target proteins or a user-specified set of known target proteins. Specifically, in response to a drug multi-target protein prediction instruction, structural information of the given drug molecule and structural information, protein family group information, and protein function description information of the target proteins to be screened are obtained; characterization information of the drug molecule and of the target proteins is determined, where the former includes a structural characterization of the drug molecule and the latter includes a structural characterization, a protein family group characterization, and a protein function description characterization of the target protein; correlation information between the target proteins and the drug molecule is predicted according to the structural characterization of the drug molecule and the structural characterization and protein family group characterization of the target proteins, and candidate target proteins are selected from the target proteins to be screened according to the correlation information, so that multiple candidate target proteins highly correlated with the given drug molecule can be coarse-screened from a large number of known target proteins. Further, the candidate target proteins are ranked according to the characterization information of the drug molecule and of the candidate target proteins, and a sequence of multiple target proteins that match the drug molecule is output according to the ranking result. By using the characterizations of multimodal feature information such as molecular structure, protein family group, and protein function description to fine-rank the candidate target proteins obtained by coarse screening, a sequence of possible target proteins for the given drug molecule can be identified automatically and quickly. This greatly improves the efficiency of drug target protein screening and, by relying on artificial intelligence, saves substantial manpower and experimental resources, which can shorten the development cycle of multi-target drugs and advance new drug development. The target proteins in the output sequence can serve as a reference for the user as target proteins on which the given drug may act.
In another possible application scenario, the method is used to predict applicable drug molecules for a given target protein, so as to provide relevant medical personnel with a choice of targeted drugs. Here the given object is a target protein, and the objects to be screened are the known drug molecules or a user-specified set of drug molecules. Specifically, in response to an instruction to match applicable drug molecules to a target protein, the structural information, protein family group information, and protein function description information of the given target protein are obtained, together with the structural information of the drug molecules to be screened. Characterization information is then determined for the target protein and for the drug molecules: the characterization information of the target protein includes its structural characterization, protein family group characterization, and protein function description characterization, and the characterization information of a drug molecule includes its structural characterization. Based on the structural characterization of the drug molecules and the structural characterization and protein family group characterization of the target protein, correlation information between the target protein and each drug molecule is predicted, and candidate drug molecules are selected from the drug molecules to be screened according to that correlation information. This coarse screening picks out, from a large number of drug molecules, multiple candidates that are highly correlated with the given target protein. Further, the candidate drug molecules are ranked according to the characterization information of the candidate drug molecules and of the target protein, and a multi-drug molecule sequence (an ordered list of drug molecules) matching the target protein is output according to the ranking result. Because this fine ranking draws on characterizations of multimodal feature information such as molecular structure, protein family group, and protein function description, effective drug molecules for a given target protein can be screened automatically and quickly, which greatly improves the efficiency of screening drug molecules for a target protein. The drug molecules in the output multi-drug molecule sequence serve as a reference for the user as drugs that may be effective against the given target protein.
FIG. 1 is an exemplary system architecture diagram to which this application applies. As shown in FIG. 1, the system architecture may include a server and an end-side device. The server may be a server cluster deployed in the cloud or a locally deployed computing device. The end-side device may be a hardware device with network communication, computing, and information display capabilities, including but not limited to physical devices such as smartphones, tablet computers, desktop computers, and Internet of Things devices. It may also be an application (APP), a software product, a database, or any other hardware/software device that can act as a user-side client and interact with the server.
The server stores a trained molecule-target relationship prediction model and a trained sequence model. The server can obtain the structural information of the drug molecules to be matched and the structural information, protein family group information, and protein function description information of the target proteins. It characterizes each of these modalities of feature information separately to determine the structural characterization of the drug molecule and the structural characterization, protein family group characterization, and protein function description characterization of the target protein. Using the molecule-target relationship prediction model, the server can predict the correlation information between a target protein and a drug molecule from the structural characterization of the drug molecule and the structural characterization and protein family group characterization of the target protein. According to the correlation information, it coarsely screens the objects to be screened and keeps multiple candidate objects related to the given object, which greatly reduces the number of objects that remain to be ranked. Further, the server can rank the candidate objects precisely according to the structural characterization of the drug molecule and the structural characterization, protein family group characterization, and protein function description characterization of the target protein, and output the matching result between drug molecules and target proteins according to the ranking result.
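The coarse-then-fine flow described above can be sketched as follows. This is only an illustrative sketch: the function names, the `Features` layout, and the toy dot-product scorers are assumptions standing in for the trained molecule-target relationship prediction model and the ranking stage, whose internals are not specified here.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical layout: one vector per modality, e.g. "structure", "family", "function".
Features = Dict[str, Vector]
Scorer = Callable[[Features, Features], float]


def two_stage_match(
    given: Features,
    pool: Dict[str, Features],
    coarse_score: Scorer,
    fine_score: Scorer,
    keep: int,
    top_n: int,
) -> List[Tuple[str, float]]:
    """Coarse-screen `pool` down to `keep` candidates, then fine-rank them."""
    # Stage 1: score every object to be screened and keep the best `keep`.
    by_coarse = sorted(pool, key=lambda name: coarse_score(given, pool[name]), reverse=True)
    candidates = by_coarse[:keep]
    # Stage 2: re-score only the surviving candidates and sort them.
    ranked = sorted(
        ((name, fine_score(given, pool[name])) for name in candidates),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:top_n]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


# Toy stand-ins for the two trained models (assumptions, not the real models).
def toy_coarse(drug: Features, target: Features) -> float:
    return dot(drug["structure"], target["structure"])


def toy_fine(drug: Features, target: Features) -> float:
    return toy_coarse(drug, target) + dot(drug["structure"], target["function"])
```

With this shape, an object that is dropped in the coarse stage never reaches the fine stage, even if the fine scorer would have rated it highly; that is the trade-off a two-stage design accepts in exchange for scoring far fewer objects with the more expensive stage.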
The server can also pre-store information about the objects to be screened, including at least one of the following: the structural information, protein family group information, and protein function description information of known target proteins; and the structural information of known drug molecules. For example, the server may store a pre-built molecule-target relationship knowledge graph, which contains the interaction relationships between known drug molecules and known target proteins as well as the protein family group information of the known target proteins. Based on this knowledge graph, the triple information (drug molecule, target protein, protein family group) can also be characterized to determine a triple characterization, yielding characterization information for the multimodal feature information.
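As a rough illustration of the triple structure mentioned above, the sketch below stores interactions and family membership and enumerates (drug molecule, target protein, protein family group) triples. The class name, method names, and placeholder identifiers are hypothetical; the actual graph storage and the way triples are characterized are not described in this passage.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class Triple(NamedTuple):
    drug: str
    target: str
    family: str


class MoleculeTargetGraph:
    """Minimal in-memory store of drug-target interactions and target families."""

    def __init__(self) -> None:
        self._targets_of: Dict[str, Set[str]] = defaultdict(set)
        self._family_of: Dict[str, str] = {}

    def add_interaction(self, drug: str, target: str) -> None:
        self._targets_of[drug].add(target)

    def set_family(self, target: str, family: str) -> None:
        self._family_of[target] = family

    def triples(self) -> List[Triple]:
        """Return all triples whose target has a known family, in sorted order."""
        out: List[Triple] = []
        for drug in sorted(self._targets_of):
            for target in sorted(self._targets_of[drug]):
                family = self._family_of.get(target)
                if family is not None:
                    out.append(Triple(drug, target, family))
        return out
```

Targets without a recorded family are skipped here because a triple needs all three elements; a different design could emit them with a placeholder family instead.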
In an exemplary usage scenario, the user submits information about the given object to the server through the end-side device, for example the structural information of a given drug molecule, or the structural information, protein family group information, and protein function description information of a given target protein. The server receives the information about the given object. The server also obtains the information about the objects to be screened, such as the structural information, protein family group information, and protein function description information of the target proteins to be screened, or the structural information of the drug molecules to be screened. It then characterizes each modality of feature information separately (the structural information of the drug molecule, and the structural information, protein family group information, and protein function description information of the target protein) to determine the structural characterization of the drug molecule and the structural characterization, protein family group characterization, and protein function description characterization of the target protein. The server predicts the correlation information between target protein and drug molecule from the structural characterization of the drug molecule and the structural characterization and protein family group characterization of the target protein, and, according to the correlation information, coarsely screens the objects to be screened to obtain multiple candidate objects related to the given object, which greatly reduces the number of objects to be screened. The server then ranks the candidate objects precisely according to the structural characterization of the drug molecule and the structural characterization, protein family group characterization, and protein function description characterization of the target protein, and outputs the objects that match the given object according to the ranking result, giving the matching result between drug molecule and target protein. The server can also output this matching result to the end-side device.
FIG. 1 illustrates the system architecture using the prediction of a multi-target protein sequence for a given drug as an example. As shown in FIG. 1, the user submits a drug multi-target protein prediction instruction to the server through the end-side device, together with the structural information of the given drug molecule, such as its structure diagram. The server receives the instruction and the structural information of the given drug molecule. The server also obtains the structural information, protein family group information, and protein function description information of the known target proteins (the objects to be screened), and characterizes each modality of feature information separately to determine the structural characterization of the drug molecule and the structural characterization, protein family group characterization, and protein function description characterization of each target protein. Based on the structural characterization of the drug molecule and the structural characterization and protein family group characterization of the target proteins, the server predicts the correlation information between each target protein and the drug molecule and, according to the correlation information, coarsely screens the known target proteins to obtain multiple candidate target proteins related to the drug molecule, which greatly reduces the number of candidates. The server then ranks the candidate target proteins precisely according to the structural characterization of the drug molecule and the structural characterization, protein family group characterization, and protein function description characterization of the candidate target proteins, and outputs the multi-target protein sequence matching the drug molecule according to the ranking result. Fine-ranking the coarsely screened candidates on characterizations of multimodal feature information such as molecular structure, protein family group information, and protein function description improves both the efficiency and the accuracy of matching drug molecules to target proteins. The server can also output the multi-target protein sequence matching the drug molecule to the end-side device. Optionally, the objects to be screened may be a set specified by the user. Screening for matches to the given object only within that user-specified set narrows the screening range and effectively improves matching efficiency. For example, the user may provide a drug molecule and a set of target proteins to be screened, or a target protein and a set of drug molecules to be screened.
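The optional user-specified screening set can be pictured as a filter applied to the server's known objects before any scoring takes place. The sketch below is an assumption about how such a filter might look; `select_pool` and its parameters are hypothetical names, and silently ignoring unknown identifiers is a design choice made here for brevity.

```python
from typing import Dict, Iterable, Optional


def select_pool(
    known: Dict[str, dict],
    requested_ids: Optional[Iterable[str]] = None,
) -> Dict[str, dict]:
    """Return the objects to be screened.

    With no user-specified set, every known object is screened.
    Otherwise only the requested objects that the server knows are kept.
    """
    if requested_ids is None:
        return dict(known)
    wanted = set(requested_ids)
    return {obj_id: feats for obj_id, feats in known.items() if obj_id in wanted}
```

A stricter variant could raise an error for identifiers the server does not recognize instead of dropping them.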
The multi-target protein sequence predicted for a drug molecule by the method of this application is generated by the server, which automatically analyzes the structural information of the drug molecule and the information of the known target proteins, predicts the likelihood that the drug molecule interacts with each known target protein, and ranks the target proteins by that likelihood. A multi-target protein sequence can thus be predicted for a drug molecule with drug potential. The sequence contains multiple candidate target proteins, and the earlier a candidate appears in the sequence, the more likely it is to interact with the drug molecule. Target proteins in the generated sequence can be selected into a kinase map, from which drug designers pick the target proteins that fit the drug design concept before moving on to the subsequent biological activity testing stage. Drug developers do not need to screen targets manually, which can greatly shorten the multi-target drug development cycle and advance new drug development.
The technical solution of this application, and how it solves the technical problems described above, is explained in detail below through specific embodiments. These embodiments may be combined with one another, and identical or similar concepts or processes may not be repeated in every embodiment. The embodiments are described below with reference to the accompanying drawings.
FIG. 2 is a flowchart of a method for matching drug molecules and target proteins according to an exemplary embodiment of this application. The method of this embodiment is executed by the server described above. As shown in FIG. 2, the method includes the following steps:
Step S201: Obtain the structural information of the drug molecule to be matched, and the structural information, protein family group information, and protein function description information of the target protein.
The structural information of the drug molecule may be its structure diagram and constitutes the feature information of the drug molecule. The structural information of the target protein may be its 3D structure diagram. The protein family group information indicates the protein family (that is, the protein family group) to which the target protein belongs, and may be expressed as a discrete value. The protein function description information consists of text describing the function of the target protein, its associated diseases, its associated genes, and so on. Together, the structural information, protein family group information, and protein function description information constitute the feature information of the target protein.
In this embodiment, one of the drug molecule and the target protein is the given object and the other is the object to be screened. The feature information of the given object may be sent to the server by the user through the end-side device. The objects to be screened are usually numerous (for example, all known target proteins, all known drug molecules, or a particular class of drug molecules), so the server can store their feature information in local storage in advance and read it from there when needed.
Optionally, the feature information of both the given object and the objects to be screened may be sent to the server by the user through the end-side device. Alternatively, the feature information of all possible given objects and objects to be screened may be stored on the server. In that case, the user submits only the identifier of the given object and the identifiers of the objects to be screened through the end-side device, and the server reads the corresponding feature information from local storage based on those identifiers.
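The two ways of supplying feature information (sending it inline, or sending only an identifier that the server resolves from local storage) could be handled by a small resolver such as the one sketched below. The function name, parameter names, and error handling are assumptions for illustration only.

```python
from typing import Dict, Optional


def resolve_features(
    store: Dict[str, dict],
    object_id: Optional[str] = None,
    inline_features: Optional[dict] = None,
) -> dict:
    """Return feature information for one object.

    Inline features sent by the end-side device take precedence;
    otherwise the identifier is looked up in the server-side store.
    """
    if inline_features is not None:
        return inline_features
    if object_id is None:
        raise ValueError("either object_id or inline_features is required")
    if object_id not in store:
        raise KeyError(f"no stored feature information for {object_id!r}")
    return store[object_id]
```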
Step S202: Determine the characterization information of the drug molecule and of the target protein. The characterization information of the drug molecule includes its structural characterization; the characterization information of the target protein includes its structural characterization, protein family group characterization, and protein function description characterization.
In this embodiment, the structural information of the drug molecule and of the target protein is usually stored as graph data, the protein family group information of the target protein is usually expressed as a coded sequence, and the protein function description information of the target protein is text, so the feature information spans several different modalities. After the multimodal feature information has been obtained, a different characterization method can be applied to each modality to obtain the structural characterization of the drug molecule and the structural characterization, protein family group characterization, and protein function description characterization of the target protein.
For example, for the structural characterization of the drug molecule, the structural information is usually a graph, and a small-molecule pre-trained model, a drug-molecule pre-trained model, or the like can be used to characterize (embed) the input structural information to obtain the structural characterization. The small-molecule or drug-molecule pre-trained model may be any pre-trained model that characterizes the structure diagram of a small molecule as a characterization vector.
For example, the structural characterization of the target protein can be obtained by applying a pre-trained Transformer-based protein structure characterization model to the structure diagram of the target protein. A pre-trained Transformer-based GPT model, such as GPT or GPT2.0, may be used. The Transformer is a neural network model that contains an encoding module and a decoding module and uses a self-attention mechanism. GPT (Generative Pre-Training) is a generative pre-trained model; GPT and GPT2.0 are different versions of it.
For example, for the protein function description characterization, the protein function description information of the target protein is text, so a pre-trained text characterization model can be used to characterize it. The BERT model or another pre-trained text characterization, text representation, or text encoding model may be used; details are omitted here. In addition, the protein family group information of the target protein may itself be text, in which case a pre-trained text characterization model can likewise be used to characterize it as the protein family group characterization; no specific limitation is imposed here. BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based bidirectional encoder that pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context, and is a neural network model for characterizing text.
For example, the protein family group information of the target protein indicates the protein family to which the target protein belongs and may be a discrete value or a coded sequence. The protein family group characterization can be obtained by directly mapping the protein family group information to a corresponding vector according to a preset mapping rule, or by any existing method for mapping discrete values to characterization vectors; details are omitted here.
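One simple instance of a "preset mapping rule" from a discrete family identifier to a vector is one-hot encoding, sketched below. One-hot encoding is chosen here purely as an illustration; the text does not state which mapping is used, and a learned embedding table would fit the description equally well. The function names are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List


def build_family_mapping(family_ids: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Assign each distinct family identifier a fixed vector position."""
    ordered = sorted(set(family_ids))
    return {family_id: index for index, family_id in enumerate(ordered)}


def family_to_vector(family_id: Hashable, mapping: Dict[Hashable, int]) -> List[float]:
    """Map one family identifier to its one-hot characterization vector."""
    vector = [0.0] * len(mapping)
    vector[mapping[family_id]] = 1.0
    return vector
```

Because the mapping is built once and reused, the same family always maps to the same vector, which is the property a preset rule needs.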
In this embodiment, characterization information is obtained by characterizing feature information of several different modalities: the structural information of the drug molecule, and the structural information, protein family group information, and protein function description information of the target protein. Predicting the correlation between drug molecules and target proteins from the characterizations of this multimodal feature information can improve the accuracy of the predicted correlation information.
Step S203: Based on the structural characterization of the drug molecule and the structural characterization and protein family group characterization of the target protein, predict the correlation information between the target protein and the drug molecule, and select candidate objects from the objects to be screened according to the correlation information. The objects to be screened are target proteins or drug molecules.
In this step, the correlation information between the target protein and the drug molecule is predicted from the structural characterization of the drug molecule and the structural characterization and protein family group characterization of the target protein. The prediction thus combines characterizations of the multimodal feature information of the target protein, is performed automatically, and provides the data basis for matching drug molecules to target proteins.
The correlation information between a target protein and a drug molecule indicates how likely it is that the two interact. The higher the correlation, the more likely the target protein and the drug molecule are to interact; the lower the correlation, the less likely.
Further, the objects to be screened can be coarsely ranked according to the correlation information between target protein and drug molecule. Based on the coarse ranking, multiple objects with relatively high correlation to the given object are selected as candidate objects. This coarse screening greatly reduces the number of objects that remain to be screened.
When this embodiment is applied to different application scenarios, what the objects to be screened refer to may differ, and what the given object refers to may differ as well.
For example, when predicting a multi-target protein sequence for a given drug, the given object is the given drug molecule, the objects to be screened are the target proteins to be screened, and the candidate objects are candidate target proteins. In this step, the target proteins to be screened are coarsely ranked according to their correlation information with the given drug molecule, and multiple target proteins with relatively high correlation are selected as candidate target proteins. This coarse screening greatly reduces the number of target proteins to be screened: there are typically tens of thousands of known target proteins, and coarse screening can reduce the candidates to a few hundred.
Similarly, when predicting applicable drug molecules for a given target protein, the given object is the given target protein, the objects to be screened are the drug molecules to be screened, and the candidate objects are candidate drug molecules. In this step, the drug molecules to be screened are coarsely ranked according to their correlation information with the given target protein, and multiple drug molecules with relatively high correlation are selected as candidate drug molecules. This coarse screening greatly reduces the number of drug molecules to be screened.
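Once a correlation score is available for every object to be screened, coarse screening amounts to keeping the highest-scoring objects. The sketch below assumes the scores are already computed (by whatever model produces the correlation information) and uses the standard-library `heapq.nlargest`, which avoids fully sorting a pool of tens of thousands of entries when only a few hundred are kept. The function name and the optional `min_score` cutoff are assumptions.

```python
import heapq
from typing import Dict, List, Optional


def coarse_screen(
    correlation: Dict[str, float],
    keep: int,
    min_score: Optional[float] = None,
) -> List[str]:
    """Return up to `keep` object identifiers, highest correlation first.

    `correlation` maps each object to be screened to its predicted
    correlation with the given object. `min_score`, if set, drops
    objects below that score before selection.
    """
    if min_score is not None:
        correlation = {k: v for k, v in correlation.items() if v >= min_score}
    return heapq.nlargest(keep, correlation, key=correlation.get)
```

The same function serves both scenarios: the keys are target protein identifiers when the given object is a drug molecule, and drug molecule identifiers when the given object is a target protein.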
Step S204: Rank the candidate objects according to the characterization information of the drug molecule and of the target protein, and output the matching result between drug molecule and target protein according to the ranking result.
In this step, for the multiple candidate objects determined by coarse screening, the likelihood that each candidate interacts with the given object is analyzed and predicted from the structural characterization, protein family group characterization, and protein function description characterization of the target protein and the structural characterization of the drug molecule. The candidates are then ordered by that likelihood to produce a fine ranking, which ranks the candidate objects precisely. The top-ranked candidates are selected from the fine ranking to obtain the matching result between drug molecule and target protein. The matching result may be a sequence of multiple candidate objects in which the earlier a candidate appears, the more likely it is to interact with the given object.

For example, when predicting a multi-target protein sequence for a given drug, the likelihood that the drug molecule interacts with each coarsely screened candidate target protein is analyzed and predicted from the structural characterization, protein family group characterization, and protein function description characterization of the candidate target proteins and the structural characterization of the drug molecule. The candidates are ordered by that likelihood to produce a fine ranking, and the top-ranked candidate target proteins form the multi-target protein sequence of the drug molecule. The predicted sequence contains multiple candidate target proteins, and the earlier a candidate appears in the sequence, the more likely it is to interact with the drug molecule. Target proteins in the generated sequence can be selected, and drug designers can pick those that fit the drug design concept before moving on to the subsequent biological activity testing stage. Drug developers do not need to screen target proteins manually, which can greatly shorten the multi-target drug development cycle and advance new drug development.
For example, when predicting applicable drug molecules for a given target protein, for the candidate drug molecules of the given target protein determined by coarse screening, the likelihood that each candidate drug molecule interacts with the given target protein is predicted from the structural characterization, protein family group characterization, and protein function description characterization of the given target protein, together with the structural characterization of the candidate drug molecule. The candidate drug molecules are then sorted by this likelihood to produce a fine sorting result, which gives a precise ordering of the candidate drug molecules. The top-ranked candidate drug molecules are selected from the fine sorting result to form a multi-drug molecule sequence. The predicted sequence contains multiple candidate drug molecules, and a candidate nearer the front is more likely to interact with the given target protein. Drug molecules in the generated sequence can be picked out as recommended drugs for the given target protein, giving the relevant personnel accurate recommendations of usable drugs and improving their efficiency.
In this embodiment, the structural information of the drug molecule to be matched and the structural information, protein family group information, and protein function description information of the target protein are obtained, and the characterization information of the drug molecule and of the target protein is determined. The characterization information of the drug molecule includes the structural characterization of the drug molecule; the characterization information of the target protein includes the structural characterization, protein family group characterization, and protein function description characterization of the target protein. The correlation information between the target protein and the drug molecule is predicted from the structural characterization of the drug molecule and the structural characterization and protein family group characterization of the target protein, and candidate objects are screened out of the objects to be screened according to this correlation information. In this way, multiple candidate objects that are highly correlated with the given object can be coarsely screened quickly from a large number of objects to be screened. Further, the candidate objects are sorted according to the characterization information of the drug molecule and of the target protein, and the matching result between the drug molecule and the target protein is output according to the sorting result. Because the candidate objects obtained by coarse screening are finely sorted using characterization information from multiple modalities, such as molecular structure, protein family group, and protein function description, the objects that match the given object can be screened out automatically and quickly, which greatly improves the efficiency of matching drug molecules with target proteins.
Fig. 3 is a framework diagram of the process for matching drug molecules with target proteins provided by an exemplary embodiment of the present application, and Fig. 4 is a flowchart of a method for matching drug molecules with target proteins provided by another exemplary embodiment of the present application. Building on the method embodiment above, this embodiment describes the matching method in detail with reference to the process framework shown in Fig. 3. The content described in the embodiment above is included in this embodiment.
As shown in Fig. 4, the specific steps of the method of this embodiment are as follows:
Step S401: obtain the structural information of the drug molecule to be matched, and the structural information, protein family group information, and protein function description information of the target protein.
This step is implemented in the same way as step S201 above. For details, refer to the description of step S201, which is not repeated here.
Step S402: determine the characterization information of the drug molecule and the characterization information of the target protein. The characterization information of the drug molecule includes the structural characterization of the drug molecule, and the characterization information of the target protein includes the structural characterization, protein family group characterization, and protein function description characterization of the target protein.
The structural information of the drug molecule and of the target protein is usually stored as graph data, the protein family group information of the target protein is usually represented as a coded sequence, and the protein function description information of the target protein is text, so the inputs comprise feature information of several different modalities. After the structural information of the drug molecule and the structural information, protein family group information, and protein function description information of the target protein are obtained, a different characterization method can be applied to each modality to obtain the structural characterization of the drug molecule and the structural characterization, protein family group characterization, and protein function description characterization of the target protein.
Exemplarily, for the structural characterization of the drug molecule, the structural information of the drug molecule is usually a graph, and a small molecule pre-trained model, a drug molecule pre-trained model, or the like can be used to embed the input structural information and obtain the structural characterization of the drug molecule. The small molecule pre-trained model or drug molecule pre-trained model can be any pre-trained model that maps the structure graph of a small molecule to a characterization vector. Fig. 3 illustrates the case in which a small molecule pre-trained model maps the structure graph of the drug molecule to its structural characterization.
Exemplarily, as shown in Fig. 3, the structural characterization of the target protein can be obtained by applying a pre-trained Transformer-based protein structure characterization model to the structure graph of the target protein. For example, the protein structure characterization model can be a pre-trained Transformer-based GPT model, such as GPT or GPT2.0.
Exemplarily, as shown in Fig. 3, for the protein function description characterization of the target protein, the protein function description information is text, so a pre-trained text characterization model can be used to map it to the protein function description characterization. For example, a BERT model or another pre-trained text characterization, text representation, or text encoding model can be used, which is not described further in this embodiment.
Exemplarily, the protein family group information of the target protein indicates the protein family to which the target protein belongs and may be a discrete value or a coded sequence. As shown in Fig. 3, the protein family group characterization can be obtained by directly mapping the protein family group information to a corresponding vector according to a preset mapping rule, or by any existing technique for mapping discrete values to characterization vectors, which is not described further here. Alternatively, the protein family group information may be text, in which case a pre-trained text characterization model can be used to map it to the protein family group characterization. No specific limitation is imposed here.
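The preset mapping rule from a discrete protein family group to a vector can be pictured as a lookup table. The following is a minimal sketch, not the method of the application: the family names, the vector dimension, the random initialization, and the zero-vector fallback for unknown families are all assumptions made for illustration. In practice such a table would more likely be learned together with the downstream model.

```python
import random


def build_family_embeddings(family_ids, dim=8, seed=0):
    """Assign each distinct protein family identifier a fixed vector.

    The vectors are random here (hypothetical choice); a trained
    embedding table would normally take their place.
    """
    rng = random.Random(seed)
    return {
        fid: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for fid in sorted(set(family_ids))
    }


def embed_family(family_id, table, dim=8):
    """Look up the vector of a family; unknown families map to zeros."""
    return table.get(family_id, [0.0] * dim)


# Hypothetical family identifiers, used only for illustration.
table = build_family_embeddings(["kinase", "GPCR", "protease"])
vec = embed_family("kinase", table)
```

The same lookup works whether the family group is stored as a string, an integer code, or a tuple, as long as the identifier is hashable.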
Feature information of several modalities, namely the structural information of the drug molecule and the structural information, protein family group information, and protein function description information of the target protein, is characterized with different pre-trained models to obtain the structural characterization of the drug molecule and the structural characterization, protein family group characterization, and protein function description characterization of the target protein. Predicting the correlation between the drug molecule and the target protein from the characterization information of this multimodal feature information can improve the accuracy of the predicted correlation information.
Step S403: input the structural characterization of the drug molecule and the structural characterization and protein family group characterization of the target protein into the molecule-target relationship prediction model, and predict the correlation information between the target protein and the drug molecule with that model.
In this embodiment, a trained molecule-target relationship prediction model is obtained in advance. After the structural characterization of the drug molecule and the structural characterization, protein family group characterization, and protein function description characterization of the target protein are obtained, the trained model is used to mine and predict the correlation information between the target protein and the drug molecule.
Specifically, the structural characterization of the drug molecule and the structural characterization and protein family group characterization of the target protein are input into the trained molecule-target relationship prediction model, which predicts the correlation information between the target protein and the drug molecule.
The correlation information between the target protein and the drug molecule indicates how likely it is that the drug molecule and the target protein interact. The higher the correlation, the more likely the target protein and the drug molecule are to interact; the lower the correlation, the less likely they are to interact.
Step S404: screen candidate objects out of the objects to be screened according to the correlation information, where the objects to be screened are target proteins or drug molecules.
After the correlation information between the target protein and the drug molecule is predicted, the objects to be screened are coarsely sorted according to that information, and the objects with the highest correlation to the given object are selected from the coarse sorting result as candidate objects. This coarse screening greatly reduces the number of objects to be screened.
Exemplarily, in this step, the objects to be screened are sorted according to their correlation information with the given object to obtain a second sorting result, and a second preset number of objects to be screened are determined from the second sorting result as the candidate objects related to the given object.
The second preset number is a preset threshold on the number of candidate objects retained by coarse screening. It can be set and adjusted according to the needs of the actual application scenario, for example to 300, 400, or 500, and different application scenarios may use different values. The second preset number is usually larger than the first preset number, that is, the number of target proteins in the finally determined multi-target protein sequence of the drug molecule. Steps S403-S404 therefore coarsely sort and coarsely screen the objects to be screened, and determine the second preset number of candidate objects related to the given object.
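The coarse screening of steps S403-S404 amounts to a top-K selection over the predicted correlation values. The sketch below assumes the correlation information is available as a plain number per object; the function name, the dictionary layout, and the sample scores are hypothetical.

```python
def coarse_screen(correlation, second_preset_number):
    """Keep the objects with the highest correlation to the given object.

    `correlation` maps each object to be screened to its predicted
    correlation with the given object. The returned list is the
    truncated second sorting result, highest correlation first.
    """
    ranked = sorted(correlation.items(), key=lambda kv: kv[1], reverse=True)
    return [obj for obj, _ in ranked[:second_preset_number]]


# Hypothetical correlation values for four target proteins.
scores = {"P1": 0.91, "P2": 0.15, "P3": 0.67, "P4": 0.42}
candidates = coarse_screen(scores, second_preset_number=2)
```

With a real library of objects, the second preset number would take a value such as 300, 400, or 500, as described above.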
Further, in steps S405-S407 below, the candidate objects are precisely sorted according to the structural characterization of the drug molecule and the structural characterization, protein family group characterization, and protein function description characterization of the target protein, yielding a first sorting result. Then, in step S408, a sequence of candidate objects matching the given object is determined from the first sorting result. This sequence contains a first preset number of candidate objects.
Step S405: according to a pre-established molecule-target relationship knowledge graph, characterize the triplet containing the given object and a candidate object to obtain the triplet characterization of the given object and the candidate object. The knowledge graph contains the interaction relationships between known drug molecules and known target proteins, as well as the protein family group information of the known target proteins. A triplet consists of a drug molecule, a target protein, and the protein family group of the target protein.
In an optional embodiment, the structural information of known drug molecules and the structural information, protein family group information, and protein function description information of known target proteins are characterized separately to obtain characterization information for the multimodal feature information: the structural characterization of the known drug molecules, and the structural characterization, protein family group characterization, and protein function description characterization of the known target proteins. Based on the target protein sequences of known drug molecules, a large-scale data set of triplets (drug molecule, target protein, protein family group of the target protein) is constructed. The molecule-target relationship knowledge graph is then built from this triplet data set. It contains the information of the triplets in the data set, that is, the interaction relationships between known drug molecules and known target proteins, and the protein family group information of the known target proteins.
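As a rough sketch of how such a triplet data set could be assembled from known interactions, consider the code below. The data structures, the identifiers, and the per-drug index are assumptions for illustration; the application does not specify a storage format for the knowledge graph.

```python
from collections import defaultdict


def build_triplets(interactions, family_of):
    """Turn known (drug, target) pairs into (drug, target, family) triplets."""
    return [(drug, target, family_of[target]) for drug, target in interactions]


def index_by_drug(triplets):
    """Group triplets by drug so a drug's known targets can be looked up."""
    index = defaultdict(list)
    for drug, target, family in triplets:
        index[drug].append((target, family))
    return dict(index)


# Hypothetical known interactions and family assignments.
interactions = [("D1", "P1"), ("D1", "P2"), ("D2", "P2")]
family_of = {"P1": "kinase", "P2": "GPCR"}

triplets = build_triplets(interactions, family_of)
graph = index_by_drug(triplets)
```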
Optionally, the molecule-target relationship knowledge graph may also contain the structural information of the drug molecules and target proteins, the protein function description information of the target proteins, and the like. When matching drug molecules with target proteins, the stored information can be retrieved directly, which improves matching efficiency to some extent.
Optionally, the molecule-target relationship knowledge graph may also pre-store characterization data such as the structural characterization of known drug molecules and the structural characterization, protein family group characterization, and protein function description characterization of known target proteins. When matching drug molecules with target proteins, the stored characterization data can be retrieved directly, which improves matching efficiency to some extent. For example, when performing multi-target prediction for any drug molecule, the stored characterization information of the target proteins can be retrieved directly to make multi-target prediction more efficient.
Optionally, the structural information of the drug molecules and target proteins and the protein function description information of the target proteins may be left out of the molecule-target relationship knowledge graph and stored separately from it. Likewise, characterization data such as the structural characterization, protein family group characterization, and protein function description characterization of known target proteins may be stored separately from the knowledge graph.
In this embodiment, for each given object, the given object, each of its candidate objects, and the corresponding protein family group information are combined into a triplet. Each triplet is characterized according to the molecule-target relationship knowledge graph to obtain the triplet characterization of each given object sample and each candidate object sample.
Exemplarily, when predicting a multi-target protein sequence for a given drug, the TransH model can be used with the molecule-target relationship knowledge graph to characterize the triplet formed by the given drug molecule, each candidate target protein, and the protein family group of that candidate target protein, producing a triplet characterization for each candidate target protein. The triplet characterization is characterization information for yet another modality of feature information. It serves as additional characterization data for the fine sorting of candidate target proteins, which can make that sorting more accurate and therefore improve the accuracy of the multi-target protein sequence matched to the drug molecule. TransH is a knowledge-base representation learning model that can map the triplets in a knowledge graph to feature vectors.
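For readers unfamiliar with TransH, its scoring function projects the head and tail embeddings onto a relation-specific hyperplane and then measures how well a translation vector connects them. The sketch below implements that standard score for a generic (head, relation, tail) triple. How the triplet of drug molecule, target protein, and protein family group used here is mapped onto head, relation, and tail is not stated in the text, so the example vectors are purely illustrative.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project(v, unit_normal):
    """Project v onto the hyperplane whose unit normal is `unit_normal`."""
    s = dot(v, unit_normal)
    return [vi - s * wi for vi, wi in zip(v, unit_normal)]


def transh_score(head, tail, normal, translation):
    """TransH score: squared distance between the projected head plus the
    relation translation and the projected tail. Lower is more plausible."""
    norm = math.sqrt(dot(normal, normal))
    unit = [wi / norm for wi in normal]  # TransH constrains the normal to unit length
    h_proj = project(head, unit)
    t_proj = project(tail, unit)
    return sum((h + d - t) ** 2 for h, d, t in zip(h_proj, translation, t_proj))


# Illustrative 3-dimensional embeddings (not learned values).
score = transh_score(
    head=[1.0, 0.0, 0.0],
    tail=[1.0, 1.0, 0.0],
    normal=[0.0, 0.0, 2.0],
    translation=[0.0, 1.0, 0.0],
)
```

In a trained model the entity embeddings, hyperplane normals, and translation vectors are learned so that observed triples score lower than corrupted ones; the learned embeddings can then serve as feature vectors.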
Exemplarily, when predicting applicable drug molecules for a given target protein, the molecule-target relationship knowledge graph can be used to characterize the triplet formed by each candidate drug molecule, the given target protein, and the protein family group of the given target protein, producing a triplet characterization for each candidate drug molecule and the given target protein. This triplet characterization is likewise characterization information for yet another modality of feature information. It serves as additional characterization data for the fine sorting of candidate drug molecules, which can make that sorting more accurate and therefore improve the accuracy of the matching result between drug molecules and target proteins.
Step S406: input the characterization information of the given object, the characterization information of the candidate objects, and the triplet characterizations of the given object and the candidate objects into the sequence model, and fuse them through the encoding module of the sequence model to obtain the fused characterization of each candidate object.
In this step, the characterization information of the given object, the characterization information of the candidate objects, and the triplet characterizations of the given object and the candidate objects are input into the trained sequence model. The encoding module of the sequence model fuses them to obtain the fused characterization of each candidate object.
In another optional embodiment, only the characterization information of the given object and the characterization information of the candidate objects are input into the sequence model, and the encoding module of the sequence model fuses the characterization information of each candidate object with that of the given object to obtain the fused characterization of each candidate object. In this case, the training samples used to train the sequence model only need to contain the characterization information of the given object and of the objects to be screened, and need not contain triplet characterizations.
In this embodiment, the encoding module of the sequence model fuses the characterization information of the multimodal feature information, so that the fused characterization of each candidate object contains rich features. Based on the fused characterization, the likelihood of an interaction between each candidate object and the given object can be predicted accurately.
Exemplarily, the encoding module of the sequence model can be implemented with the encoding layers of a Transformer, a Long Short-Term Memory (LSTM) network, or the like. For example, as shown in Fig. 3, the encoding module can be a stack of 8 Transformer encoder layers (denoted "8×"), each containing a cross-attention layer and a Feed-Forward Network (FFN) layer.
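To make the role of the cross-attention layer concrete, the following sketch shows single-head scaled dot-product attention in which one vector (for example a candidate object's characterization) attends over a set of other vectors (for example the given object's characterizations). It is a deliberately reduced illustration: the learned query, key, and value projections, the multiple heads, the residual connections, the FFN layer, and the 8-layer stacking of the actual encoding module are all omitted, and the choice of which characterization acts as the query is an assumption.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of numbers."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns the attention-weighted average of `values`, where the weight
    of each value depends on the similarity between `query` and its key.
    """
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(logits)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Illustrative vectors: identical keys give uniform attention weights.
fused = cross_attention(
    query=[1.0, 0.0],
    keys=[[0.0, 0.0], [0.0, 0.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
)
```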
Step S407: sort the candidate objects according to their fused characterizations through the sorting module of the sequence model to obtain the first sorting result.
After the fused characterization of each candidate object is obtained, the sorting module of the sequence model sorts the candidate objects according to their fused characterizations to obtain the first sorting result. The order of the candidate objects in the first sorting result reflects how likely each is to interact with the given object: the nearer the front a candidate object is, the more likely it is to interact with the given object, and the nearer the back, the less likely.
The sorting module of the sequence model can be implemented with a listwise approach, for example ListNet, LambdaRank, or another similar ranking method. ListNet and LambdaRank are commonly used listwise ranking algorithms.
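As an illustration of the listwise idea, the sketch below computes the top-one ListNet loss, which is the cross-entropy between the softmax distribution of the ground-truth relevance values and that of the predicted scores over the whole candidate list. It shows only the loss and the final ordering step. The scoring network that maps a fused characterization to a score is omitted, and the sample values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_top_one_loss(true_relevance, predicted_scores):
    """Cross-entropy between the top-one probability distributions induced
    by the true relevance values and by the predicted scores."""
    p_true = softmax(true_relevance)
    p_pred = softmax(predicted_scores)
    return -sum(pt * math.log(pp) for pt, pp in zip(p_true, p_pred))


def rank_by_score(items, predicted_scores):
    """Order items by predicted score, highest first."""
    order = sorted(range(len(items)), key=lambda i: predicted_scores[i], reverse=True)
    return [items[i] for i in order]


# Hypothetical list of three candidates for one given object.
relevance = [2.0, 1.0, 0.0]
good = listnet_top_one_loss(relevance, [2.0, 1.0, 0.0])
bad = listnet_top_one_loss(relevance, [0.0, 1.0, 2.0])
first_sorting_result = rank_by_score(["P1", "P2", "P3"], [0.2, 1.5, 0.7])
```

Predicted scores that reproduce the true ordering give a lower loss than scores that reverse it, which is the signal a listwise method trains on.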
Step S408: according to the first sorting result, determine a first preset number of candidate objects as the target objects matching the given object, and obtain the matching result between the drug molecule and the target protein.
In this step, the first preset number of top-ranked candidate objects are taken from the first sorting result as the target objects matching the given object, and together they form the target object sequence. The target objects appear in this sequence in the same order as in the first sorting result.
The first preset number is a preset threshold on the number of target objects contained in the matching result. It can be set and adjusted by the user or relevant personnel according to the needs of the actual application scenario, for example to 20, 30, 40, or 50. The first preset number is usually smaller than the number of candidate objects in the coarse screening result (the second preset number). Steps S405-S408 therefore finely sort and finely screen the candidate objects, determine the first preset number of candidate objects that match the given object, and yield the multi-target object sequence matching the given object, which is the matching result between the drug molecule and the target protein.
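The relationship between the two preset numbers can be summarized as a two-stage pipeline: a coarse stage that keeps the second preset number of objects, followed by a fine stage that keeps the first preset number. The sketch below assumes both stages are available as scoring callables. The function names and the toy score tables are hypothetical stand-ins for the molecule-target relationship prediction model and the sequence model.

```python
def match(given, pool, coarse_score, fine_score, second_preset, first_preset):
    """Coarse screening followed by fine sorting and truncation.

    `coarse_score(given, obj)` and `fine_score(given, obj)` return numbers,
    where a higher value means a more likely interaction.
    """
    if first_preset > second_preset:
        raise ValueError("first preset number should not exceed the second")
    # Coarse stage: keep the objects with the highest correlation.
    candidates = sorted(pool, key=lambda o: coarse_score(given, o), reverse=True)
    candidates = candidates[:second_preset]
    # Fine stage: re-sort only the surviving candidates, then truncate.
    fine_sorted = sorted(candidates, key=lambda o: fine_score(given, o), reverse=True)
    return fine_sorted[:first_preset]


# Toy score tables standing in for the two models.
coarse = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.2, "E": 0.1}
fine = {"A": 0.1, "B": 0.9, "C": 0.5, "D": 0.99, "E": 0.0}

result = match(
    given="drug-1",
    pool=["A", "B", "C", "D", "E"],
    coarse_score=lambda g, o: coarse[o],
    fine_score=lambda g, o: fine[o],
    second_preset=3,
    first_preset=2,
)
```

Object "D" has the highest fine score in this toy data but never reaches the fine stage, which shows why the second preset number is chosen to be comfortably larger than the first.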
Step S409: Output the matching result between the drug molecule and the target protein.
After the matching result between the drug molecule and the target protein is determined, it can be visualized so that the multi-target-object sequence matching the given object, together with related information about the target objects (such as structural information and identification information), is displayed visually to provide the user with reference data.
For example, in the scenario of predicting a multi-target protein sequence for a given drug, the matching result is the multi-target protein sequence matching the given drug molecule. Information about each target protein in the sequence is output visually as reference data for drug developers, who can then, through manual interactive screening, select druggable molecules from a drug-design perspective and derive drug structures for the target proteins of a specific disease, for use in subsequent drug development.
In this embodiment, the structural information of the drug molecule and the structural information, protein family group information, and protein function description information of the target protein are obtained, and a different pre-trained model is used for the feature information of each modality to generate the representation information of the multimodal feature information: the structural representation of the drug molecule, and the structural representation, protein family group representation, and protein function description representation of the target protein. Based on this representation information, the trained molecular-target relationship prediction model predicts the correlation information between target proteins and drug molecules, and multiple objects to be screened that have higher correlation with the given object are selected as candidate objects. This achieves coarse screening and greatly reduces the number of objects to be screened. Further, using the established molecular-target relationship knowledge graph, the triplet that contains a candidate object and consists of the drug molecule, the target protein, and the protein family group of the target protein is represented, yielding the triplet representation of the candidate object, which is the representation information of yet another modality of feature information. Based on the representation information of the multimodal feature information, the trained sequence model fuses the representation information and triplet representation of each candidate object with the representation information of the given object to obtain a fused representation of each candidate object, and ranks the candidate objects precisely according to their fused representations. The sequence of objects matching the given object is determined from the precise ranking result. The possible matching objects for a given object can thus be screened quickly and accurately, which greatly improves the efficiency of matching drug molecules with target proteins.
Fig. 5 is a flowchart of a method for training the molecular-target relationship prediction model according to an exemplary embodiment of the present application. On the basis of any of the foregoing method embodiments, as shown in Fig. 5, the molecular-target relationship prediction model can be trained through the following steps:
Step S501: Obtain a pre-established molecular-target relationship knowledge graph, which contains known interaction relationships between drug molecules and target proteins as well as protein family group information of the target proteins.
In this embodiment, a large-scale dataset of (drug molecule, target protein, protein family group) triplets is constructed from the target protein sequences of known drug molecules. The molecular-target relationship knowledge graph is then built from this triplet dataset. The knowledge graph contains the information of the triplets in the dataset, that is, the known interaction relationships between drug molecules and known target proteins and the protein family group information of the known target proteins.
Exemplarily, in the molecular-target relationship knowledge graph, drug molecules and target proteins are nodes, and an edge between a drug molecule and a target protein indicates that the two have an interaction relationship. A protein family group can be a node in the graph or attribute information of the node corresponding to a target protein.
Optionally, the molecular-target relationship knowledge graph may also contain structural information of drug molecules and target proteins, protein function description information of target proteins, and the like.
Optionally, the molecular-target relationship knowledge graph may also pre-store representation information such as the structural representations of known drug molecules and the structural representations, protein family group representations, and protein function description representations of known target proteins. When multi-target prediction is performed for any drug molecule, the stored representation information of the target proteins can be retrieved directly, improving the efficiency of multi-target prediction.
For example, the structural representation of a drug molecule can be stored in the knowledge graph as an attribute of the drug molecule node, and the structural representation, protein family group representation, and protein function description representation of a known target protein can each be stored as an attribute of the corresponding target protein node.
Optionally, the structural information of drug molecules and target proteins and the protein function description information of target proteins may be excluded from the molecular-target relationship knowledge graph and stored separately. Likewise, representation data such as the structural representations, protein family group representations, and protein function description representations of known target proteins may be excluded from the knowledge graph and stored separately.
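One possible in-memory layout for such a graph is sketched below, assuming the variant in which the protein family group is an attribute of the target protein node. The class name `MoleculeTargetGraph`, its methods, and the node identifiers are hypothetical illustrations, not the storage format used by the application.

```python
from collections import defaultdict


class MoleculeTargetGraph:
    """Toy knowledge graph: drug and protein nodes, interaction edges, node attributes."""

    def __init__(self):
        self.targets_of = defaultdict(set)   # drug -> proteins it interacts with
        self.family_of = {}                  # protein -> protein family group
        self.attributes = defaultdict(dict)  # node -> optional stored attributes

    def add_protein(self, protein, family):
        """Register a target protein node together with its family group."""
        self.family_of[protein] = family

    def add_triplet(self, drug, protein, family):
        """Add a known (drug, protein, family) triplet as an interaction edge."""
        self.add_protein(protein, family)
        self.targets_of[drug].add(protein)

    def interacts(self, drug, protein):
        """True if an edge connects the drug and the protein."""
        return protein in self.targets_of.get(drug, set())

    def triplets(self):
        """List all known triplets in a deterministic order."""
        return [(d, p, self.family_of[p])
                for d in sorted(self.targets_of)
                for p in sorted(self.targets_of[d])]


graph = MoleculeTargetGraph()
graph.add_triplet("drug_A", "protein_B1", "family_C")
graph.add_protein("protein_B2", "family_C")
# Optionally cache a precomputed representation as a node attribute.
graph.attributes["drug_A"]["structure_representation"] = [0.1, 0.2]
```

Storing representations in `attributes` corresponds to the optional pre-storage described above; keeping them in a separate store would work equally well.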
Step S502: According to the molecular-target relationship knowledge graph, sample and generate a contrastive learning training set. Each sample in the training set is a triplet containing a drug molecule, a target protein corresponding to the drug molecule, and the protein family group information of the target protein; the training set contains positive samples and negative samples.
In this embodiment, the molecular-target relationship prediction model can be obtained by contrastive learning training of a deep learning model. The contrastive learning training set used for this training can be generated from the molecular-target relationship knowledge graph.
In this step, multiple triplets are sampled from the molecular-target relationship knowledge graph; each triplet contains a drug molecule, a target protein, and the protein family group information of the target protein. Each triplet serves as one sample, and together they form the contrastive learning training set, which contains positive and negative samples. A positive sample is a correct triplet, in which the drug molecule and the target protein have an interaction relationship. A negative sample is a nonexistent or incorrect triplet, in which the drug molecule and the target protein have no interaction relationship.
Exemplarily, a positive sample can be formed by sampling from the knowledge graph a drug molecule, a target protein that has an interaction relationship with it, and the protein family group information of that target protein. A negative sample can be formed by sampling a drug molecule, a target protein that has no interaction relationship with it, and the protein family group information of that target protein. The positive and negative samples together form the contrastive learning training set.
Optionally, the negative samples (a drug molecule, a target protein that has no interaction relationship with it, and the protein family group information of that target protein) can also be generated with the following enhanced sampling approach:
From the molecular-target relationship knowledge graph, sample a drug molecule, a first target protein that has an interaction relationship with the drug molecule, and the protein family group information of the first target protein; the drug molecule, the first target protein, and its protein family group information form a positive sample. Then, according to the protein family group information of the first target protein, sample from the knowledge graph a second target protein that belongs to the same group as the first target protein; the drug molecule, the second target protein, and the protein family group information of the second target protein form a negative sample.
For example, a drug molecule A can be randomly sampled from the knowledge graph, together with a target protein B1 that has an interaction relationship with drug molecule A (that is, an edge connects them) and the protein family group C of target protein B1, giving the positive sample (drug molecule A, target protein B1, protein family group C). Then, according to protein family group C, another target protein B2 is obtained that belongs to the same protein family group C and has no interaction relationship with drug molecule A in the knowledge graph (no edge connects them), giving the negative sample (drug molecule A, target protein B2, protein family group C).
This enhanced sampling strategy improves the quality of the negative samples, which in turn improves the effect of contrastive learning training and the prediction accuracy of the molecular-target relationship prediction model.
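The enhanced sampling of a same-family negative can be sketched as follows. The function name, the dictionary layout of the graph, and the identifiers are assumptions made for illustration; the sketch mirrors the A / B1 / B2 / C example, with B2 chosen from the same family as B1 among proteins that have no edge to A.

```python
import random


def sample_contrastive_pair(interactions, family_of, rng):
    """Return (positive, negative) triplets; negative is None if no same-family
    non-interacting protein exists for the sampled drug."""
    drug = rng.choice(sorted(interactions))
    first_target = rng.choice(sorted(interactions[drug]))
    family = family_of[first_target]
    positive = (drug, first_target, family)

    # Hard negatives: same protein family group, but no edge to the drug.
    hard_negatives = sorted(
        protein for protein, fam in family_of.items()
        if fam == family and protein not in interactions[drug]
    )
    if not hard_negatives:
        return positive, None
    return positive, (drug, rng.choice(hard_negatives), family)


# Hypothetical toy graph: drug_A interacts with protein_B1 only.
interactions = {"drug_A": {"protein_B1"}}
family_of = {
    "protein_B1": "family_C",
    "protein_B2": "family_C",
    "protein_B3": "family_D",
}
positive, negative = sample_contrastive_pair(interactions, family_of, random.Random(0))
```

Because the negative shares the family group of the positive, the model cannot separate the two from family information alone, which is one plausible reading of why this strategy raises negative-sample quality.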
Exemplarily, a drug molecule, a target protein that has an interaction relationship with it, and the protein family group information of that target protein can be sampled from the knowledge graph to form a positive sample. The positive sample is then augmented with a sample augmentation strategy to generate multiple negative samples.
Step S503: Perform contrastive learning training on the molecular-target relationship prediction model using the contrastive learning training set to obtain the trained molecular-target relationship prediction model.
Based on the generated contrastive learning training set, contrastive learning training is performed on the molecular-target relationship prediction model so that the correlation information it predicts between the drug molecule and the target protein tends toward a preset minimum (such as 0) for negative samples and toward a preset maximum (such as 1) for positive samples.
Exemplarily, for each sample in the contrastive learning training set, the representation information of its multimodal feature information is obtained, specifically the structural representation of the drug molecule, the structural representation of the target protein, and the protein family group representation. This representation information is input into the molecular-target relationship prediction model, which predicts the correlation information between the drug molecule and the target protein in the sample, and a contrastive loss function value is computed. Minimizing the contrastive loss drives the predicted correlation toward the preset minimum (such as 0) for negative samples and toward the preset maximum (such as 1) for positive samples.
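The application does not give the form of the contrastive loss. As a stand-in with the same target behavior (positives toward 1, negatives toward 0), the sketch below uses binary cross entropy over predicted correlations; the function name and the example values are hypothetical.

```python
import math


def binary_cross_entropy(predicted, labels, eps=1e-12):
    """Mean binary cross entropy; labels are 1 for positive and 0 for negative samples.
    Stand-in loss: the source does not specify the exact contrastive loss."""
    total = 0.0
    for p, y in zip(predicted, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# One positive triplet followed by two negative triplets (hypothetical predictions).
labels = [1, 0, 0]
well_separated = binary_cross_entropy([0.9, 0.1, 0.2], labels)
poorly_separated = binary_cross_entropy([0.5, 0.5, 0.6], labels)
```

The loss is lower when positive samples score near 1 and negative samples near 0, so minimizing it pushes predictions in the direction described above.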
In this embodiment, a molecular-target relationship knowledge graph is constructed in advance, a contrastive learning training set is sampled from it, and the module of the molecular-target relationship prediction model that predicts correlation information is trained by contrastive learning on that set, which improves the recall of the training result and yields the trained model. Given the structural representation of a drug molecule and the structural representation and protein family group representation of a target protein, the trained model can predict the correlation information between the drug molecule and the target protein fairly accurately. The objects to be screened are coarsely ranked by this correlation information, and multiple candidate objects with higher correlation to the given object are coarsely screened out from the ranking, which greatly reduces the number of objects to be screened and improves the efficiency of matching drug molecules with target proteins.
Fig. 6 is a flowchart of a method for training the sequence model according to an exemplary embodiment of the present application. On the basis of any of the foregoing method embodiments, as shown in Fig. 6, the sequence model can be trained through the following steps:
Step S600: From the molecular-target relationship knowledge graph, sample given-object samples and the to-be-screened object samples corresponding to each given-object sample.
In this embodiment, a large number of given objects can be randomly sampled from the pre-built molecular-target relationship knowledge graph as samples, referred to in this embodiment as given-object samples. For each given-object sample, a relatively large number of objects to be screened are sampled from the knowledge graph as the to-be-screened object samples corresponding to that given-object sample. The objects to be screened for different given objects may be the same or different. For example, all objects to be screened in the knowledge graph may be used as the to-be-screened object samples for every given-object sample.
Exemplarily, in the scenario of predicting a multi-target protein sequence for a given drug, drug molecule samples are sampled as the given drug molecules, and the target proteins in the knowledge graph serve as the to-be-screened object samples. In the scenario of predicting applicable drug molecules for a given target protein, target protein samples are sampled as the given target proteins, and the drug molecules in the knowledge graph serve as the to-be-screened object samples.
Step S601: Using the trained molecular-target relationship prediction model, predict the correlation information between the given-object sample and the to-be-screened object samples according to the representation information of the given-object sample and of the corresponding to-be-screened object samples, and, according to the correlation information, screen out from the to-be-screened object samples multiple candidate object samples related to the given-object sample.
In this step, for each given-object sample, the representation information of the given-object sample and of each of its corresponding to-be-screened object samples is input into the trained molecular-target relationship prediction model. The model predicts the correlation information between the given-object sample and each to-be-screened object sample, and, according to the correlation information, the to-be-screened object samples with higher correlation to the given-object sample are taken as its candidate object samples.
For example, in the scenario of predicting a multi-target protein sequence for a given drug, for each drug molecule sample, the structural representation of the drug molecule sample, together with the structural representations and protein family group representations of the known target proteins to be screened, is input into the trained molecular-target relationship prediction model. The model predicts the correlation information between the drug molecule sample and the known target proteins, from which the multiple candidate target protein samples of the drug molecule sample are determined.
For example, in the scenario of predicting applicable drug molecules for a given target protein, for each target protein sample, the structural representation and protein family group representation of the target protein sample, together with the structural representations of the drug molecules to be screened, are input into the trained molecular-target relationship prediction model. The model predicts the correlation information between the drug molecule samples to be screened and the target protein, from which the multiple candidate drug molecule samples of the target protein sample are determined.
The specific way of determining the multiple candidate object samples for each given-object sample in this step is similar to the way steps S403-S404 use the trained molecular-target relationship prediction model to predict multiple candidate objects for a given object; see the relevant description in the above embodiment, which is not repeated here.
Step S602: According to the molecular-target relationship knowledge graph, represent the triplets containing the given-object sample and a candidate object sample to obtain the triplet representation of the given-object sample with each candidate object sample. A triplet consists of a drug molecule, a target protein, and the protein family group of the target protein.
Specifically, for each given-object sample, the given-object sample, each of its candidate object samples, and the protein family group information are combined into a triplet. The triplet is represented according to the molecular-target relationship knowledge graph, yielding the triplet representation of each given-object sample with each of its candidate object samples.
For example, in the scenario of predicting a multi-target protein sequence for a given drug, for each drug molecule sample, the drug molecule sample, each of its candidate target protein samples, and the protein family group information of that candidate target protein sample are combined into a triplet. The triplet is represented based on the knowledge graph, yielding the triplet representation of the drug molecule sample with each candidate target protein sample.
For example, in the scenario of predicting applicable drug molecules for a given target protein, for each target protein sample, each of its candidate drug molecule samples, the target protein sample, and the protein family group information of the target protein sample are combined into a triplet. Each triplet is represented based on the knowledge graph, yielding the triplet representation of the target protein sample with each candidate drug molecule sample.
The processing in this step that determines the triplet representation of any given-object sample with each of its candidate object samples is implemented similarly to step S405 above; see the relevant content in the above embodiment, which is not repeated here.
Step S603: The representation information of any given-object sample, the representation information of its multiple candidate object samples, and the triplet representations of the given-object sample with the candidate object samples constitute one training sample.
In this embodiment, each training sample in the training set used to train the sequence model includes: the representation information of a given-object sample, the representation information of multiple candidate object samples, and the triplet representation of the given-object sample with each candidate object sample.
Exemplarily, in the scenario of predicting a multi-target protein sequence for a given drug, a training sample includes: the structural representation of the drug molecule sample; the structural representation, protein family group representation, and protein function description representation of each of the multiple candidate target protein samples; and the triplet representation of the drug molecule sample with each candidate target protein sample.
Exemplarily, in the scenario of predicting applicable drug molecules for a given target protein, a training sample includes: the structural representation, protein family group representation, and protein function description representation of the target protein sample; the structural representations of the multiple candidate drug molecule samples; and the triplet representation of the target protein sample with each candidate drug molecule sample.
Step S604: Obtain the matching result annotation information of the training samples; the training samples and their matching result annotation information constitute the training set.
The matching result annotation information of a training sample includes the sequence of multiple candidate object samples for the given-object sample. Exemplarily, the matching result annotation information of each training sample can be obtained by manual labeling.
For example, in the scenario of predicting a multi-target protein sequence for a given drug, the matching result annotation information includes the multi-target protein sample sequence of the drug molecule sample. In the scenario of predicting applicable drug molecules for a given target protein, the matching result annotation information includes the multi-drug-molecule sample sequence of the target protein sample.
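Steps S603 and S604 together suggest a record layout along the following lines. This is a hypothetical sketch: the class and field names are invented for illustration, vectors are shown as short lists of floats, and the annotation is encoded as candidate indices in annotated order, which is only one possible encoding.

```python
from dataclasses import dataclass, field
from typing import List

Vector = List[float]


@dataclass
class CandidateFeatures:
    """Per-candidate inputs of the sequence model."""
    representations: List[Vector]    # one vector per modality of the candidate
    triplet_representation: Vector   # representation of (drug, protein, family)


@dataclass
class TrainingSample:
    """One training sample (step S603) with its annotation (step S604)."""
    given_object: List[Vector]                  # representations of the given object
    candidates: List[CandidateFeatures]         # coarse-screened candidate samples
    label_sequence: List[int] = field(default_factory=list)  # annotated order


# Drug-to-targets scenario: one structural vector for the drug, three modality
# vectors (structure, family group, function description) per candidate protein.
sample = TrainingSample(
    given_object=[[0.1, 0.2]],
    candidates=[
        CandidateFeatures([[0.3], [0.4], [0.5]], [0.6, 0.7]),
        CandidateFeatures([[0.8], [0.9], [1.0]], [1.1, 1.2]),
    ],
    label_sequence=[1, 0],  # annotators ranked the second candidate first
)
```

In the target-to-drugs scenario the roles swap: `given_object` would hold the three protein vectors and each candidate a single structural vector.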
Step S605: Train the sequence model using the training set to obtain the trained sequence model.
Specifically, for each training sample in the training set, the representation information of the given-object sample, the representation information of its multiple candidate object samples, and the triplet representations of the given-object sample with the candidate object samples are input into the sequence model. The encoding module of the sequence model fuses the representation information of the given-object sample, the representation information of each candidate object sample, and the corresponding triplet representation to obtain the fused representation of each candidate object sample. The ranking module of the sequence model ranks the candidate object samples according to their fused representations and determines the matching prediction result. A loss function value is computed from the matching prediction result and the matching result annotation information of the training sample, and the model parameters of the encoding module and the ranking module are optimized according to the loss function value to obtain the trained sequence model.
The effect of this embodiment is illustrated with the scenario of predicting a multi-target protein sequence for a given drug. Multiple drug molecule samples are sampled from the pre-built molecular-target relationship knowledge graph, and the trained molecular-target relationship prediction model predicts multiple candidate target protein samples for each drug molecule sample. Based on the knowledge graph, the triplet formed by each drug molecule sample, each candidate target protein sample, and the protein family group of that candidate target protein sample is represented to obtain the triplet representation. The structural representation of the drug molecule sample, the structural representation, protein family group representation, and protein function description representation of each candidate target protein sample, and the triplet representations constitute a training sample; the annotation information of the training samples is obtained to form the training set. The sequence model is trained on this training set, which improves the recall of the training result and yields the trained sequence model. Given the representation information of the input multimodal feature information, the trained sequence model can rank the candidate target proteins precisely and determine the multi-target protein sequence of the drug molecule precisely, so that even a relatively short predicted multi-target protein sequence achieves a high target recall.
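The notion of target recall for a short predicted sequence can be made concrete with a recall-at-K computation. The function name and the example identifiers are hypothetical; the application does not state how recall is measured, so this is only one common definition.

```python
def recall_at_k(predicted_sequence, known_targets, k):
    """Fraction of the known targets that appear in the first k predictions."""
    known = set(known_targets)
    if not known:
        raise ValueError("known_targets must not be empty")
    hits = set(predicted_sequence[:k]) & known
    return len(hits) / len(known)


# Hypothetical ranked prediction for one drug and its two known targets.
predicted = ["P4", "P1", "P7", "P3"]
known = ["P1", "P3"]
recall_short = recall_at_k(predicted, known, 2)  # only P1 is in the top 2
recall_long = recall_at_k(predicted, known, 4)   # both targets are in the top 4
```

A better ranking moves known targets toward the front of the sequence, so the same recall is reached with a smaller K.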
The drug multi-target prediction scheme provided by this application uses the latest knowledge graph technology to perform coarse ranking and coarse screening of known target proteins (or drug molecules), yielding a small number of candidate target proteins (or candidate drug molecules). It then combines a transformer encoding module with a ListWise ranking algorithm, fusing the representation information of the multimodal feature information, to perform fine ranking and fine screening of the candidate target proteins (or candidate drug molecules), and obtains the multi-target protein sequence of the drug molecule (or the multi-drug molecule sequence matching the target protein). Compared with other techniques for predicting drug-target relationships, it matches drug molecules with target proteins more efficiently. Training the molecule-target relationship prediction model used for coarse ranking by contrastive learning greatly narrows the search range for multi-target proteins (or multiple drug molecules). The resulting molecule-target relationship prediction model is not limited to the task of predicting the interaction strength between a drug molecule and a target; it can also be used to evaluate compounds with good properties for disease medication, to provide efficient and accurate target protein reference sequences for drug development, and in drug search tasks for the target proteins of major diseases. In addition, the combination of coarse ranking and fine ranking outperforms other target screening models and drug screening models in both the accuracy of matching drug molecules with target proteins and computational performance.
The scheme provided by this application has been verified by wet-lab experimental results to be capable of the task of predicting the multiple target proteins of a molecule, and it efficiently provides drug developers with screening results in the form of the multi-target protein sequence of a drug molecule. Drug developers no longer need to perform manual high-throughput screening over tens of thousands of known target proteins to locate multiple targets. They only need to run a small number of verification experiments on the multi-target protein sequence that the scheme generates automatically for the drug molecule. This greatly reduces the workload of target protein screening, saves manpower, material resources, and time, allows multi-target lead compounds relevant to the disease under study to be obtained quickly, substantially improves drug development efficiency, and advances the development of new drugs.
Fig. 7 is a flowchart of a method for matching drug molecules and target proteins provided by another exemplary embodiment of the present application. The method provided by this embodiment is applied to the scenario of predicting a multi-target protein sequence for a given drug. As shown in Fig. 7, the specific steps of the method are as follows:
Step S701: in response to a drug multi-target protein prediction instruction, obtain the structural information of the given drug molecule, and the structural information, protein family group information, and protein function description information of the target proteins to be screened.
Here, the drug multi-target protein prediction instruction is an instruction that the end-side device sends to the server based on the user's drug multi-target prediction request. It instructs the server to obtain the information of the given drug molecule and of the target proteins to be screened, and to execute the method for matching drug molecules and target proteins.
Optionally, the drug multi-target protein prediction instruction contains identification information of the given drug molecule and/or identification information of the target proteins to be screened. On receiving the instruction, the server extracts from it the identification information of the given drug molecule and/or of the target proteins to be screened. Further, if identification information of the given drug molecule is extracted, the server obtains the structural information of the given drug molecule according to it. If identification information of the target proteins to be screened is extracted, the server obtains their structural information, protein family group information, and protein function description information according to it.
Optionally, the drug multi-target protein prediction instruction may contain identification information of the given drug molecule. The server obtains the structural information of the given drug molecule according to this identification information, takes the known target proteins as the target proteins to be screened, and obtains the structural information, protein family group information, and protein function description information of the known target proteins.
Optionally, the drug multi-target protein prediction instruction may also contain the structural information of the given drug molecule. The server takes the known target proteins as the target proteins to be screened, and obtains the structural information, protein family group information, and protein function description information of the known target proteins.
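The three optional forms of the instruction described above can be sketched as one resolution routine. This is a hedged illustration, not the server's actual interface: `PredictionInstruction`, `resolve_inputs`, the in-memory stores and their contents are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical in-memory stores standing in for the server's data sources.
DRUG_STRUCTURES = {"D1": "CC(=O)OC1=CC=CC=C1C(=O)O"}
TARGET_INFO = {
    "P1": {"structure": "MKTAYIAK", "family": "kinase", "function": "phosphorylation"},
    "P2": {"structure": "MGSSHHHH", "family": "GPCR", "function": "signal transduction"},
}

@dataclass
class PredictionInstruction:
    drug_id: Optional[str] = None          # identification information of the drug
    drug_structure: Optional[str] = None   # or the structure carried directly
    target_ids: list = field(default_factory=list)

def resolve_inputs(instr):
    """Return (drug structure, {target id: info}) for the screening run."""
    if instr.drug_structure is not None:
        drug = instr.drug_structure            # structure carried in the instruction
    elif instr.drug_id is not None:
        drug = DRUG_STRUCTURES[instr.drug_id]  # look up by identification information
    else:
        raise ValueError("instruction names no drug molecule")
    # No targets named: fall back to all known target proteins.
    ids = instr.target_ids or list(TARGET_INFO)
    return drug, {t: TARGET_INFO[t] for t in ids}

drug, targets = resolve_inputs(PredictionInstruction(drug_id="D1"))
print(drug, sorted(targets))
```

When the instruction names no target proteins, the sketch screens every known target, mirroring the fallback described in the text.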
In this embodiment, the various information of drug molecules and target proteins may be stored in the manner provided in any of the foregoing method embodiments. Correspondingly, the manner of obtaining this information corresponds to the storage manner. For details, see the relevant content of the foregoing method embodiments, which is not repeated here.
Step S702: determine the representation information of the drug molecule and the representation information of the target proteins. The representation information of the drug molecule includes the structural representation of the drug molecule, and the representation information of a target protein includes the structural representation, protein family group representation, and protein function description representation of the target protein.
This step is implemented in a manner similar to step S202 or S402 above. For details, see the relevant content of step S202 or S402, which is not repeated here.
Step S703: according to the structural representation of the drug molecule and the structural representations and protein family group representations of the target proteins, predict the relevance information between each target protein and the drug molecule, and according to the relevance information, screen out candidate target proteins from the target proteins to be screened.
In this step, the drug molecule is taken as the given object and the target proteins as the objects to be screened, and the step is implemented by a method similar to steps S403-S404 above. For details, see the relevant content of steps S403-S404, which is not repeated here.
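The coarse screening of step S703 can be illustrated with the sketch below. Cosine similarity between fixed vectors stands in for the trained molecule-target relationship prediction model, which is an assumption made purely for illustration; `cosine`, `coarse_screen`, the vectors and the value of `k` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity, used here as a placeholder relevance score."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def coarse_screen(drug_vec, target_vecs, k):
    """Keep the k targets with the highest relevance score, best first."""
    scored = [(cosine(drug_vec, vec), tid) for tid, vec in target_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tid for _, tid in scored[:k]]

# Hypothetical representations: one drug vector and one vector per target,
# the latter assumed to combine structure and protein family group information.
drug_vec = [1.0, 0.0]
target_vecs = {"P1": [0.9, 0.1], "P2": [0.0, 1.0], "P3": [0.5, 0.5]}

candidates = coarse_screen(drug_vec, target_vecs, k=2)
print(candidates)
```

Only the surviving candidates are passed to the more expensive ranking in step S704.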
Step S704: according to the representation information of the drug molecule and the representation information of the target proteins, rank the candidate target proteins, and according to the ranking result output the multi-target protein sequence matching the drug molecule.
In this step, the drug molecule is taken as the given object, the target proteins as the objects to be screened, and the candidate target proteins as the candidate objects, and the step is implemented by a method similar to steps S405-S409 above. The resulting matching result of the drug molecule and the target proteins is the multi-target protein sequence matching the given drug molecule. For details, see the relevant content of steps S405-S409, which is not repeated here.
The method provided by this embodiment can be applied to the scenario of predicting a multi-target protein sequence for a given drug. In response to a drug multi-target protein prediction instruction, it obtains the structural information of the given drug molecule and the structural information, protein family group information, and protein function description information of the target proteins to be screened, and uses a different pre-trained model for the feature information of each modality to generate the structural representation of the drug molecule and the structural representations, protein family group representations, and protein function description representations of the target proteins. According to the representation information of the multimodal feature information, the trained molecule-target relationship prediction model predicts the relevance information between each target protein and the drug molecule, and the target proteins with relatively high relevance to the drug molecule are selected as candidate target proteins. This achieves coarse screening and greatly reduces the number of candidate target proteins. Further, based on the established molecule-target relationship knowledge graph, the triplet formed by the drug molecule, each candidate target protein, and the group of that candidate target protein is represented, giving the triplet representation of the drug molecule with each candidate target protein, which is the representation information of a further modality. According to the representation information of the multimodal feature information, the trained sequence model fuses the structural representation, protein family group representation, and protein function description representation of each candidate target protein and the corresponding triplet representation with the structural representation of the drug molecule to obtain the fused representation of each candidate target protein. The candidate target proteins are ranked precisely according to their fused representations, and the multi-target protein sequence matching the drug molecule is determined from the precise ranking result. The possible target protein sequence of a given drug molecule can thus be screened out quickly and accurately, which greatly improves the efficiency of drug target screening. Using artificial intelligence saves a large amount of manpower and experimental resources, which shortens the development cycle of multi-target drugs and advances the development of new drugs.
Fig. 8 is a flowchart of a method for matching drug molecules and target proteins provided by another exemplary embodiment of the present application. The method provided by this embodiment is applied to the scenario of predicting applicable drug molecules for a given target protein. As shown in Fig. 8, the specific steps of the method are as follows:
Step S801: in response to a matching instruction for drug molecules applicable to a target protein, obtain the structural information, protein family group information, and protein function description information of the given target protein, and the structural information of the drug molecules to be screened.
Here, the matching instruction for drug molecules applicable to a target protein is an instruction that the end-side device sends to the server based on the user's request to match applicable drugs to a target protein. It instructs the server to obtain the information of the given target protein and of the drug molecules to be screened, and to execute the method for matching drug molecules and target proteins.
Optionally, the matching instruction for drug molecules applicable to a target protein contains identification information of the given target protein and/or identification information of the drug molecules to be screened. On receiving the matching instruction, the server extracts from it the identification information of the given target protein and/or of the drug molecules to be screened. Further, if identification information of the given target protein is extracted, the server obtains the structural information, protein family group information, and protein function description information of the given target protein according to it. If identification information of the drug molecules to be screened is extracted, the server obtains their structural information according to it.
Optionally, the matching instruction for drug molecules applicable to a target protein may contain identification information of the given target protein. The server obtains the structural information, protein family group information, and protein function description information of the given target protein according to this identification information, takes the known drug molecules as the drug molecules to be screened, and obtains the structural information of the known drug molecules. Optionally, the matching instruction may contain the structural information, protein family group information, and protein function description information of the given target protein. The server takes the known drug molecules as the drug molecules to be screened and obtains the structural information of the known drug molecules.
In this embodiment, the various information of drug molecules and target proteins may be stored in the manner provided in any of the foregoing method embodiments. Correspondingly, the manner of obtaining this information corresponds to the storage manner. For details, see the relevant content of the foregoing method embodiments, which is not repeated here.
Step S802: determine the representation information of the target protein and the representation information of the drug molecules. The representation information of the target protein includes the structural representation, protein family group representation, and protein function description representation of the target protein, and the representation information of a drug molecule includes the structural representation of the drug molecule.
This step is implemented in a manner similar to step S202 or S402 above. For details, see the relevant content of step S202 or S402, which is not repeated here.
Step S803: according to the structural representations of the drug molecules and the structural representation and protein family group representation of the target protein, predict the relevance information between the target protein and each drug molecule, and according to the relevance information, screen out candidate drug molecules from the drug molecules to be screened.
In this step, the target protein is taken as the given object and the drug molecules as the objects to be screened, and the step is implemented by a method similar to steps S403-S404 above. For details, see the relevant content of steps S403-S404, which is not repeated here.
Step S804: according to the representation information of the drug molecules and the representation information of the target protein, rank the candidate drug molecules, and according to the ranking result output the multi-drug molecule sequence matching the target protein.
In this step, the target protein is taken as the given object, the drug molecules as the objects to be screened, and the candidate drug molecules as the candidate objects, and the step is implemented by a method similar to steps S405-S409 above. The resulting matching result of the drug molecules and the target protein is the multi-drug molecule sequence matching the given target protein. For details, see the relevant content of steps S405-S409, which is not repeated here.
The method provided by this embodiment can be applied to the scenario of predicting applicable drug molecules for a given target protein. In response to a matching instruction for drug molecules applicable to a target protein, it obtains the structural information, protein family group information, and protein function description information of the given target protein and the structural information of the drug molecules to be screened, and uses a different pre-trained model for the feature information of each modality to generate the structural representations of the drug molecules and the structural representation, protein family group representation, and protein function description representation of the target protein. According to the representation information of the multimodal feature information, the trained molecule-target relationship prediction model predicts the relevance information between the target protein and each drug molecule, and the drug molecules with relatively high relevance to the given target protein are selected as candidate drug molecules. This achieves coarse screening of the drug molecules and greatly reduces the number of drug molecules to be screened. Further, based on the established molecule-target relationship knowledge graph, the triplet formed by each candidate drug molecule, the target protein, and the group of the target protein is represented, giving the triplet representation of the target protein with each candidate drug molecule, which is the representation information of a further modality. According to the representation information of the multimodal feature information, the trained sequence model fuses this representation information to obtain the fused representation of each candidate drug molecule. The candidate drug molecules are ranked precisely according to their fused representations, and the multi-drug molecule sequence matching the target protein is determined from the precise ranking result. The possible sequence of drug molecules that can act effectively on a given target protein can thus be screened out quickly and accurately, which greatly improves the efficiency of drug screening.
Fig. 9 is a flowchart of a method for matching drug molecules and target proteins provided by another exemplary embodiment of the present application. The method provided by this embodiment is applied to the end-side device in the system architecture described above. As shown in Fig. 9, the processing flow of the end-side device is as follows:
Step S901: in response to a matching request for drug molecules and target proteins, obtain the information of the given object to be matched and of the objects to be screened, where one of the given object and the objects to be screened is a drug molecule and the other is a target protein.
By operating the end-side device, the user can specify the given object and the objects to be screened and issue to the end-side device a matching request for drug molecules and target proteins, for example a request to predict a multi-target protein sequence for a given drug, or a request to predict applicable drug molecules for a given target protein. The request may contain the information of the given object to be matched and of the objects to be screened. Optionally, the user may also input the feature information of the given object through the end-side device.
Step S902: send the information of the given object to be matched and of the objects to be screened to the server.
Optionally, the matching request for drug molecules and target proteins may contain the identification information of the given object and the identification information of the objects to be screened. According to this identification information, the server can obtain the feature information of the given object and of the objects to be screened.
Optionally, the matching request for drug molecules and target proteins may contain only the identification information of the given drug molecule. The server can obtain the feature information of the given object according to its identification information, and obtains the feature information of the existing objects to be screened.
Optionally, the drug multi-target protein prediction instruction may also contain the structural information of the given drug molecule, and the server obtains the feature information of the existing objects to be screened.
In this embodiment, the matching result is determined by the server as follows:
According to the structural information of the drug molecule and the structural information, protein family group information, and protein function description information of the target protein, determine the representation information of the drug molecule and the representation information of the target protein, where the representation information of the drug molecule includes the structural representation of the drug molecule and the representation information of the target protein includes the structural representation, protein family group representation, and protein function description representation of the target protein; according to the structural representation of the drug molecule and the structural representation and protein family group representation of the target protein, predict the relevance information between the target protein and the drug molecule, and according to the relevance information screen out candidate objects from the objects to be screened, the objects to be screened being target proteins or drug molecules; and according to the representation information of the drug molecule and of the target protein, rank the candidate objects and output the matching result of the drug molecule and the target protein according to the ranking result.
In addition, the server determines the matching result of the drug molecule and the target protein from the information of the given object and the objects to be screened, together with the structural information of the drug molecule and the structural information, protein family group information, and protein function description information of the target protein. This may be implemented with the method provided by any of the foregoing method embodiments. For the specific implementation and technical effects, see the foregoing method embodiments, which are not repeated here.
Step S903: receive the matching result of the drug molecule and the target protein sent by the server.
Step S904: output the matching result of the drug molecule and the target protein.
After receiving the matching result of the drug molecule and the target protein, the end-side device can output it in visual form, displaying the information of the sequence of multiple target objects matching the given object together with related information on the target objects (such as structural information and identification information), thereby providing reference data to the user.
For example, in the scenario of predicting a multi-target protein sequence for a given drug, the matching result is the multi-target protein sequence matching the given drug molecule. Visualizing the related information of each target protein in the multi-target protein sequence provides reference data to the relevant drug developers, who can then, through manual interactive screening, select druggable drug molecules from a drug design perspective and derive drug structures for the target proteins of a specific disease, ready for the subsequent drug development stages.
In this embodiment, the end-side device may be a terminal, client, application (APP), software product, database, or the like that can interact with the server. The specific product form of the end-side device and its form of interaction with the server are not specifically limited here.
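The end-side flow of steps S901 to S904 can be sketched as follows. The transport, message format and server behavior are not specified in the text, so JSON messages and the stand-in `fake_server` function are assumptions for illustration, as are `run_matching_request` and the identifiers used.

```python
import json

def fake_server(payload):
    """Hypothetical stand-in for the server; returns a fixed ranked list."""
    request = json.loads(payload)
    ranked = [{"id": "P1", "rank": 1}, {"id": "P3", "rank": 2}]
    return json.dumps({"given": request["given"], "matches": ranked})

def run_matching_request(given_id, to_screen_ids, send=fake_server):
    # S901: collect the given object and the objects to be screened.
    request = {"given": given_id, "to_screen": list(to_screen_ids)}
    # S902: send them to the server; S903: receive the matching result.
    result = json.loads(send(json.dumps(request)))
    # S904: output the result (plain text here; the text describes a visual display).
    lines = [f"{m['rank']}. {m['id']}" for m in result["matches"]]
    return f"Matches for {result['given']}:\n" + "\n".join(lines)

output = run_matching_request("D1", ["P1", "P2", "P3"])
print(output)
```

Passing the sender as a parameter keeps the sketch independent of any particular product form of the end-side device.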
Fig. 10 is a schematic structural diagram of an apparatus for matching drug molecules and target proteins provided by an exemplary embodiment of the present application. The apparatus provided by this embodiment can execute the processing flow provided by the embodiments of the method for matching drug molecules and target proteins. As shown in Fig. 10, the apparatus 100 for matching drug molecules and target proteins includes a multimodal information acquisition module 101, a multimodal representation module 102, a coarse screening module 103, and a fine screening module 104.
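The multimodal information acquisition module 101 is configured to obtain the structural information of the drug molecule to be matched, and the structural information, protein family group information, and protein function description information of the target protein.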
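The multimodal representation module 102 is configured to determine the representation information of the drug molecule and the representation information of the target protein. The representation information of the drug molecule includes the structural representation of the drug molecule, and the representation information of the target protein includes the structural representation, protein family group representation, and protein function description representation of the target protein.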
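The coarse screening module 103 is configured to predict the relevance information between the target protein and the drug molecule according to the structural representation of the drug molecule and the structural representation and protein family group representation of the target protein, and to screen out candidate objects from the objects to be screened according to the relevance information, the objects to be screened being target proteins or drug molecules.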
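The fine screening module 104 is configured to rank the candidate objects according to the representation information of the drug molecule and the representation information of the target protein, and to output the matching result of the drug molecule and the target protein according to the ranking result.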
In an optional embodiment, the target proteins are the objects to be screened, the drug molecule is the given object, the candidate objects are candidate target proteins, and the matching result contains the target proteins matching the drug molecule.
In an optional embodiment, the drug molecule is the object to be screened, the target protein is the given object, the candidate objects are candidate drug molecules, and the matching result contains the drug molecules that match the target protein.
In an optional embodiment, when predicting the correlation information between the target protein and the drug molecule according to the structural characterization of the drug molecule and the structural characterization and protein family group characterization of the target protein, the coarse screening module 103 is further configured to:
input the structural characterization of the drug molecule and the structural characterization and protein family group characterization of the target protein into a molecule-target relationship prediction model, and predict the correlation information between the target protein and the drug molecule through the molecule-target relationship prediction model.
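In an optional embodiment, when sorting the candidate objects according to the characterization information of the drug molecule and the characterization information of the target protein, the fine screening module 104 is further configured to: characterize, according to a pre-established molecule-target relationship knowledge graph, the triples containing the given object and the candidate objects, to obtain triple characterizations of the given object and the candidate objects, where the molecule-target relationship knowledge graph contains the interaction relationships between known drug molecules and known target proteins as well as the protein family group information of the known target proteins, and each triple includes a drug molecule, a target protein, and the protein family group of the target protein; input the characterization information of the given object, the characterization information of the candidate objects, and the triple characterizations of the given object and the candidate objects into a sequence model, and fuse them through the encoding module of the sequence model to obtain a fused characterization of each candidate object; and sort the candidate objects through the ranking module of the sequence model according to their fused characterizations, to obtain a first sorting result.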
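In an optional embodiment, when screening out the candidate objects according to the correlation information, the coarse screening module 103 is further configured to: sort the objects to be screened according to their correlation information with the given object, to obtain a second sorting result; and determine, according to the second sorting result, a second preset number of objects to be screened as the multiple candidate objects related to the given object.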
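In an optional embodiment, when outputting the matching result of the drug molecule and the target protein according to the sorting result, the fine screening module 104 is further configured to: determine, according to the first sorting result, a first preset number of candidate objects as the target objects that match the given object, thereby obtaining the matching result of the drug molecule and the target protein; and output the information of the target objects that match the given object. The second preset number is greater than the first preset number.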
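The relationship between the two preset numbers can be sketched as a two-stage top-k selection. In this minimal illustration, `coarse_score` and `fine_score` are hypothetical callables standing in for the molecule-target relationship prediction model and the sequence model; the function name and signature are assumptions, not part of the application.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def two_stage_match(
    candidates: Sequence[T],
    coarse_score: Callable[[T], float],
    fine_score: Callable[[T], float],
    second_preset: int,
    first_preset: int,
) -> List[T]:
    """Return the top `first_preset` objects after coarse and fine screening."""
    if second_preset <= first_preset:
        raise ValueError("second_preset must be greater than first_preset")
    # Coarse screening: sort by correlation with the given object (second
    # sorting result) and keep the top `second_preset` as candidate objects.
    shortlist = sorted(candidates, key=coarse_score, reverse=True)[:second_preset]
    # Fine screening: re-rank only the shortlist (first sorting result) and
    # keep the top `first_preset` as the matched target objects.
    reranked = sorted(shortlist, key=fine_score, reverse=True)
    return reranked[:first_preset]
```

An object dropped during coarse screening never reaches the fine stage, even if the fine model would have scored it highly, which is why the coarse shortlist is kept larger than the final output.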
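In an optional embodiment, the device 100 for matching drug molecules and target proteins further includes a first model training module configured to: acquire a pre-established molecule-target relationship knowledge graph, which contains the interaction relationships between known drug molecules and target proteins as well as the protein family group information of the target proteins; generate a contrastive learning training set by sampling from the molecule-target relationship knowledge graph, where each sample in the training set is a triple containing a drug molecule, the target protein corresponding to the drug molecule, and the protein family group information of that target protein, and the training set contains positive samples and negative samples; and perform contrastive learning training on the molecule-target relationship prediction model according to the contrastive learning training set, to obtain a trained molecule-target relationship prediction model.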
In an optional embodiment, when generating the contrastive learning training set by sampling from the molecule-target relationship knowledge graph, the first model training module is further configured to: sample, from the knowledge graph, a drug molecule, a target protein that has an interaction relationship with the drug molecule, and the protein family group information of that target protein, to form a positive sample; and sample, from the knowledge graph, a drug molecule, a target protein that has no interaction relationship with the drug molecule, and the protein family group information of that target protein, to form a negative sample. The positive samples and negative samples together constitute the contrastive learning training set.
In an optional embodiment, when sampling from the molecule-target relationship knowledge graph a drug molecule, a target protein that has no interaction relationship with the drug molecule, and the protein family group information of that target protein to form a negative sample, the first model training module is further configured to: sample, from the knowledge graph, a drug molecule, a first target protein that has an interaction relationship with the drug molecule, and the protein family group information of the first target protein; sample, from the knowledge graph and according to the protein family group information of the first target protein, a second target protein that belongs to the same group as the first target protein; and form a negative sample from the drug molecule, the second target protein, and the protein family group information of the second target protein.
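The sampling scheme in the two embodiments above can be sketched as follows. The knowledge graph is reduced here to a set of known (drug, protein) interactions and a protein-to-family mapping; these structures, the function names, and the handling of the case where no second target protein exists are assumptions for illustration.

```python
import random
from typing import Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (drug molecule, target protein, protein family group)


def sample_positive(
    interactions: Set[Tuple[str, str]],
    family: Dict[str, str],
    rng: random.Random,
) -> Triple:
    """Sample a drug, a protein it interacts with, and that protein's family group."""
    drug, protein = rng.choice(sorted(interactions))
    return drug, protein, family[protein]


def sample_hard_negative(
    positive: Triple,
    interactions: Set[Tuple[str, str]],
    family: Dict[str, str],
    rng: random.Random,
) -> Optional[Triple]:
    """Sample a second protein from the same family group with no known interaction."""
    drug, first_protein, group = positive
    pool = sorted(
        p for p, g in family.items()
        if g == group and p != first_protein and (drug, p) not in interactions
    )
    if not pool:
        return None  # assumption: skip when the family offers no non-interacting member
    return drug, rng.choice(pool), group
```

Drawing the negative from the same family group as the true target yields a negative that is harder to tell apart from the positive than a randomly chosen protein would be.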
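In an optional embodiment, the device 100 for matching drug molecules and target proteins further includes a second model training module configured to: sample, from the molecule-target relationship knowledge graph, given object samples and the to-be-screened object samples corresponding to each given object sample; use the trained molecule-target relationship prediction model to predict, according to the characterization information of a given object sample and of its corresponding to-be-screened object samples, the correlation information between the given object sample and the to-be-screened object samples, and screen out from the to-be-screened object samples, according to the correlation information, multiple candidate object samples related to the given object sample; characterize, according to the knowledge graph, the triples containing the given object sample and the candidate object samples, to obtain triple characterizations of the given object sample and the candidate object samples, where each triple includes a drug molecule, a target protein, and the protein family group of the target protein; form one training sample from the characterization information of any given object sample, the characterization information of its multiple candidate object samples, and the triple characterizations of the given object sample and the candidate object samples; acquire matching result annotation information for the training samples, the training samples and their annotation information constituting a training set; and train the sequence model with the training set, to obtain a trained sequence model.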
In an optional embodiment, when training the sequence model with the training set to obtain the trained sequence model, the second model training module is further configured to: input, based on the training set, the characterization information of the given object sample in a training sample, the characterization information of its multiple candidate object samples, and the triple characterizations of the given object sample and the candidate object samples into the sequence model, and fuse them through the encoding module of the sequence model to obtain a fused characterization of each candidate object sample; sort the candidate object samples through the ranking module of the sequence model according to their fused characterizations, and determine a matching prediction result; and calculate a loss function value according to the matching prediction result and the matching result annotation information of the training sample, and optimize the model parameters of the encoding module and the ranking module according to the loss function value, to obtain the trained sequence model.
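The application does not name a specific loss function. As one possible instance, the sketch below computes a listwise softmax cross-entropy between the ranking scores of the candidate object samples and binary match annotations; the choice of loss, the function name, and the binary label format are all assumptions.

```python
import math
from typing import Sequence


def listwise_softmax_loss(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average negative log-softmax of the scores over candidates labelled as matches."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one candidate must be labelled as a match")
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return -sum(y * (s - log_z) for s, y in zip(scores, labels)) / n_pos
```

The loss decreases as the annotated matches receive higher scores than the other candidates, so minimizing it pushes the ranking module toward placing true matches first.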
The device provided by the embodiments of the present application can be used to execute the method provided by any of the above method embodiments. The specific functions implemented and the technical effects achieved are not repeated here.
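Fig. 11 is a schematic structural diagram of a server provided by an exemplary embodiment of the present application. As shown in Fig. 11, the server 110 includes a processor 1101 and a memory 1102 communicatively connected to the processor 1101, and the memory 1102 stores computer-executable instructions. The processor executes the computer-executable instructions stored in the memory to implement the solution provided by any of the above method embodiments. The specific functions and the technical effects achieved are not repeated here.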
The embodiments of the present application further provide a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the solution provided by any of the above method embodiments. The specific functions and the technical effects achieved are not repeated here.
The embodiments of the present application further provide a computer program product that includes a computer program stored in a readable storage medium. At least one processor of an electronic device can read the computer program from the readable storage medium, and execution of the computer program by the at least one processor causes the electronic device to execute the solution provided by any of the above method embodiments. The specific functions and the technical effects achieved are not repeated here.
In addition, some of the flows described in the above embodiments and drawings contain multiple operations that appear in a specific order. It should be clearly understood, however, that these operations may be executed out of the order in which they appear herein, or in parallel. The sequence numbers of the operations serve only to distinguish the operations from one another and do not in themselves represent any order of execution. Furthermore, these flows may include more or fewer operations, and the operations may be executed sequentially or in parallel. It should be noted that descriptions such as "first" and "second" herein are used to distinguish different messages, devices, modules, and the like; they do not represent a sequence, nor do they require that the "first" and the "second" be of different types. "Multiple" means two or more, unless otherwise expressly and specifically defined.
Other embodiments of the present application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the present application that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in the present application. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present application being indicated by the following claims.
It should be understood that the present application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.
Claims (14)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310094426.7A CN116052762A (en) | 2023-01-13 | 2023-01-13 | Method and server for matching drug molecules and target proteins |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN116052762A true CN116052762A (en) | 2023-05-02 |
Family
ID=86116332
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310094426.7A Pending CN116052762A (en) | 2023-01-13 | 2023-01-13 | Method and server for matching drug molecules and target proteins |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116052762A (en) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117854630A (en) * | 2023-12-26 | 2024-04-09 | 之江实验室 | Multi-target drug discovery system based on artificial intelligence |
| CN120199516A (en) * | 2025-05-27 | 2025-06-24 | 长春中医药大学 | A drug efficacy prediction method for drug development |
| CN120319301A (en) * | 2025-06-17 | 2025-07-15 | 上海金福康制药工程技术有限公司 | Drug target prediction method, device, equipment, medium and product |
| CN120319301B (en) * | 2025-06-17 | 2025-10-03 | 上海金福康制药工程技术有限公司 | Drug target prediction methods, devices, equipment, media and products |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||