WO2025028367A1 - Compound library generation method, compound library generation system, computer program, and learning model generation method - Google Patents
Compound library generation method, compound library generation system, computer program, and learning model generation method
- Publication number
- WO2025028367A1 (PCT/JP2024/026484)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- compound
- compounds
- target substance
- information
- binding ability
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/60—In silico combinatorial chemistry
- G16C20/62—Design of libraries
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
Definitions
- the present invention relates to a compound library generation method, a compound library generation system, a computer program, and a learning model generation method.
- a compound library contains a large number of compounds. It takes time and money to search through the large number of compounds contained in a compound library for compounds that have the ability to bind to a target. To efficiently find useful compounds, a compound library that contains a high proportion of compounds that have the ability to bind to a target is desirable.
- the main objective of this disclosure is to provide a compound library generation method and the like that can realize the construction of a compound library that increases the proportion of compounds that have the ability to bind to a target.
- a computer executes a process of acquiring information on a plurality of compounds stored in a first compound library, classifying the plurality of compounds stored in the first compound library into a group of compounds that have the ability to bind to a target substance and a group of compounds that do not have the ability to bind to a target substance, using a learning model that has been trained to output information indicating the binding ability of the compounds to a target substance when the compound information of the compounds is input, and generating a second compound library including the compounds classified into the group of compounds that have the ability to bind to the target substance.
- the compound library generation system includes a control unit that acquires information on a plurality of compounds stored in a first compound library, and classifies the plurality of compounds stored in the first compound library into a group of compounds that have the ability to bind to a target substance and a group of compounds that do not have the ability to bind to a target substance, using a learning model that has been trained to output information indicating the binding ability of the compounds to a target substance when the compound information of the compounds is input, and executes a process of generating a second compound library including the compounds classified into the group of compounds that have the ability to bind to the target substance.
- a computer program causes a computer to execute a process of acquiring information on a plurality of compounds stored in a first compound library, classifying the plurality of compounds stored in the first compound library into a group of compounds that have the ability to bind to a target substance and a group of compounds that do not have the ability to bind to a target substance, using a learning model that has been trained to output information indicating the binding ability of the compounds to a target substance when the compound information of the compounds is input, and generating a second compound library including the compounds classified into the group of compounds that have the ability to bind to the target substance.
- a method for generating a learning model acquires training data for a plurality of compounds stored in a compound library, the training data including compound information indicating the structure or properties of the compound and information indicating the binding ability to a target substance, and generates a learning model trained to output information indicating the binding ability to a target substance when compound information is input based on the acquired training data.
- the present disclosure makes it possible to construct a compound library that increases the proportion of compounds that have the ability to bind to a target.
- FIG. 1 is a diagram showing an overview of a compound library generation system according to an embodiment of the present invention.
- FIG. 2 is a block diagram showing an example of the configuration of an information processing device and a terminal device.
- FIG. 3 is an explanatory diagram showing an overview of a learning model and an example of the contents of information stored in a training DB.
- FIG. 4 is a flowchart illustrating an example of a learning model generation processing procedure.
- FIG. 5 is a flowchart illustrating an example of a procedure for generating a focused library.
- A flowchart illustrating an example of a learning model generation processing procedure executed by an information processing apparatus according to a second embodiment.
- A flowchart illustrating an example of a procedure for generating a predicted value of binding information according to the third embodiment.
- An explanatory diagram showing an overview of a learning model of the fourth embodiment and an example of the contents of information stored in a training DB.
- First Embodiment: FIG. 1 is a diagram showing an overview of a compound library generation system 100 according to the present embodiment.
- the compound library generation system 100 includes an information processing device 1 as a main device.
- the information processing device 1 is communicably connected to a terminal device 2 via a network N such as the Internet.
- the number of terminal devices 2 may be one or three or more.
- the information processing device 1 is a device capable of various types of information processing and sending and receiving information, and is, for example, a server computer, a personal computer, a quantum computer, etc.
- the terminal device 2 is an information processing terminal used by a person in charge of a drug discovery company, which is an example of a user.
- the terminal device 2 is, for example, a personal computer, a smartphone, a tablet terminal, etc.
- the information processing device 1 receives, via the terminal device 2, a basic library 31, which is a compound library held by a drug discovery company.
- the information processing device 1 generates a focused library 32 according to the received basic library 31, and provides the generated focused library 32 to the drug discovery company via the terminal device 2.
- the basic library 31 corresponds to the first compound library, and the focused library 32 corresponds to the second compound library.
- the basic library 31 is a compound library that manages compounds in a database.
- the basic library 31 contains a large number of compounds, including compounds obtained in past drug discovery research at drug discovery companies, compounds obtained from outside, and the like.
- the basic library 31 contains information on each of these compounds, such as the compound name, structural formula, physical properties, and physicochemical characteristics.
- the basic library 31 may be a library that is independently held by a company, and contains information on a large number of compounds that are independently owned by the drug discovery company.
- the focused library 32 is a compound library generated based on the basic library 31, and is a library that selectively stores compounds having the desired properties among the compounds stored in the basic library 31. More specifically, the focused library 32 selectively stores compounds that have activity against a target substance.
- the target substance may be any of DNA, RNA, proteins, etc.
- compound discovery involves screening the group of compounds in the basic library 31 to find effective candidate compounds. It takes a lot of time and money to investigate the activity of all the compounds in the basic library 31 against a target substance. This system improves the efficiency of screening work by providing a focused library 32 that has a higher proportion of compounds that are active against the target substance.
- in this embodiment, the target substance is RNA, and a focused library 32 is generated in which the proportion of low-molecular-weight compounds that bind to the target RNA is increased.
- the target RNA may be non-translated RNA.
- FIG. 2 is a block diagram showing an example of the configuration of the information processing device 1 and the terminal device 2.
- the information processing device 1 includes a control unit 11, a storage unit 12, a communication unit 13, a display unit 14, an operation unit 15, and an input/output unit 16.
- the information processing device 1 may be a multi-computer consisting of multiple computers, or may be a virtual machine virtually constructed by software.
- the control unit 11 has one or more arithmetic processing devices such as a CPU (Central Processing Unit), an MPU (Micro-Processing Unit), or a GPU (Graphics Processing Unit).
- the control unit 11 controls each component to execute processing using built-in memory such as a ROM (Read Only Memory) or RAM (Random Access Memory), a clock, a counter, etc.
- Each functional unit of the information processing device 1 may be realized by software, hardware (e.g., an FPGA or ASIC), or a combination of these.
- the storage unit 12 includes a non-volatile memory such as a hard disk, a flash memory, or an SSD (Solid State Drive).
- the storage unit 12 may be an external storage device connected to the information processing device 1.
- the storage unit 12 stores various computer programs and data referenced by the control unit 11.
- the storage unit 12 of this embodiment stores a program 1P for causing a computer to execute processing related to the generation of the focused library 32, a learning model 121, and a training DB (database) 122.
- the learning model 121 is a model generated by machine learning. It is expected that the learning model 121 will be used as a program module that constitutes part of artificial intelligence software.
- a computer program (computer program product) including program 1P may be provided by a non-transitory recording medium 1A on which the computer program is recorded in a readable manner.
- the storage unit 12 stores the computer program read from the recording medium 1A by a reading device (not shown).
- the recording medium 1A is, for example, a magnetic disk, an optical disk, or a semiconductor memory.
- the computer program may also be downloaded from an external server connected to a communications network and stored in the storage unit 12.
- Program 1P may be a single computer program or may be composed of multiple computer programs, and may be executed on a single computer or on multiple computers interconnected by a communications network.
- the communication unit 13 includes a communication module that communicates with external devices via the network N.
- the control unit 11 transmits and receives various information to and from the terminal device 2 via the communication unit 13.
- the display unit 14 includes a display device such as a liquid crystal display or an organic EL (Electro Luminescence) display.
- the display unit 14 displays information to be notified to the user according to instructions from the control unit 11.
- the operation unit 15 is an interface that accepts operations from the user.
- the operation unit 15 includes, for example, a keyboard, a mouse, a touch panel device with a built-in display, a speaker, and a microphone.
- the operation unit 15 accepts operation input from the user and sends a control signal according to the operation content to the control unit 11.
- the input/output unit 16 has an input/output interface that connects to an external device via a wired or wireless connection.
- the detection device 4 is connected to the input/output unit 16.
- the detection device 4 is a device that detects the interaction between the target RNA and the compound.
- the detection device 4 measures the interaction between the target RNA and the compound, for example, using surface plasmon resonance (SPR).
- the interaction analysis method is not limited to SPR, and may be, for example, isothermal titration calorimetry, mass spectrometry, nuclear magnetic resonance spectroscopy, melting temperature measurement, absorption spectroscopy, fluorescence spectroscopy, circular dichroism spectroscopy, etc.
- the detection device 4 outputs detection data obtained by detection to the information processing device 1.
- the information processing device 1 may acquire the detection data stored in a specified memory area through the detection device 4.
- the terminal device 2 includes a control unit 21, a storage unit 22, a communication unit 23, a display unit 24, and an operation unit 25.
- the control unit 21 includes one or more arithmetic processing units such as a CPU, an MPU, or a GPU.
- the control unit 21 uses built-in memory such as ROM or RAM, a clock, a counter, etc. to control each component and execute processing.
- the storage unit 22 includes a non-volatile memory such as a hard disk, a flash memory, or an SSD.
- the storage unit 22 stores various computer programs and data referenced by the control unit 21.
- the storage unit 22 stores a program 2P for causing a computer to execute processing related to the acquisition of a focused library 32, and a basic library 31.
- the storage unit 22 may also store a focused library 32 received from the information processing device 1.
- the communication unit 23 includes a communication module that realizes communication via the network N.
- the display unit 24 includes a display device such as a liquid crystal display or an organic EL display.
- the operation unit 25 is an interface that accepts user operations.
- the operation unit 25 includes, for example, a keyboard, a mouse, a touch panel device with a built-in display, a speaker, a microphone, etc.
- FIG. 3 is an explanatory diagram showing an overview of the learning model 121 and an example of the information stored in the training DB 122.
- the training DB 122 is a database that stores training data used for training the learning model 121. As shown in FIG. 3, the training DB 122 stores records that link compound information and binding information, etc., using a compound ID that identifies the compound as a key.
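The record layout described above can be pictured with a minimal sketch. The field names (`compound_id`, `descriptors`, `binds`), the descriptor names, and the in-memory dictionary standing in for the training DB 122 are all illustrative assumptions, not details given in the source.

```python
from dataclasses import dataclass

@dataclass
class TrainingRecord:
    """One hypothetical row of the training DB, keyed by compound ID."""
    compound_id: str
    descriptors: dict[str, float]  # compound information (e.g. molecular descriptors)
    binds: bool                    # binding information: presence or absence of binding ability

# Assumption: a plain dict keyed by compound ID stands in for the database.
training_db: dict[str, TrainingRecord] = {}

def store(record: TrainingRecord) -> None:
    """Link compound information and binding information under the compound ID."""
    training_db[record.compound_id] = record

# Made-up example values, for illustration only.
store(TrainingRecord("C001", {"mol_weight": 312.4, "logp": 2.1}, True))
store(TrainingRecord("C002", {"mol_weight": 198.2, "logp": 0.7}, False))
```

Keying by compound ID lets the compound information and the binding information be acquired at different times and still end up in the same record.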
- the group of compounds that makes up the training data, i.e., the multiple compounds included in the training DB 122, is a portion of the compounds extracted from the compounds stored in the basic library 31.
- Compound information includes information that represents the structure or physical properties of a compound.
- Compound information includes, for example, molecular descriptors of the compound.
- Molecular descriptors are numerical representations of the structural features and physicochemical properties of a substance to make them easier to handle on a computer.
- Molecular descriptors can be calculated from the structural formula of a substance, and can be obtained using known software such as alvaDesc, Dragon, Codessa, RDKit, and Mordred.
- Compound information may include values of multiple types of molecular descriptors.
- Compound information is not limited to molecular descriptors, but may also include, for example, structural formulas expressed as character strings according to the SMILES (Simplified Molecular Input Line Entry System) notation, molecular graphs in which chemical structural formulas are converted into graph information, etc.
- the binding information is information indicating the binding ability to the target RNA, and includes the presence or absence of the compound's binding ability to the target RNA.
- the binding information may further include the degree of binding strength to the target RNA, the binding ratio, the presence or absence of structural changes upon binding, etc.
- the binding information is obtained by analyzing detection data indicating the interaction between the compound and the target RNA obtained by an experiment using the detection device 4. For example, as an indicator for determining the presence or absence of binding ability, it can be determined that the compound has binding ability when the Resonance Unit (RU value) of the SPR signal obtained from the compound is equal to or greater than a preset threshold value.
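The threshold rule for the SPR signal can be sketched as follows. The function names, the compound IDs, and the RU values and threshold are hypothetical; the source only states that a compound is judged to have binding ability when its RU value is equal to or greater than a preset threshold.

```python
def has_binding_ability(ru_value: float, threshold: float) -> bool:
    """Return True when the RU value is equal to or greater than the preset threshold."""
    return ru_value >= threshold

def label_compounds(ru_by_compound: dict[str, float], threshold: float) -> dict[str, bool]:
    """Derive a presence/absence label for every measured compound."""
    return {cid: has_binding_ability(ru, threshold) for cid, ru in ru_by_compound.items()}

# Made-up RU values and threshold, for illustration only.
labels = label_compounds({"C001": 35.0, "C002": 4.2, "C003": 10.0}, threshold=10.0)
```

A value exactly at the threshold counts as binding, matching the "equal to or greater than" wording.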
- the binding information used in the training data may be a simulation value obtained by a specified algorithm.
- Examples of calculation methods for the simulation value include quantum chemical calculations and molecular dynamics calculations.
- the simulation value obtained by each calculation method can be obtained using known theoretical calculation software.
- the learning model 121 receives compound information of a compound and outputs binding information indicating the binding ability of the compound to a specific target RNA.
- the specific target RNA recognized by the learning model 121 is also referred to as the first target RNA.
- the learning model 121 of this embodiment outputs a classification result indicating the presence or absence of binding ability to the first target RNA, i.e., whether or not the compound has binding ability.
- the learning model 121 is used in the compound selection process when generating the focused library 32.
- the learning model 121 is constructed, for example, by a random forest.
- the information processing device 1 creates multiple decision trees with low correlation by randomly selecting features to be used in model construction based on data sampled from the training data, and generates the learning model 121 using the multiple decision trees.
- the final output of the learning model 121 is the majority vote or average of the estimated values from each decision tree.
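The random forest ingredients named above (sampling the training data, randomly selecting features, and combining tree outputs by majority vote or averaging) can be sketched with the standard library alone. This is a simplified illustration, not the patent's implementation: the decision trees themselves are omitted, and all function names and sample values are assumptions.

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_sample(rows: list, rng: random.Random) -> list:
    """Draw len(rows) rows with replacement from the training data."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(features: list[str], n: int, rng: random.Random) -> list[str]:
    """Pick n distinct features at random, which keeps trees weakly correlated."""
    return rng.sample(features, n)

def majority_vote(tree_votes: list[bool]) -> bool:
    """Classification output: the class most trees voted for (a tie counts as False here)."""
    counts = Counter(tree_votes)
    return counts[True] > counts[False]

def average_estimate(tree_estimates: list[float]) -> float:
    """Regression-style output: the mean of the per-tree estimates."""
    return mean(tree_estimates)

rng = random.Random(0)
rows = ["C001", "C002", "C003", "C004"]
sample = bootstrap_sample(rows, rng)
subset = random_feature_subset(["logp", "mol_weight", "tpsa"], 2, rng)
decision = majority_vote([True, True, False])
```

Resolving a tie as "no binding ability" is a design choice of this sketch; the source does not say how ties are handled.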
- the compound information that is input to the learning model 121 includes multiple molecular descriptors.
- the compound information may also include a SMILES string, a molecular graph, etc.
- One or more types of molecular descriptors selected from a large number may be input to the learning model 121.
- the type of molecular descriptor to be input to the learning model 121 can be set appropriately depending on the type of the first target RNA.
- the molecular descriptor to be input to the learning model 121 is automatically determined by the information processing device 1 based on the contribution of the input information obtained when learning the learning model 121.
- the information processing device 1 executes an estimation process using the learning model 121 using a predetermined number of molecular descriptors selected in advance, thereby calculating the contribution (variable importance) of each molecular descriptor in estimating binding information in the learning model 121.
- the contribution can be calculated based on, for example, SHAP (SHapley Additive exPlanations) values, Gini coefficients, OOB (out-of-bag) data, LIME (Local Interpretable Model-Agnostic Explanations), etc.
- the information processing device 1 determines the molecular descriptors to be used as input information to the learning model 121 by preferentially selecting a predetermined number of molecular descriptors set in advance in descending order of contribution based on the calculated contribution of each molecular descriptor.
- the above-mentioned process makes it possible to identify molecular descriptors suitable for input to the learning model 121 from among the many defined molecular descriptors.
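The descriptor selection step amounts to sorting by contribution and keeping a preset number of entries. In the sketch below the contribution values are made up and assumed to have been computed already (for example from SHAP values); the function and descriptor names are hypothetical.

```python
def select_descriptors(contributions: dict[str, float], k: int) -> list[str]:
    """Return the k descriptor names with the highest contribution, in descending order."""
    ranked = sorted(contributions, key=contributions.get, reverse=True)
    return ranked[:k]

# Made-up contribution (variable importance) values, for illustration only.
contributions = {"logp": 0.31, "mol_weight": 0.12, "tpsa": 0.27, "n_rings": 0.05}
selected = select_descriptors(contributions, k=2)
```

Only the selected descriptors would then be computed and fed to the model for the remaining compounds.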
- the binding information that is output from the learning model 121 includes whether or not the compound has the binding ability to the first target RNA.
- the learning model 121 can be generated by preparing training data in which each piece of compound information is labeled with the presence or absence of binding ability to the first target RNA as the correct value, and using the training data to train an untrained model.
- the information processing device 1 inputs compound information in the training data to the learning model 121, and trains the learning model 121 so that the output from the learning model 121 approximates the correct value.
- the information processing device 1 generates the learning model 121 by adjusting the parameters in the learning model 121 using the presence or absence of binding ability associated with the input compound information as the correct value.
- the output information from the learning model 121 is not limited to the presence or absence of binding ability to the first target RNA.
- the learning model 121 may be configured to further output, for example, the degree of binding strength to the first target RNA, the binding ratio, the presence or absence of structural change upon binding, etc.
- the configuration of the learning model 121 is not limited to the example shown in FIG. 3, and it is sufficient if it is capable of identifying information indicating binding ability to target RNA from compound information.
- the learning model 121 may be a model based on other learning algorithms, such as a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Graph Neural Network (GNN), a Transformer, a Support Vector Machine, Logistic Regression, or eXtreme Gradient Boosting (XGBoost).
- the learning model 121 may be composed of multiple individual learning models constructed by different learning algorithms.
- the learning model 121 may include, for example, a first individual learning model constructed by random forest and a second individual learning model constructed by XGBoost.
- the compound library generation system 100 constructs a learning model 121 using training data generated based on compound information of some of the compounds stored in the basic library 31.
- the obtained learning model 121 is used to estimate the presence or absence of binding ability in the large number of compounds stored in the basic library 31, and compounds estimated to have binding ability are extracted to generate a new focused library 32.
- the processing method for generating the focused library 32 is described in detail below.
- FIG. 4 is a flowchart showing an example of a process for generating the learning model 121.
- the processes in each of the following flowcharts are executed by the control unit 11 in accordance with the program 1P stored in the storage unit 12 of the information processing device 1.
- the control unit 11 of the information processing device 1 receives from the user, via the terminal device 2, the selection of the target RNA for which the focused library 32 is to be generated, and user identification information to identify the user (step S11).
- the control unit 11 receives information indicating the name and structure of the target RNA from the terminal device 2, for example, based on the user's operation of the operation unit 25.
- the user can select any target RNA (e.g., a first target RNA) depending on the purpose of the drug discovery research.
- the control unit 11 acquires a compound group including a plurality of compounds to be used in the training data through the terminal device 2 (step S12).
- the compound group is extracted from the compounds stored in the basic library 31.
- the control unit 11 receives from the terminal device 2 information on the compounds stored in the basic library 31 that is necessary for the processing described below (e.g., the compound's name, molecular structure, necessary physical properties, etc.) for each compound included in the compound group.
- the user can arbitrarily select compounds to be used in the training data according to the purpose of the drug discovery research.
- the user determines the number of compounds to be used in the training data according to the time and cost allowed for generating the training data, and extracts compounds from the basic library 31 so that the number is the determined number.
- the compound group to be used in the training data may be determined by the information processing device 1 by randomly extracting a predetermined number of compounds from the basic library 31.
- the control unit 11 acquires binding information for each compound included in the acquired compound group, including the presence or absence of binding ability to the first target RNA, the degree of binding strength, the binding ratio, the presence or absence of structural changes upon binding, etc. (step S13).
- the binding information for the first target RNA is obtained by analyzing the interaction between each compound and the target RNA detected using the detection device 4.
- the control unit 11 may acquire the binding information by, for example, accepting a manually determined result of the binding ability via the operation unit 15 or the communication unit 13, or may automatically derive the binding information based on the detection data accepted from the detection device 4 via the input/output unit 16.
- the control unit 11 acquires compound information for each compound included in the compound group (step S14). For example, the control unit 11 derives a structural formula using the SMILES notation from the molecular structure of the compound acquired in step S12, and calculates multiple types of molecular descriptors that are set in advance based on the derived structural formula.
- the control unit 11 associates the obtained compound information with information indicating binding ability and stores them in the training DB 122 (step S15). Through the above process, training data is generated.
- the control unit 11 acquires training data in which the compound information of each training compound is labeled with the presence or absence of binding ability to the first target RNA, based on the information stored in the training DB 122 (step S16).
- the control unit 11 uses the acquired training data to generate a learning model 121 that outputs the presence or absence of binding ability of the compound to the first target RNA when compound information of the compound is input (step S17). Specifically, the control unit 11 inputs the compound information contained in the training data to the learning model 121, and optimizes various parameters so that the output from the learning model 121 approximates the correct value. For example, when training is completed because the number of training iterations satisfies a predetermined criterion, the control unit 11 stores definition information regarding the trained learning model 121 in the storage unit 12 as the trained learning model 121.
- the learning model 121 is constructed by the processes from step S16 to step S17.
- the group of compounds used as training data for generating the learning model 121 is not limited to being extracted from compounds stored in the basic library 31, i.e., the library from which the focused library 32 is generated.
- the learning model 121 may be constructed using training data information from a compound library other than the library from which the focused library 32 is generated.
- FIG. 5 is a flowchart showing an example of a process for generating the focused library 32. After completing step S17 in the flowchart of FIG. 4, for example, the control unit 11 of the information processing device 1 executes the following process.
- the control unit 11 of the information processing device 1 receives the basic library 31 held by the user from the terminal device 2 (step S21).
- the control unit 11 may receive only the information required for the processing described below (e.g., the compound's name, molecular structure, etc.) from the information related to each compound contained in the basic library 31.
- the control unit 11 obtains information on compounds to be evaluated among the compounds contained in the basic library 31.
- the compounds to be evaluated may be all compounds stored in the basic library 31, or a portion selected from all the compounds.
- the control unit 11 acquires compound information including molecular descriptors, structural formulas, molecular graphs, etc. for each compound in the acquired basic library 31 (step S22).
- the control unit 11 inputs the acquired compound information for each compound into the learning model 121 (step S23), and acquires the presence or absence of binding ability output from the learning model 121 (step S24).
- the control unit 11 classifies each compound in the basic library 31 into either a hit group or a non-hit group based on the obtained estimation result of the presence or absence of binding ability (step S25).
- the hit group is a group to which compounds that have the ability to bind to the first target RNA belong, and the non-hit group is a group to which compounds that do not have the ability to bind to the first target RNA belong.
- the processing of step S25 corresponds to a selection process that selects compounds that have the ability to bind to the first target RNA from the basic library 31.
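Steps S23 to S25 can be sketched as a simple partition of the library by the model's output. The predictor below is a toy stand-in for the learning model 121, and the function names, compound IDs, and descriptor values are assumptions made for illustration.

```python
from typing import Callable

def classify_library(
    library: dict[str, dict[str, float]],
    predict: Callable[[dict[str, float]], bool],
) -> tuple[list[str], list[str]]:
    """Split compound IDs into a hit group and a non-hit group using the predictor."""
    hits: list[str] = []
    non_hits: list[str] = []
    for compound_id, info in library.items():
        if predict(info):
            hits.append(compound_id)
        else:
            non_hits.append(compound_id)
    return hits, non_hits

def toy_predict(info: dict[str, float]) -> bool:
    """Hypothetical stand-in for the trained model, not a real binding rule."""
    return info["logp"] > 1.0

basic_library = {"C001": {"logp": 2.1}, "C002": {"logp": 0.7}, "C003": {"logp": 1.5}}
hit_group, non_hit_group = classify_library(basic_library, toy_predict)
```

Passing the predictor as a callable keeps the partition logic independent of whichever learning algorithm is actually used.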
- the control unit 11 obtains a score that quantifies the priority of each compound classified into the hit group (step S26).
- the priority is quantified so that the higher the probability that the compound has the ability to bind to the first target RNA, the higher the score.
- the control unit 11 obtains a value of the classification accuracy in the learning model 121, and can use the obtained classification accuracy as the score.
- the classification accuracy can be obtained, for example, from the proportion of decision trees that match the estimate, the confidence level for the classification class at the output node, etc.
- Based on the classification results, the control unit 11 generates a focused library 32 including the compounds classified into the hit group (step S27). In this case, the control unit 11 ranks the compounds classified into the hit group by priority score and stores them in the focused library 32 in descending order of score.
- the generated focused library 32 only needs to include information that allows recognition of the compounds included in the focused library 32, and may be in the form of a compound list that displays multiple compound names in order of score, for example.
- the focused library 32 may be associated with a score for each compound.
- the control unit 11 transmits the generated focused library 32 to the terminal device 2 corresponding to the user identified by the user identification information, i.e., the user of the basic library 31 (step S28), and ends the series of processes.
- the user searches for drug discovery targets using the focused library 32 provided by the information processing device 1.
- the focused library 32 is composed of compounds that are estimated by the learning model 121 to have binding ability to the first target RNA, so that compounds that have activity against the first target RNA can be obtained with a high hit rate.
- Every time the information processing device 1 receives from the terminal device 2 a request to generate a focused library 32 targeting a new target RNA, it repeats the series of processes described above: generation of training data, generation of a learning model 121, and generation of a focused library 32. That is, a different learning model 121 is prepared for each type of target RNA, and compounds are selected using the learning model 121 prepared for the target RNA in question.
- When a selection of a second target RNA is received from the user as the target substance, the information processing device 1 generates a learning model 121 that estimates binding information regarding the second target RNA, and generates a new focused library 32 by evaluating the binding ability of each compound to the second target RNA using the learning model 121.
- the control unit 11 may omit the score acquisition process in step S26.
- the control unit 11 may further select compounds to be stored in the focused library 32 based on the score acquired in step S26.
- the control unit 11 may generate the focused library 32 by extracting only compounds whose scores are equal to or greater than a threshold value, for example.
- the control unit 11 may classify the compounds in the basic library 31 taking into account these estimation results. For example, the control unit 11 classifies the compounds into three or more groups according to the presence or absence of binding ability to the first target RNA and the binding strength. The control unit 11 stores compounds belonging to groups according to preset selection criteria (e.g., having binding ability and strong binding strength) in the focused library 32.
- the control unit 11 may repeatedly perform classification using the learning model 121 multiple times, calculate a total score that is the sum of the priorities calculated each time, and perform ranking based on the calculated total score.
- the control unit 11 may ultimately extract a predetermined number of compounds, for example, in descending order of total score, or compounds whose total score is equal to or greater than a preset threshold, and record the extracted compounds in the focused library 32.
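The repeated-classification variant can be sketched as summing per-run scores and then cutting the ranking by count or by threshold. The function name `rank_by_total_score`, its parameters, and the representation of each run as a compound-to-score dictionary are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def rank_by_total_score(
    runs: Iterable[Dict[str, float]],
    top_n: Optional[int] = None,
    min_total: Optional[float] = None,
) -> List[Tuple[str, float]]:
    """Sum the priority score of each compound over all runs and rank.

    Each element of `runs` maps a compound name to the score from one
    classification run. Compounds missing from a run contribute nothing.
    """
    totals: Dict[str, float] = defaultdict(float)
    for run in runs:
        for compound, score in run.items():
            totals[compound] += score
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    if min_total is not None:   # keep totals at or above the threshold
        ranked = [item for item in ranked if item[1] >= min_total]
    if top_n is not None:       # keep a predetermined number of compounds
        ranked = ranked[:top_n]
    return ranked
```

Either cut-off can be used alone, matching the two extraction options described above; using both applies the threshold first and then the count.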
- the control unit 11 may classify the compounds into a hit group and a non-hit group based on the output results of each individual learning model.
- the control unit 11 obtains the presence or absence of binding ability output from each individual learning model for a certain compound.
- When a classification result indicating that the compound has binding ability is obtained from all of the individual learning models, or from a number of individual learning models equal to or greater than a preset threshold, the control unit 11 classifies the compound into the hit group.
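The consensus rule over individual learning models can be sketched as a vote count. The name `is_hit` and the `min_positive` parameter are hypothetical; the individual models themselves are assumed to exist elsewhere and to return a boolean per compound.

```python
from typing import Optional, Sequence


def is_hit(model_outputs: Sequence[bool], min_positive: Optional[int] = None) -> bool:
    """Decide hit / non-hit from the outputs of several individual models.

    model_outputs[i] is True when individual model i estimates that the
    compound has binding ability. With min_positive=None every model must
    agree; otherwise at least `min_positive` models must report binding.
    """
    if not model_outputs:
        raise ValueError("at least one model output is required")
    required = len(model_outputs) if min_positive is None else min_positive
    return sum(1 for out in model_outputs if out) >= required
```

Requiring unanimity gives a smaller, stricter hit group, while a lower threshold trades precision for coverage.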
- the compounds to be evaluated may include compounds that are included in a compound group.
- the compounds included in the compound group may be classified into a hit group or a non-hit group based on binding information obtained by actual measurement, instead of estimation using the learning model 121.
- the focused library 32 is not limited to containing only compounds that are presumed to have the ability to bind to the first target RNA, but may also contain some compounds that are presumed not to have the ability to bind to the first target RNA.
- a high-quality focused library 32 can be generated based on the user's basic library 31, with an increased proportion of compounds that have the ability to bind to the target RNA.
- Because the focused library 32 is generated by extracting compounds from the basic library 31 owned by a drug discovery company, it is possible to provide a focused library 32 that improves search efficiency while ensuring originality.
- By using the learning model 121, compounds can be selected from the basic library 31 efficiently and accurately. By generating training data by evaluating the interaction between the compound and the target substance through actual measurement, the estimation accuracy of the learning model 121 is improved. By generating the learning model 121 according to the first target RNA, the binding ability to the first target RNA can be accurately estimated using the learning model 121.
- (Modification) The focused library generation process shown in FIG. 5 may be executed by the terminal device 2. In this case, the processes of steps S21 and S28 may be omitted.
- the information processing device 1 executes the generation process of the learning model 121 shown in FIG. 4 and stores the generated learning model 121 in an area accessible to the terminal device 2.
- the information processing device 1 may deploy the generated learning model 121 to the terminal device 2.
- the terminal device 2 accesses the learning model 121 stored in an external device or references the memory unit 22 of its own device to read the learning model 121, and executes the processes of steps S22 to S27 using the read learning model 121 to execute compound selection processing, etc.
- the above configuration allows the focused library 32 to be generated on the terminal device 2 side. This eliminates the need to provide information on the compounds to be evaluated from the terminal device 2 to the information processing device 1, improving the confidentiality of the basic library 31.
- (Second Embodiment) In the second embodiment, some of the compounds that do not have the ability to bind to the first target RNA are removed from the compounds contained in the compound group.
- differences from the first embodiment will be mainly described, and the same reference numerals will be used to designate the same components as the first embodiment, and detailed description thereof will be omitted.
- the multiple compounds included in the training DB 122, i.e., the multiple compounds in the compound group selected from the basic library 31, include both compounds with binding ability and compounds without binding ability, according to the binding information obtained by interaction analysis. It is assumed that, within the compound group, the compounds that do not have binding ability to the first target RNA outnumber those that do. In other words, the data in the training DB 122 is assumed to be imbalanced, with a biased ratio of compounds with binding ability to compounds without binding ability.
- the information processing device 1 of the second embodiment aims to improve the accuracy of the learning model 121 by eliminating some of the compounds that do not have binding ability included in the compound group.
- FIG. 6 is a flowchart showing an example of a process for generating a learning model 121 executed by the information processing device 1 of the second embodiment.
- the control unit 11 of the information processing device 1 executes the processes from step S11 to step S15 of the first embodiment, thereby storing in the training DB 122 the compound information and the binding information acquired in step S15 for all compounds in the compound group. Based on the binding information of each compound stored in the training DB 122, the control unit 11 classifies each compound included in the compound group into compounds that have binding ability to the first target RNA and compounds that do not (step S31).
- the control unit 11 removes some of the compounds that do not have the ability to bind to the first target RNA from among the compounds included in the compound group (step S32).
- the control unit 11 thins out the compounds that do not have the ability to bind to the first target RNA so that the ratio of compounds in the compound group that have the ability to bind to the first target RNA to compounds that do not have the ability to bind (number of compounds that have the ability to bind: number of compounds that do not have the ability to bind) is, for example, 1:1, 1:4, 1:8, 1:16, etc.
- the above ratio is one example, and the ratio in the removal process is not necessarily limited to this value.
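The thinning in step S32 amounts to undersampling the majority class to a target ratio. The sketch below assumes random selection of the non-binding compounds to keep, which the text does not specify; `thin_non_binders`, `negatives_per_positive`, and the `(name, binds)` record layout are hypothetical.

```python
import random
from typing import List, Tuple

Record = Tuple[str, bool]  # (compound name, measured binding ability)


def thin_non_binders(
    records: List[Record], negatives_per_positive: int, seed: int = 0
) -> List[Record]:
    """Undersample non-binding compounds to a 1:negatives_per_positive ratio.

    All binding compounds are kept. Non-binding compounds are drawn at
    random (an assumption; any selection rule could be substituted).
    If there are too few non-binders to reach the ratio, all are kept.
    """
    if negatives_per_positive < 0:
        raise ValueError("negatives_per_positive must be non-negative")
    positives = [r for r in records if r[1]]
    negatives = [r for r in records if not r[1]]
    keep = min(len(negatives), len(positives) * negatives_per_positive)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return positives + rng.sample(negatives, keep)
```

For example, with 2 binding and 20 non-binding compounds, `negatives_per_positive=4` yields a training set of 2 binders and 8 non-binders (a 1:4 ratio).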
- the control unit 11 then executes processing similar to steps S16 to S17, using the data of the compound group after removal of some of the compounds as training data, to generate a learning model 121.
- (Third Embodiment) In the third embodiment, a learning model 121 that estimates binding information for another target substance is generated using predicted values of binding information for that target substance, which are derived from the binding information for a specific target substance and the correlation between the target substances.
- the learning model 121 is trained using training data that associates compound information with binding information.
- predicted values of binding information are used as training data instead of binding information obtained as actual measurement data by a specified interaction analysis method.
- the predicted values of binding information can be calculated using binding information already obtained by actual measurement, taking into account the correlation between multiple target RNAs.
- FIG. 7 is a flowchart showing an example of a process procedure for generating predicted values of binding information in the third embodiment.
- the new target substance is a second target RNA in which adenine at a specific position in the RNA structure of the first target RNA is replaced with uracil.
- the control unit 11 of the information processing device 1 calculates a correlation coefficient that indicates the strength of the correlation between the first target RNA and the second target RNA (step S41).
- the correlation coefficient between the first target RNA and the second target RNA is calculated, for example, by comprehensively evaluating the structural difference between the first target RNA and the second target RNA, specifically the difference between adenine and uracil, using various indices of the substances (for example, the number of hydrogen bond donors, the number of hydrogen bond acceptors, surface area, volume, etc.).
- the control unit 11 derives a predicted value of binding information for the second target RNA based on the calculated correlation coefficient and the binding information for the first target RNA stored in the training DB 122 (step S42). For example, the control unit 11 determines a predicted value of the presence or absence of binding ability for the second target RNA by taking into account the calculated correlation coefficient in addition to the presence or absence of binding ability for the first target RNA.
- the control unit 11 stores the compound information and the predicted value of the binding information for the second target RNA in the training DB 122 as training data (step S43).
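The text does not state how the correlation is combined with the measured labels, so the following is only one hypothetical rule, shown to make the data flow of steps S42 and S43 concrete: labels measured for the first target RNA are carried over to the second target RNA when the correlation coefficient is at or above a cut-off, and are otherwise left undetermined. The function name, the `min_correlation` parameter, and its default of 0.8 are all assumptions.

```python
from typing import Dict, Optional


def predict_second_target_labels(
    first_target_labels: Dict[str, bool],
    correlation: float,
    min_correlation: float = 0.8,
) -> Dict[str, Optional[bool]]:
    """Derive predicted binding labels for a second target (hypothetical rule).

    first_target_labels maps compound name -> measured binding ability for
    the first target. If the two targets are sufficiently correlated the
    measured label is reused as the predicted value; otherwise the predicted
    value is None, meaning no prediction is made for that compound.
    """
    if not -1.0 <= correlation <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    if correlation >= min_correlation:
        return dict(first_target_labels)
    return {compound: None for compound in first_target_labels}
```

Compounds with a `None` prediction would simply be left out of the training data for the second target RNA under this rule.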
- the control unit 11 executes the same processes as steps S16 to S17 of the first embodiment, using training data that includes the predicted values of binding information for the second target RNA, to generate a learning model 121 that estimates, from compound information, the presence or absence of binding ability to the second target RNA.
- the control unit 11 generates a focused library 32 for the second target RNA by selecting compounds that have binding ability for the second target RNA using the generated learning model 121.
- the time and cost required for experiments to generate training data can be reduced, making it easier to generate a focused library 32.
- (Fourth Embodiment) In the fourth embodiment, a learning model 121 capable of estimating binding information for various target RNAs is generated.
- FIG. 8 is an explanatory diagram showing an overview of the learning model 121 of the fourth embodiment and an example of the content of information stored in the training DB 122.
- the learning model 121 of the fourth embodiment receives compound information of a compound and target substance information of a target substance as input, and outputs binding information of the compound to the target substance.
- the target substance information includes information about the target substance, such as information representing the substance name, primary structure (sequence), secondary structure, etc. of the target substance.
- the information processing device 1 accumulates, for example, binding information acquired in the generation process of the focused library 32 targeting multiple types of target substances in the training DB 122. Based on the accumulated information, the information processing device 1 generates training data in which the binding information of the compound to the target substance is labeled as a correct value for the compound information of the compound and the target substance information of the target substance. The information processing device 1 trains the learning model 121 using the generated training data. The learning model 121 trained in this way makes it possible to estimate binding information for a variety of target substances.
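The training data of the fourth embodiment pairs each (compound, target substance) input with its binding label. A minimal sketch, assuming both kinds of information have already been converted to numeric feature vectors (the text does not specify the encoding) and that the two vectors are simply concatenated; `build_training_rows` and its argument names are hypothetical.

```python
from typing import Dict, List, Tuple


def build_training_rows(
    compound_features: Dict[str, List[float]],
    target_features: Dict[str, List[float]],
    measurements: List[Tuple[str, str, bool]],
) -> List[Tuple[List[float], int]]:
    """Build (input vector, label) rows for a target-aware model.

    measurements holds (compound name, target name, binds) triples
    accumulated across focused-library generation runs. Each input vector is
    the compound's features followed by the target's features; the label is
    1 for "has binding ability" and 0 otherwise.
    """
    rows = []
    for compound, target, binds in measurements:
        features = compound_features[compound] + target_features[target]
        rows.append((features, int(binds)))
    return rows
```

Because the target features are part of the input, the same compound can appear in several rows with different labels, one per target substance.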
- binding information for a variety of target substances can be estimated using a single learning model, making it unnecessary to generate a learning model 121 for each type of target substance, and facilitating the generation of a focused library 32.
- REFERENCE SIGNS LIST: 100 compound library generation system; 1 information processing device; 11 control unit; 12 memory unit; 13 communication unit; 14 display unit; 15 operation unit; 16 input/output unit; 121 learning model; 122 training DB; 1P program; 1A recording medium; 2 terminal device; 21 control unit; 22 storage unit; 23 communication unit; 24 display unit; 25 operation unit; 2P program; 2A recording medium; 31 basic library; 32 focused library
Abstract
Provided are a compound library generation method and the like that make it possible to construct a compound library with an increased proportion of compounds that have the ability to bind to a target. In a compound library generation method according to the present invention, a computer executes processing that: acquires information on a plurality of compounds stored in a first compound library; uses a learning model, trained to output information indicating binding ability to a target substance when compound information on a compound is input, to sort the plurality of compounds stored in the first compound library into a group of compounds that have the ability to bind to the target substance and a group of compounds that do not; and generates a second compound library that includes the compounds sorted into the group of compounds that have the ability to bind to the target substance.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2023-126515 | 2023-08-02 | | |
| JP2023126515 | 2023-08-02 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025028367A1 | 2025-02-06 |
Family
ID=94395302
Family Applications (1)
| Application Number | Status | Publication | Priority Date | Filing Date | Title |
|---|---|---|---|---|---|
| PCT/JP2024/026484 | Pending | WO2025028367A1 | 2023-08-02 | 2024-07-24 | Compound library generation method, compound library generation system, computer program, and learning model generation method |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025028367A1 (fr) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2007139037A1 * | 2006-05-26 | 2007-12-06 | Kyoto University | Estimation of protein-compound interaction and rational design of compound library based on chemical genomic information |
| JP2018092575A * | 2016-10-27 | 2018-06-14 | Takeda Pharmaceutical Company Limited | Program, device, and method for predicting the biological activity of compounds |
| JP2022106287A * | 2021-01-06 | 2022-07-19 | Beijing Baidu Netcom Science Technology Co., Ltd. | Affinity prediction method and model training method, apparatus, device, and medium |
| JP2022150078A * | 2021-03-26 | 2022-10-07 | Fujitsu Limited | Information processing program, information processing device, and information processing method |
| JP2022184048A * | 2021-05-31 | 2022-12-13 | Kyushu University | Interaction estimation method, interaction estimation device, and interaction estimation program |
- 2024-07-24: WO application PCT/JP2024/026484, published as WO2025028367A1, status: active, pending
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 24849022; Country of ref document: EP; Kind code of ref document: A1 |