WO2023216065A1 - Differentiable drug design
- Publication number: WO2023216065A1 (PCT/CN2022/091721)
- Authority: WIPO (PCT)
- Prior art keywords: protein, binding energy, generative model
- Legal status: Ceased (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61K—PREPARATIONS FOR MEDICAL, DENTAL OR TOILETRY PURPOSES
- A61K39/00—Medicinal preparations containing antigens or antibodies
- A61K39/395—Antibodies; Immunoglobulins; Immune serum, e.g. antilymphocytic serum
-
- C—CHEMISTRY; METALLURGY
- C07—ORGANIC CHEMISTRY
- C07K—PEPTIDES
- C07K16/00—Immunoglobulins [IGs], e.g. monoclonal or polyclonal antibodies
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/30—Drug targeting using structural data; Docking or binding prediction
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- C—CHEMISTRY; METALLURGY
- C07—ORGANIC CHEMISTRY
- C07K—PEPTIDES
- C07K2317/00—Immunoglobulins specific features
- C07K2317/90—Immunoglobulins specific features characterized by (pharmaco)kinetic aspects or by stability of the immunoglobulin
- C07K2317/92—Affinity (KD), association rate (Ka), dissociation rate (Kd) or EC50 value
Definitions
- This specification relates to the design of drugs (e.g., antibodies) based on a differential approach.
- Bioinformatics has been a popular research area for many biologists.
- Traditionally, researchers have mainly been exploring physiological responses of living organisms based on laboratory experiments with sophisticated instruments.
- Such laboratory experiments often require significant labor and time costs.
- Conventional pharmaceutical drug design can be cumbersome and can require substantial manual intervention, such as the need to continuously rotate and shift proteins and drugs when screening drugs in order to obtain protein-drug compounds (e.g., of amino acids or molecules), and to ultimately find compounds with the best interfacing effect through virtual screening or other means for drug research and development.
- Described embodiments of the subject matter can include one or more features, alone or in combination.
- a computer-implemented method for identifying a target protein corresponding to an object protein includes receiving, by a data processing apparatus, an object protein S1; receiving, by the data processing apparatus, a baseline protein S0; identifying, by the data processing apparatus, a generative model G that receives the baseline protein S0 and outputs a generated protein S2, wherein the generative model G is differentiable with respect to the baseline protein S0; solving, by the data processing apparatus, for a target protein S2* by solving an optimization problem to minimize a binding energy between the object protein S1 and the generated protein S2 output from the generative model G, wherein the generated protein S2 is subject to a constraint set S, and the binding energy between the object protein S1 and the generated protein S2 is calculated by querying a pre-trained binding energy model F that predicts a binding energy between two proteins that are input into the pre-trained binding energy model F; and outputting, by the data processing apparatus, the target protein S2*.
- these general and specific aspects may be implemented using a system, a method, or a computer program, or any combination of systems, methods, and computer programs.
- the foregoing and other described embodiments can each, optionally, include one or more of the following aspects:
- the computer-implemented method further includes performing testing on the target protein S2*; and designing an antibody drug corresponding to an antigen based on the target protein S2*, wherein the object protein S1 includes the antigen.
- the object protein S1 includes an amino acid sequence
- the baseline protein S0 includes one or more amino acid sequences.
- solving the optimization problem to minimize the binding energy between the object protein S1 and the generated protein S2 output from the generative model G includes solving the optimization problem in an iterative manner to update the baseline protein S0 based on a derivative of the pre-trained binding energy model F with respect to the baseline protein S0.
- the computer-implemented method further includes training the generative model G according to differentiable generative modeling by configuring a loss function of the differentiable generative modeling to include a binding energy between two proteins.
- training the generative model according to differentiable generative modeling includes: calculating the loss function by querying the pre-trained binding energy model F; and updating parameters of the generative model G based on the loss function.
- training the generative model according to differentiable generative modeling includes training the generative model G according to generative adversarial network (GAN) to minimize a binding energy between an input protein and a generated protein output from the generative model G, and the generative model G is a generator of the GAN.
- training the generative model according to differentiable generative modeling includes training the generative model G according to a variational autoencoder (VAE) to minimize the loss function that includes the binding energy between an input protein and a generated protein output from the VAE, and the generative model G is the VAE.
- training the generative model according to differentiable generative modeling includes training a normalizing flow to minimize the loss function including the binding energy between the two proteins.
- FIG. 1 is a flowchart of an example of a process for drug design (e.g., antibody) based on a differential approach, in accordance with embodiments of this specification.
- FIG. 2 is a diagram illustrating an example of a computer-implemented system configured to perform drug design based on a differential approach, in accordance with embodiments of this specification.
- FIG. 3 is a block diagram illustrating an example of a computer-implemented system used to provide computational functionalities associated with described algorithms, methods, functions, processes, flows, and procedures, according to an embodiment of the present disclosure.
- FIG. 4 depicts examples of modules of an apparatus in accordance with embodiments of this specification.
- FIG. 5 is a table illustrating an example data representation of amino acids, in accordance with embodiments of this specification.
- FIG. 6 is a table illustrating an example representation of training data for training a binding energy model, in accordance with embodiments of this specification.
- An antigen can include a biological molecule or biological molecular structure or any foreign particulate matter that can bind to a specific antibody or T-cell receptor or ligand.
- An antibody can include, for example, a protein used by an immune system to identify and neutralize foreign objects such as pathogenic bacteria and viruses. The antibody recognizes or otherwise corresponds to an antigen.
- an antibody can include one or more paratopes, wherein each paratope is specific for one particular epitope on an antigen, allowing these two structures to bind together with precision.
- techniques are described for predicting, identifying, or otherwise generating antibodies that may have a low binding energy to an antigen using computer-implemented methods.
- the described techniques can be used in applications for identifying antibodies, for example, for more efficiently generating, synthesizing, screening, modifying, or otherwise designing antibodies.
- the term “antigen” or “antibody” can be broad enough to encompass one or more of a protein, a peptide, or another type of sequence or chain of amino acids (also referred to as an amino acid chain or an amino acid sequence); and the term “protein” can be broad enough to encompass one or more amino acid sequences (regardless of the length of each of the amino acid sequences, e.g., including long polypeptides, short polypeptides, or peptides) in two dimensions (2D), 3D, or a higher dimension.
- the process of drug action is primarily a process of binding between antigens and antibodies, so drugs can be screened based on the affinity of the antigens and antibodies.
- a low binding energy between two proteins can indicate a more stable binding or interaction between the two and a high affinity of the two.
- the described techniques can identify antibodies that may have a low binding energy to an antigen by directly solving an optimization problem using computer programs designed based on a differential approach (e.g., a gradient-based approach) using pre-trained models (e.g., a binding energy model and protein generative model) .
- the described techniques can identify desirable antibodies in an automatic, streamlined manner, and thus largely accelerate drug research and development (collectively referred to as drug design).
- the described techniques can reduce or replace manual labor in conventional drug design, such as trial and error by manual alteration of an amino acid sequence of a potential protein, and also reduce the chance of error.
- the described techniques can be implemented as a software application or package that can efficiently find potential antibody proteins corresponding to an antigen. Even compared to other computer-assisted drug design approaches, the described techniques can reduce computational load and improve computational efficiency by directly solving an optimization problem based on a differential approach (also referred to as differential drug design). In some embodiments, the described techniques innovatively formulate the optimization problem to find proteins that have a low binding energy to an antigen protein and make the optimization problem differentiable, so that gradient-based algorithms can be used to solve the optimization problem efficiently.
- the described techniques further leverage differentiable generative modeling techniques to design a protein generative model that can generate candidate antibody proteins via computers, without time- and cost-consuming manual experiments. Moreover, by innovatively applying advanced generative modeling techniques, the trained protein generative model can generate better candidate antibody proteins that have desired properties to facilitate antibody design, improving the success rate of the overall drug design process.
- Let f_θ(s1, s2) be a function to evaluate a binding energy (e.g., a dissociation constant (KD) or Gibbs free energy (delta_g)) between protein s1 and protein s2, where θ is the parameter of the binding energy function f_θ(s1, s2).
- the function f_θ(s1, s2) can be based on an affinity or other interactions between the protein s1 and the protein s2.
- In general, the higher the affinity, the lower the binding energy.
- each of the proteins s1 and s2 can be represented as one or more amino acid sequences that include one or more of the twenty natural amino acids and other non-natural amino acids, respectively.
- each of the proteins s1 and s2 includes only one amino acid sequence, wherein n1 and n2 refer to the sequence lengths of s1 and s2, respectively.
- s1 can be represented as a sequence of length n1 and s2 can be represented as a sequence of length n2.
- each amino acid in an amino acid sequence can be represented, for example, by a letter, a numeric, or another character, for example, as discussed with respect to FIG. 5.
- the amino acid can be represented, for example, as a numeric vector, based on word embedding or other types of encoding or conversion techniques.
- each amino acid in an amino acid sequence can be represented, for example, by one-hot encoding as a vector of binary numbers, e.g., {0, 1}.
- with one-hot encoding, s1 can be represented as n1 vectors of length d, and s2 can be represented as n2 vectors of length d.
- f_θ(s1, s2) can be represented as a mapping f_θ: {0, 1}^(n1 × d) × {0, 1}^(n2 × d) → R, which receives two sequences s1 and s2 as inputs and outputs a real value as a binding energy between the two proteins s1 and s2, where d represents the total number of possible amino acids at each position of the amino acid sequence.
- d can be 21 when there are 20 known amino acids with an additional one representing an unknown/uncertain amino acid (e.g., represented by x) .
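As a minimal sketch of this representation, assuming a 21-letter alphabet (the 20 natural amino acids plus "X" for an unknown/uncertain residue, so d = 21); the alphabet ordering, helper names, and example sequence are illustrative:

```python
import numpy as np

# Illustrative 21-letter alphabet: 20 natural amino acids plus "X" for an
# unknown/uncertain residue, as described above (d = 21).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYX"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence: str) -> np.ndarray:
    """Encode an amino acid sequence as an (n, d) matrix of {0, 1}."""
    d = len(AMINO_ACIDS)
    encoding = np.zeros((len(sequence), d), dtype=np.float32)
    for pos, aa in enumerate(sequence):
        # Unrecognized letters fall back to the "unknown" column X.
        encoding[pos, AA_INDEX.get(aa, AA_INDEX["X"])] = 1.0
    return encoding

# A protein s1 of length n1 becomes n1 one-hot vectors of length d.
s1 = one_hot_encode("QVQLVQSGAEVKK")  # shape (13, 21)
```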
- each of s1 and s2 can include multiple amino acid sequences (e.g., one or more of a heavy chain and a light chain), represented in a vector, matrix, tensor, or another form or data structure.
- the binding energy function f_θ(s1, s2) can take s1 and s2 represented in any form or data structure as input and return a real value as the output.
- the binding energy function f_θ(s1, s2) can be learned, for example, by machine learning algorithms, to infer, predict, estimate, or otherwise output a binding energy of the input s1 and s2 based on training data, such that the dimensions of the inputs and outputs of f_θ(s1, s2) match the dimensions of s1 and s2 in the training data.
- given an object protein s1 (e.g., an antigen), the goal is to find another protein s2 that can minimize its binding energy with s1, such that s2 can serve as a valid drug (e.g., a monoclonal antibody (mAb)).
- let S refer to a constraint set that includes one or more constraints (e.g., having to include or exclude some specific amino acids, meeting a certain total number of amino acids, having a certain 2D or 3D structure, or complying with a biological condition/constraint).
- the one or more constraints can be modeled mathematically by the constraint set S.
- An optimization problem for solving for a proposed or target protein s2 that minimizes a binding energy between proteins s1 and s2 can be defined based on Equation (1):

  s2* = argmin_{s2 ∈ S} f_θ(s1, s2)    (1)
- to solve Equation (1), f_θ(s1, s2), which computes the binding energy between s1 and s2, needs to be known.
- f_θ(s1, s2) can be a binding energy prediction model that is learned based on one or more function approximation approaches, such as machine learning (e.g., deep learning) from a labeled training dataset.
- f_θ(s1, s2) can be trained to infer, estimate, or otherwise predict a binding energy between the two input proteins s1 and s2 based on one or more machine learning algorithms, such as artificial neural networks.
- the binding energy prediction model can be based on an antigen-antibody affinity prediction model that is trained to predict affinity using available antigen-antibody data (e.g., including data representing pairs of antigen-antibody proteins and their respective affinities) according to deep learning methods.
- the binding energy prediction model can be stored and used in subsequent drug design based on a differential approach.
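The specification does not fix an architecture for this model; as one hedged illustration, f_θ(s1, s2) could be a small neural network over pooled one-hot encodings. The mean-pooling, layer sizes, and training objective (e.g., mean squared error against labeled delta_g values) below are all assumptions:

```python
import torch
import torch.nn as nn

class BindingEnergyModel(nn.Module):
    """Illustrative f_theta(s1, s2): maps two one-hot encoded amino acid
    sequences to a scalar binding energy (e.g., a delta_g value). The
    mean-pooling and layer sizes are assumptions, not taken from the
    specification."""

    def __init__(self, d: int = 21, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s1: torch.Tensor, s2: torch.Tensor) -> torch.Tensor:
        # s1: (n1, d) and s2: (n2, d) one-hot matrices. Mean-pooling over
        # sequence length gives fixed-size features for variable-length
        # proteins, so the input dimensions always match the MLP.
        features = torch.cat([s1.mean(dim=0), s2.mean(dim=0)])
        return self.mlp(features).squeeze(-1)
```

Such a model could then be fit by minimizing, for example, a mean squared error against the labeled binding energies in the training dataset.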
- f_θ(s1, s2) is required to be a differentiable function with respect to (w.r.t.) s2 so that Equation (1) can be solved efficiently using a gradient-based optimization algorithm such as Gradient Descent, Momentum or Nesterov (Accelerated) Gradient Descent, RMSprop, Adaptive Moment Estimation (ADAM), the Newton-Raphson algorithm, or another algorithm that updates the variable s2 based on the derivative/gradient of the function f_θ(s1, s2) w.r.t. s2.
- If f_θ(s1, s2) is not differentiable w.r.t. s2, f_θ(s1, s2) can be converted to an approximation of f_θ(s1, s2) that is differentiable w.r.t. s2, for example, by Taylor Expansion or another technique, so that Equation (1) can still be solved efficiently using a gradient-based optimization algorithm.
- Equation (1) can be rewritten into Equation (2) with the protein generation function g_φ(s0):

  φ* = argmin_{φ} f_θ(s1, g_φ(s0)), subject to g_φ(s0) ∈ S    (2)
- The optimization problem defined according to Equation (1) is transformed into an optimization problem defined according to Equation (2) to find a desired protein generation function g_φ(s0) that generates the target protein s2 that minimizes a binding energy between proteins s1 and s2.
- s0 refers to a baseline protein, which can provide a good starting point to accelerate solving the optimization problem defined according to Equation (2).
- Equation (2) can be optimized from scratch based on a differentiable generative model such as, variational autoencoder (VAE) , generative adversarial network (GAN) , or normalizing flow.
- the protein generation function g_φ(s0) can be a generative model that is trained based on differentiable generative modeling (including adversarial training) based on Equation (2) to minimize the binding energy f_θ(s1, g_φ(s0)).
- As an example, given a specific antigen s1, multiple binding antibodies s0 can be collected, for example, based on experiments or empirical knowledge, and form a training data set. A VAE model can be trained based on the training data set to generate antibodies (e.g., s2) similar to the known antibodies s0. The trained VAE model can be used as the protein generation function g_φ(s0) in Equation (2).
- Equation (1) can be rewritten into Equation (3) to find a desired s0* that gives rise to the desired s2* = g_φ(s0*) that minimizes its binding energy with s1:

  s0* = argmin_{s0} f_θ(s1, g_φ(s0)), subject to g_φ(s0) ∈ S    (3)
- f_θ(s1, g_φ(s0)) in Equation (3) is required to be differentiable w.r.t. s0 so that Equation (3) can be solved efficiently using a gradient-based optimization algorithm such as Gradient Descent, Momentum or Nesterov (Accelerated) Gradient Descent, RMSprop, Adaptive Moment Estimation (ADAM), the Newton-Raphson algorithm, or another gradient-based algorithm.
- for example, s0 can be updated iteratively based on the gradient of the objective until a terminating condition is reached, e.g., according to Equation (4):

  s0 ← s0 − η · ∇_{s0} f_θ(s1, g_φ(s0))    (4)

  where η is a step size.
- the terminating condition can be configured to include, for example, a predetermined number of iterations, or a different threshold.
- in some embodiments, f_θ(s1, g_φ(s0)) is differentiable w.r.t. g_φ(s0). If f_θ(s1, g_φ(s0)) is not differentiable w.r.t. g_φ(s0), an approximation (e.g., a Taylor Expansion) of f_θ(s1, g_φ(s0)) can be used to compute an approximated gradient in each iteration.
- because s0 is a sequence of discrete values, Gumbel softmax or any other differentiable concrete distribution with reparameterization tricks can be used to make the optimization problem (3) differentiable w.r.t. s0.
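To make the procedure concrete, the following is a minimal, non-authoritative sketch of the iterative solution of Equations (3)-(4) under the assumptions above: a frozen pre-trained binding energy model f_θ (returning a scalar), a frozen generative model g_φ, a Gumbel-softmax relaxation of the discrete s0, and an Adam optimizer. All function names, shapes, and hyperparameters are illustrative, and the constraint set S is not enforced here (it could be handled, e.g., by projection or a penalty term):

```python
import torch
import torch.nn.functional as F_nn

def solve_for_target_protein(s1, generative_model, binding_energy_model,
                             n0: int, d: int = 21, steps: int = 1000,
                             lr: float = 0.01, tau: float = 1.0):
    """Sketch of Equations (3)-(4): find a baseline protein s0 whose
    generated protein g_phi(s0) minimizes f_theta(s1, g_phi(s0)).
    Both models are pre-trained and frozen; only s0 is optimized."""
    # Relaxed, learnable stand-in for the discrete baseline protein s0.
    logits = torch.zeros(n0, d, requires_grad=True)
    optimizer = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):  # terminating condition: a fixed iteration budget
        optimizer.zero_grad()
        # Gumbel-softmax reparameterization keeps the discrete sequence
        # differentiable w.r.t. the logits, as described above.
        s0 = F_nn.gumbel_softmax(logits, tau=tau, hard=False)
        s2 = generative_model(s0)              # g_phi(s0)
        energy = binding_energy_model(s1, s2)  # f_theta(s1, g_phi(s0))
        energy.backward()                      # gradient for the Eq. (4) update
        optimizer.step()                       # update s0 (via its logits)
    with torch.no_grad():
        # Discretize the optimized s0 and generate the target protein s2*.
        s0_star = F_nn.one_hot(logits.argmax(dim=-1), d).float()
        return generative_model(s0_star)
```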
- FIG. 1 is a flowchart of an example process 100 for drug design (e.g., antibody) based on a differential approach, in accordance with embodiments of this specification.
- the process 100 can be an example of algorithms performed by a data processing apparatus, such as a computer-implemented system 200 in FIG. 2 or computer-implemented system 300 in FIG. 3.
- a data processing apparatus can be a system of one or more computers, located in one or more locations, and programmed appropriately in accordance with this specification.
- a computer-implemented system 300 of FIG. 3, appropriately programmed, can perform the example process 100.
- the example process 100 shown in FIG. 1 can be modified or reconfigured to include additional, fewer, or different operations, which can be performed in the order shown or in a different order. In some instances, one or more of the operations can be repeated or iterated, for example, until a terminating condition is reached. In some implementations, one or more of the individual operations shown in FIG. 1 can be executed as multiple separate operations, or one or more subsets of the operations shown in FIG. 1 can be combined and executed as a single operation.
- an object protein S1 (e.g., s 1 ) can be input, configured, identified, obtained, or otherwise received by the data processing apparatus.
- receiving the object protein S1 includes receiving data representing the object protein S1.
- the object protein S1 includes one or more object amino acid sequences of respective lengths.
- the object protein S1 (e.g., s 1 ) can be represented as a vector, matrix, tensor, or another form or data structure.
- the object protein S1 includes one or more sequences of length n1.
- the object protein S1 includes additional data such as embedding data (e.g., one-hot encoding data) associated with the object amino acid sequence, for example, as described w.r.t. FIGS. 2 and 5.
- the object protein S1 includes an antigen, to which the process 100 is performed to design an antibody.
- the object protein S1 includes an antigen amino acid sequence, represented by one or more sequences, or another type of sequence, with or without embedding data.
- a baseline or initial protein S0 (e.g., s0) can be inputted, configured, identified, obtained, or otherwise received by the data processing apparatus.
- the baseline protein S0 can serve as a starting point for finding a target protein corresponding to the object protein.
- the baseline protein S0 can be of default or pre-configured values.
- the process 100 can be performed without an explicit user input of the baseline protein S0.
- the baseline protein S0 can be generated, for example, randomly or according to a certain algorithm.
- the baseline protein S0 can be a starting point based on empirical knowledge learned, for example, from previous training or experiments, that could lead to a faster or better solution (e.g., approaching a global, rather than a local, optimal solution to the optimization problem in 140).
- receiving the baseline protein S0 includes receiving data representing the baseline protein S0.
- the baseline protein S0 includes one or more baseline amino acid sequences of respective lengths.
- the baseline protein S0 can include a heavy chain of amino acids and a light chain of amino acids.
- the heavy chain of amino acids and the light chain of amino acids may have the same or different lengths.
- the heavy chain of amino acids and the light chain of amino acids can be from an antibody or a portion thereof.
- the baseline protein S0 includes one or more sequences of length n0.
- the baseline protein S0 includes additional data such as embedding data associated with the baseline amino acid sequence in a similar or different manner as described w.r.t. FIGS. 2 and 5.
- the baseline protein S0 can be represented as a vector, matrix, tensor, or another form or data structure.
- the baseline protein S0 can be represented in a manner similar to or different from that of the object protein S1.
- a generative model G is inputted, configured, received, obtained, or otherwise identified by the data processing apparatus.
- the generative model G receives an input protein (e.g., the baseline protein S0) and outputs a generated protein S2 (e.g., a generated protein that is similar to the baseline protein) .
- similarity between two proteins can be defined, for example, based on a binding energy, a structural resemblance or difference, a reaction to a common stimulant between the two proteins, or a combination of these or other measures.
- two proteins may be considered similar to each other if the binding energy between the two proteins is lower than a threshold, if the number of common component amino acids between the two proteins is higher than a threshold, if a difference between structures of amino acid sequences of the two proteins is lower than a threshold, or if the two proteins interact or react similarly with another protein or stimulant.
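These similarity criteria can be combined into a simple predicate; the sketch below uses only the binding-energy criterion, and the threshold value is an assumption:

```python
def proteins_similar(p1, p2, binding_energy_model,
                     energy_threshold: float = -5.0) -> bool:
    """Illustrative similarity test using only the binding-energy criterion
    described above; structural-difference and common-reactivity criteria
    could be combined in the same way. The threshold is an assumption."""
    return float(binding_energy_model(p1, p2)) < energy_threshold
```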
- the generative model G can be the protein generation function g_φ(s0) or a variant thereof, and the generated protein S2 can be, for example, s2.
- the input protein and the generated protein may have the same dimension or different dimensions.
- the generative model G can take an input protein of a 2D sequence and output a generated protein of a 3D structure based on the input protein.
- the generative model G can learn and model the mapping or transition of the 2D sequence into a corresponding 3D structure of the protein. More generally, the generative model G can learn and model mapping or any other relationships (e.g., in 3D or higher dimensions) between the input protein and the output protein embedded in the training data set.
- the generative model G can be configured as a default function (e.g., a pre-configured function) . Identifying the generative model G can be performed automatically by the data processing apparatus without a user input. In some embodiments, the generative model G can be selected from multiple candidate functions. Identifying the generative model G can be performed by the data processing apparatus by receiving an input that specifies the generative model G and selecting the specified generative model G based on the input accordingly.
- the generative model G can be a more complex function, such as a generative model that is trained from a labeled training dataset based on differentiable generative modeling techniques such as, a variational autoencoder (VAE) , generative adversarial network (GAN) , normalizing flow, or other generative modeling techniques.
- the generative model G can be a VAE, a generative model of the GAN, or a normalizing flow that is trained on the labeled training dataset to generate an output protein based on an input protein.
- the generative model G is trained for a specific input protein (e.g., a baseline protein s 0 ) to generate an output protein similar to the input protein.
- the labeled training dataset can include the baseline protein s 0 and multiple known proteins similar to the baseline protein s 0 .
- a user input can specify a type of algorithm for deriving (or a type of) the generative model G (e.g., a VAE, GAN, or normalizing flow) .
- Identifying the generative model G can include receiving an input that specifies the type of algorithm for deriving the generative model G and selecting a trained generative model G based on the specified type of algorithm for deriving the generative model G accordingly.
- multiple types of the generative model G can be pre-trained according to different types of algorithms for deriving the generative model G.
- the generative model G can be trained upon receiving the input that specifies the type of algorithm for deriving the generative model G.
- training the generative model G can include training the generative model G according to differentiable generative modeling by configuring a loss function of the differentiable generative modeling to include a binding energy between two proteins.
- training the generative model G according to differentiable generative modeling can include calculating a loss function by querying a pre-trained binding energy model F; and updating parameters of the generative model G based on the loss function.
- a GAN includes two models, a generator and a discriminator.
- the generator is a model that generates data, for example, to mimic a given distribution of input data.
- the generator takes samples from a latent space as its input and generates data resembling the data in the training set.
- the discriminator distinguishes fake data (e.g., data generated by the generative model) from real data (e.g., the input data) .
- the discriminator is a model that receives a sample from the real data in the training set or from the generator and outputs a probability that the sample belongs to the real training data.
- the generative model G can be the generator of the GAN.
- the generative model G can be, for example, a neural network such as a multilayer perceptron (MLP), a convolutional neural network (CNN), or any other structure, as long as the dimensions of the input and output match the dimensions of the latent space and the real data.
- the loss function can be set as the binding energy between an input protein and a generated protein output from the generator so that the generator generates a protein similar to the input protein (e.g., having a low binding energy with the input protein) .
- the loss function can be computed by querying a pre-trained binding energy model F, and the parameters of the generative model G can be updated based on the computed loss function.
- the parameters of the discriminator are updated while the parameters of the generator remain fixed, and the parameters of the generator are updated while the parameters of the discriminator remain fixed.
- the discriminator can be trained to distinguish the generated samples (e.g., the generated proteins S2 output from the generative model G) from the real data (e.g., the training data that include the baseline protein).
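As a hedged illustration of one such alternating update, with the generator loss augmented by the binding energy queried from the pre-trained model F as described above (optimizer setup, batching, and the energy weight are assumptions, and binding_energy_model is assumed to accept batched inputs):

```python
import torch

def gan_training_step(generator, discriminator, binding_energy_model,
                      real_proteins, latent, input_protein,
                      gen_opt, disc_opt, energy_weight: float = 1.0):
    """One alternating GAN update as described above: the discriminator is
    updated with the generator fixed, then the generator is updated with
    the discriminator fixed."""
    bce = torch.nn.BCELoss()

    # 1) Discriminator step: distinguish real proteins from generated ones.
    disc_opt.zero_grad()
    fake = generator(latent).detach()  # detach: the generator stays fixed
    real_pred = discriminator(real_proteins)
    fake_pred = discriminator(fake)
    d_loss = (bce(real_pred, torch.ones_like(real_pred)) +
              bce(fake_pred, torch.zeros_like(fake_pred)))
    d_loss.backward()
    disc_opt.step()

    # 2) Generator step: fool the (now fixed) discriminator, and keep the
    # binding energy between the input protein and the generated proteins
    # low by querying the pre-trained binding energy model F.
    gen_opt.zero_grad()
    fake = generator(latent)
    fake_pred = discriminator(fake)
    g_loss = (bce(fake_pred, torch.ones_like(fake_pred)) +
              energy_weight * binding_energy_model(input_protein, fake).mean())
    g_loss.backward()
    gen_opt.step()
```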
- a VAE is an autoencoder whose encoding distribution is regularized during training to ensure that its latent space has good properties that allow generation of new data.
- the generative model G can be a VAE trained from a training data set that includes the baseline protein and multiple proteins similar to the baseline protein.
- the loss function typically includes a reconstruction term (that measures an error or a distance between generated data and input data to make the encoding-decoding scheme efficient) and a regularization term (that makes the latent space regular) .
- the reconstruction term of the VAE for the generative model G can be set as the binding energy between an input protein and a generated protein output from the VAE, so that the VAE can generate a protein similar to the input protein (e.g., having a low binding energy).
- the loss function of the VAE includes the binding energy, which can be computed by querying a pre-trained binding energy model F, and the parameters of the generative model G can be updated based on the computed loss function.
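A sketch of such a VAE loss, with the reconstruction term replaced by the binding energy queried from the pre-trained model F; the standard-normal KL term and the beta weighting are assumptions:

```python
import torch

def vae_loss(input_protein, generated_protein, mu, logvar,
             binding_energy_model, beta: float = 1.0):
    """VAE loss as described above: the reconstruction term is the binding
    energy between the input protein and the protein decoded by the VAE
    (queried from the pre-trained model F), plus the usual KL
    regularization term. beta is an assumed weighting hyperparameter."""
    reconstruction = binding_energy_model(input_protein, generated_protein)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return reconstruction + beta * kl
```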
- a normalizing flow can be used as another example of differentiable generative modeling to train a generative model G by configuring a loss function of the normalizing flow to incorporate the binding energy between two proteins.
- a target protein S2* (e.g., s2*) is solved for by solving an optimization problem to minimize a binding energy between the object protein S1 and the generated protein S2 output from the generative model G.
- the generated protein S2 is subject to a constraint set S (e.g., S) .
- the constraint set S can include representations of one or more biological or other constraints imposed on the generated protein S2, for example, for the generated protein S2 to be used as a valid drug, e.g., a monoclonal antibody (mAb) .
- the constraints can include, for example, an inclusion or exclusion of one or more specific amino acids, a certain total number of amino acids, a certain 2D or 3D structure, other conditions besides sequence structure (e.g., certain physicochemical or biochemical properties of molecules) , or a combination of these and other conditions.
- the binding energy between the object protein S1 and the generated protein S2 can be calculated by querying a pre-trained binding energy model F that predicts a binding energy between two proteins (e.g., the object protein S1 and the generated protein S2) that are input into the pre-trained binding energy model F.
- the pre-trained binding energy model F can be the function f_θ(s1, s2) or a variant thereof.
- the pre-trained binding energy model F can be a binding energy prediction model that is learned based on one or more function approximation techniques, such as, machine learning (e.g., deep learning) from a labeled training dataset.
- the pre-trained binding energy model F can be f_θ(s1, s2), a neural network trained to infer, estimate, or otherwise predict a binding energy between the two input proteins s1 and s2.
- the labeled training dataset can include multiple pairs of proteins and respective known binding energies of the multiple pairs of proteins.
- the optimization problem can be based on the optimization problem defined according to Equation (3) .
- solving the optimization problem to minimize the binding energy between the object protein S1 and the generated protein S2 output from the generative model G includes solving the optimization problem in an iterative manner to update the baseline protein S0 based on a derivative of the binding energy model F with respect to the baseline protein S0, for example, according to the techniques described w.r.t. Equations (3) and (4) .
- the generative model G is differentiable with respect to the baseline protein S0.
- various gradient-based optimization algorithms (e.g., Gradient Descent, Momentum or Nesterov (Accelerated) Gradient Descent, RMSprop, Adaptive Moment Estimation (ADAM), or the Newton-Raphson algorithm) can be used to solve the optimization problem efficiently, especially with the pre-trained binding energy model F and generative model G.
- the target protein S2* is output, for example, as the solution to the optimization problem.
- the target protein S2* can be derived given the baseline protein S0.
- the target protein S2* can be represented by an amino acid sequence, referred to as a target amino acid sequence.
- testing the target protein S2* is performed.
- the testing includes evaluating properties of the target protein S2*.
- the testing includes synthesizing, modifying, or performing additional or different operations on the target protein S2*.
- a drug is designed based on the target protein S2*.
- the drug can be an antibody drug corresponding to an antigen that is included in the object protein S1.
- the above steps 120-160 can be repeated (for example, as shown in FIG. 1, after step 160, the example process 100 may go back to step 120, for example, to receive another baseline protein S0) to identify different target proteins given different baseline proteins until a termination condition is reached.
- the termination condition can include, for example, reaching a predetermined number of available baseline proteins, reaching a testing result that meets or passes a predetermined threshold, etc.
- more than one baseline protein S0 can be used to find local or global solutions to the optimization problem, which can result in different target proteins S2* given different baseline proteins S0.
- the different target proteins S2* can be collected, evaluated, compared, synthesized, modified, or otherwise manipulated for designing the drug for the object protein S1 at 170.
- FIG. 2 is a diagram illustrating an example of a computer-implemented system 200 configured to perform drug design based on a differential approach, in accordance with embodiments of this specification.
- the computer-implemented system 200 can include a data preparation and preprocessing subsystem 210, an optimization solver 220, a testing/evaluation subsystem 230, and a drug design subsystem 240.
- the computer-implemented system 200 can include additional or different components.
- the data preparation and preprocessing subsystem 210 can prepare and pre-process data for the drug design based on a differential approach, for example, according to the example process 100.
- the data preparation and preprocessing subsystem 210 can encode, sort, label or otherwise prepare and pre-process data to implement the computer-implemented drug design process.
- the data preparation and preprocessing subsystem 210 can prepare and pre-process input data 215 for inputting into the optimization solver 220.
- the input data 215 can include an object protein 217 (e.g., protein s1), which can include an antigen, and a baseline protein (e.g., protein s0).
- a protein can be represented as a sequence of amino acids (also referred to as an amino acid sequence).
- the object protein 217 can include an object amino acid sequence (e.g., an antigen amino acid sequence).
- each amino acid can be represented by a letter (e.g., A, B, C, D, etc. ) or another character.
- an amino acid sequence can include embedding data that is based on, for example, word2vec vectors or other types of numeric representations, encodings, or word embedding techniques (e.g., one-hot encoding) of each amino acid in the amino acid sequence.
- FIG. 5 is a table illustrating an example data representation 500 of amino acids, in accordance with embodiments of this specification.
- the example data representation 500 includes multiple amino acids represented by respective rows.
- different amino acids can be represented by different letters, e.g., A to Z, as shown in the first column 510.
- the example data representation 500 also includes corresponding embedding data as shown in other columns of the table.
- the embedding data can be word2vec vectors or another type of embedding code.
- a protein composed of amino acids can be represented by the respective letter representations and/or embedding data representations of the amino acids.
- amino acids and the protein can be represented in another manner or data structure for computer processing.
- the data preparation and preprocessing subsystem 210 can prepare and pre-process training data A 260 for training a binding energy model F 250.
- the training data A 260 can include, for example, multiple pairs of proteins and their respective binding energies (e.g., based on affinity) .
- FIG. 6 is a table illustrating an example representation 600 of training data (e.g., training data A 260) for training a binding energy model (e.g., binding energy model F 250) , in accordance with embodiments of this specification.
- the example representation 600 of training data includes multiple data entries/instances represented by respective rows with respective numbers 610.
- Each data entry includes a pair of an antibody protein and an antigen protein, and a corresponding binding energy or affinity (represented by delta_g value in column 660) between the pair of the antibody protein and the antigen protein.
- the antibody protein includes two amino acid sequences, a heavy chain (antibody_seq_a) 630 and a light chain (antibody_seq_b) 640.
- the antigen protein includes a single amino acid sequence (antigen_seq) 650.
- the binding energy values 660 are used as the labels for training the binding energy model.
- the training data may include additional attributes, such as the ID (pdb) 620 of each antibody-antigen pair.
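FIG. 6 suggests a flat layout with columns pdb, antibody_seq_a, antibody_seq_b, antigen_seq, and delta_g; below is a sketch of loading such data for training the binding energy model, assuming a CSV file with exactly those headers:

```python
import csv

def load_binding_energy_training_data(path: str):
    """Read training data shaped like FIG. 6: each row pairs an antibody
    (heavy chain + light chain) with an antigen sequence, labeled by its
    delta_g binding energy. The CSV layout is an assumption."""
    pairs, labels = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            antibody = (row["antibody_seq_a"], row["antibody_seq_b"])
            antigen = row["antigen_seq"]
            pairs.append((antibody, antigen))
            labels.append(float(row["delta_g"]))  # training label
    return pairs, labels
```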
- the data preparation and preprocessing subsystem 210 can prepare and pre-process training data B 270 for training a protein generative model G 280.
- the training data B 270 can include, for example, a baseline protein s0 and multiple similar proteins that have a low binding energy with the baseline protein s0.
- the training data B 270 can be represented in a manner similar to the data representations shown in FIGS. 5 and 6. In some embodiments, the training data B 270 can be represented in a different manner.
- the input data can be input into the optimization solver 220 for solving an optimization problem to find a target protein, for example, according to the example process 100.
- the optimization solver 220 can be implemented, for example, by one or more processors or a specialized processor for performing accelerated optimization algorithms such as the gradient-based algorithms.
- the optimization solver 220 interacts with the pre-trained binding energy model F 250 trained based on the training data A 260 and with the protein generative model G 280 trained based on the training data B 270 to solve the optimization problem.
- the optimization solver 220 can query the pre-trained binding energy model F 250 for any two input proteins and receive a predicted binding energy from the pre-trained binding energy model F 250 efficiently, without the need for the optimization solver 220 to compute the binding energy itself on the fly.
- the optimization solver 220 can query the protein generative model G 280 for an output protein that is similar to an input protein and receive a generated protein from the protein generative model G 280. The overall computational efficiency of the drug design process can thereby be improved.
- the optimization solver 220 can generate output data 225 that includes a target protein 227.
- the target protein can be represented in a manner (e.g., with an amino acid sequence with embedding data) similar to the object protein, or can be represented in a different manner.
- the target protein can be sent to the testing/evaluation subsystem 230 for performing testing and evaluation, for example, according to the operation 160 of the example process 100.
- the target protein and testing results can be sent to the drug design subsystem for designing an antibody drug based on the target protein.
- FIG. 3 is a block diagram illustrating an example of a computer-implemented system 300 used to provide computational functionalities associated with described algorithms, methods, functions, processes, flows, and procedures, according to an embodiment of the present disclosure.
- System 300 includes a Computer 302 and a Network 330.
- the illustrated Computer 302 is intended to encompass any computing device such as a server, desktop computer, laptop/notebook computer, wireless data port, smart phone, personal data assistant (PDA) , tablet computer, one or more processors within these devices, another computing device, or a combination of computing devices, including physical or virtual instances of the computing device, or a combination of physical or virtual instances of the computing device.
- the Computer 302 can include an input device, such as a keypad, keyboard, touch screen, another input device, or a combination of input devices that can accept user information, and an output device that conveys information associated with the operation of the Computer 302, including digital data, visual, audio, another type of information, or a combination of types of information, on a graphical-type user interface (UI) (or GUI) or other UI.
- the Computer 302 can serve in a role in a distributed computing system as a client, network component, a server, a database or another persistency, another role, or a combination of roles for performing the subject matter described in the present disclosure.
- the illustrated Computer 302 is communicably coupled with a Network 330.
- one or more components of the Computer 302 can be configured to operate within an environment, including cloud-computing-based, local, global, another environment, or a combination of environments.
- the Computer 302 is an electronic computing device operable to receive, transmit, process, store, or manage data and information associated with the described subject matter.
- the Computer 302 can also include or be communicably coupled with a server, including an application server, e-mail server, web server, caching server, streaming data server, another server, or a combination of servers.
- the Computer 302 can receive requests over Network 330 (for example, from a client software application executing on another Computer 302) and respond to the received requests by processing the received requests using a software application or a combination of software applications.
- requests can also be sent to the Computer 302 from internal users (for example, from a command console or by another internal access method) , external or third-parties, or other entities, individuals, systems, or computers.
- Each of the components of the Computer 302 can communicate using a System Bus 303.
- any or all of the components of the Computer 302, including hardware, software, or a combination of hardware and software, can interface over the System Bus 303 using an application programming interface (API) 312, a Service Layer 313, or a combination of the API 312 and Service Layer 313.
- the API 312 can include specifications for routines, data structures, and object classes.
- the API 312 can be either computer-language independent or dependent and refer to a complete interface, a single function, or even a set of APIs.
- the Service Layer 313 provides software services to the Computer 302 or other components (whether illustrated or not) that are communicably coupled to the Computer 302.
- the functionality of the Computer 302 can be accessible for all service consumers using the Service Layer 313.
- Software services, such as those provided by the Service Layer 313, provide reusable, defined functionalities through a defined interface.
- the interface can be software written in JAVA, C++, another computing language, or a combination of computing languages providing data in extensible markup language (XML) format, another format, or a combination of formats.
- alternative embodiments can illustrate the API 312 or the Service Layer 313 as stand-alone components in relation to other components of the Computer 302 or other components (whether illustrated or not) that are communicably coupled to the Computer 302.
- any or all parts of the API 312 or the Service Layer 313 can be implemented as a child or a sub-module of another software module, enterprise application, or hardware module without departing from the scope of the present disclosure.
- the Computer 302 includes an Interface 304. Although illustrated as a single Interface 304, two or more Interfaces 304 can be used according to particular needs, desires, or particular embodiments of the Computer 302.
- the Interface 304 is used by the Computer 302 for communicating with another computing system (whether illustrated or not) that is communicatively linked to the Network 330 in a distributed environment.
- the Interface 304 is operable to communicate with the Network 330 and includes logic encoded in software, hardware, or a combination of software and hardware. More specifically, the Interface 304 can include software supporting one or more communication protocols associated with communications such that the Network 330 or hardware of Interface 304 is operable to communicate physical signals within and outside of the illustrated Computer 302.
- the Computer 302 includes a Processor 305. Although illustrated as a single Processor 305, two or more Processors 305 can be used according to particular needs, desires, or particular embodiments of the Computer 302. Generally, the Processor 305 executes instructions and manipulates data to perform the operations of the Computer 302 and any algorithms, methods, functions, processes, flows, and procedures as described in the present disclosure.
- the Computer 302 also includes a Database 306 that can hold data for the Computer 302, another component communicatively linked to the Network 330 (whether illustrated or not) , or a combination of the Computer 302 and another component.
- Database 306 can be an in-memory, conventional, or another type of database storing data consistent with the present disclosure.
- Database 306 can be a combination of two or more different database types (for example, a hybrid in-memory and conventional database) according to particular needs, desires, or particular embodiments of the Computer 302 and the described functionality.
- two or more databases of similar or differing types can be used according to particular needs, desires, or particular embodiments of the Computer 302 and the described functionality.
- While Database 306 is illustrated as an integral component of the Computer 302, in alternative embodiments, Database 306 can be external to the Computer 302.
- Database 306 can include the above-described training data 316 (e.g., training data A 260 and training data B 270) , one or more pre-trained binding energy models 318 (e.g., the binding energy model F 250) and protein generative models 322 (e.g., the protein generative model G 280) , one or more target proteins 323 (e.g., the target protein 227) , and one or more testing results 328 of the target proteins 323.
- the Computer 302 also includes a Memory 307 that can hold data for the Computer 302, another component or components communicatively linked to the Network 330 (whether illustrated or not) , or a combination of the Computer 302 and another component.
- Memory 307 can store any data consistent with the present disclosure.
- Memory 307 can be a combination of two or more different types of memory (for example, a combination of semiconductor and magnetic storage) according to particular needs, desires, or particular embodiments of the Computer 302 and the described functionality.
- two or more Memories 307 of similar or differing types can be used according to particular needs, desires, or particular embodiments of the Computer 302 and the described functionality.
- While Memory 307 is illustrated as an integral component of the Computer 302, in alternative embodiments, Memory 307 can be external to the Computer 302.
- the Application 308 is an algorithmic software engine providing functionality according to particular needs, desires, or particular embodiments of the Computer 302, particularly with respect to functionality described in the present disclosure.
- Application 308 can serve as one or more components, modules, or applications.
- the Application 308 can be implemented as multiple Applications 308 on the Computer 302.
- the Application 308 can be external to the Computer 302.
- the Computer 302 can also include a Power Supply 314.
- the Power Supply 314 can include a rechargeable or non-rechargeable battery that can be configured to be either user-or non-user-replaceable.
- the Power Supply 314 can include power-conversion or management circuits (including recharging, standby, or another power management functionality) .
- the Power Supply 314 can include a power plug to allow the Computer 302 to be plugged into a wall socket or another power source to, for example, power the Computer 302 or recharge a rechargeable battery.
- there can be any number of Computers 302 associated with, or external to, a computer system containing Computer 302, each Computer 302 communicating over Network 330.
- the terms “client,” “user,” or other appropriate terminology can be used interchangeably, as appropriate, without departing from the scope of the present disclosure.
- the present disclosure contemplates that many users can use one Computer 302, or that one user can use multiple Computers 302.
- FIG. 4 is a diagram of an example of modules of an apparatus 400 in accordance with embodiments of this specification.
- the apparatus 400 can be an example embodiment of a data processing apparatus for drug design (e.g., antibody) based on a differential approach, in accordance with embodiments of this specification.
- the apparatus 400 can correspond to the embodiments described above, and the apparatus 400 includes the following: a first receiving module 402 that receives an object protein S1, a second receiving module 403 that receives a baseline protein S0, an identifying module 404 that identifies a generative model G that receives the baseline protein S0 and outputs a generated protein S2, wherein the generative model G is differentiable with respect to the baseline protein S0, a solving module 405 that solves for a target protein S2* by solving an optimization problem to minimize a binding energy between the object protein S1 and the generated protein S2 output from the generative model G, wherein the generated protein S2 is subject to a constraint set S, and the binding energy between the object protein S1 and the generated protein S2 is calculated by querying a pre-trained binding energy model F that predicts a binding energy between two proteins that are input into the pre-trained binding energy model F; and an outputting module 406 that outputs the target protein S2*.
- the apparatus 400 further includes the following: a testing module 407 that performs testing on the target protein S2*; and a designing module 408 that designs an antibody drug corresponding to an antigen based on the target protein S2*, wherein the object protein S1 includes the antigen.
- the object protein S1 includes an amino acid sequence
- the baseline protein S0 includes one or more amino acid sequences.
- solving the optimization problem to minimize the binding energy between the object protein S1 and the generated protein S2 output from the generative model G includes solving the optimization problem in an iterative manner to update the baseline protein S0 based on a derivative of the pre-trained binding energy model F with respect to the baseline protein S0.
- the apparatus 400 further includes a training module 401 that trains the generative model G according to differentiable generative modeling by configuring a loss function of the differentiable generative modeling to include a binding energy between two proteins.
- training the generative model according to differentiable generative modeling includes: calculating the loss function by querying the pre-trained binding energy model F; and updating parameters of the generative model G based on the loss function.
- training the generative model according to differentiable generative modeling includes training the generative model G according to a generative adversarial network (GAN) to minimize a binding energy between an input protein and a generated protein output from the generative model G, and the generative model G is a generator of the GAN.
- training the generative model according to differentiable generative modeling includes training the generative model G according to a variational autoencoder (VAE) to minimize the loss function that includes the binding energy between an input protein and a generated protein output from the VAE, and the generative model G is the VAE.
- training the generative model according to differentiable generative modeling includes training a normalizing flow to minimize the loss function including the binding energy between the two proteins.
- a computer-implemented method for identifying a target protein corresponding to an object protein includes receiving, by a data processing apparatus, an object protein S1; receiving, by the data processing apparatus, a baseline protein S0; identifying, by the data processing apparatus, a generative model G that receives the baseline protein S0 and outputs a generated protein S2, wherein the generative model G is differentiable with respect to the baseline protein S0; solving, by the data processing apparatus, for a target protein S2* by solving an optimization problem to minimize a binding energy between the object protein S1 and the generated protein S2 output from the generative model G, wherein the generated protein S2 is subject to a constraint set S, and the binding energy between the object protein S1 and the generated protein S2 is calculated by querying a pre-trained binding energy model F that predicts a binding energy between two proteins that are input into the pre-trained binding energy model F; and outputting, by the data processing apparatus, the target protein S2*.
- a first feature combinable with any of the following features, specifies that the method further includes: performing testing on the target protein S2*; and designing an antibody drug corresponding to an antigen based on the target protein S2*, wherein the object protein S1 includes the antigen.
- a second feature combinable with any of the following features, specifies that the object protein S1 includes an amino acid sequence, and the baseline protein S0 includes one or more amino acid sequences.
- a third feature specifies that solving the optimization problem to minimize the binding energy between the object protein S1 and the generated protein S2 output from the generative model G includes solving the optimization problem in an iterative manner to update the baseline protein S0 based on a derivative of the pre-trained binding energy model F with respect to the baseline protein S0.
- a fourth feature combinable with any of the following features, specifies that the method further includes training the generative model G according to differentiable generative modeling by configuring a loss function of the differentiable generative modeling to include a binding energy between two proteins.
- a fifth feature specifies that training the generative model according to differentiable generative modeling includes: calculating the loss function by querying the pre-trained binding energy model F; and updating parameters of the generative model G based on the loss function.
- a sixth feature specifies that training the generative model according to differentiable generative modeling includes training the generative model G according to a generative adversarial network (GAN) to minimize a binding energy between an input protein and a generated protein output from the generative model G, and the generative model G is a generator of the GAN.
- a seventh feature specifies that training the generative model according to differentiable generative modeling includes training the generative model G according to a variational autoencoder (VAE) to minimize the loss function that includes the binding energy between an input protein and a generated protein output from the VAE, and the generative model G is the VAE.
- An eighth feature combinable with any of the following features, specifies that training the generative model according to differentiable generative modeling includes training a normalizing flow to minimize the loss function including the binding energy between the two proteins.
- a system including: one or more processors; and one or more computer-readable memories coupled to the one or more processors and having instructions stored thereon which are executable by the one or more processors to perform the method of the first embodiment, optionally in combination with one or more of the features described above.
- an apparatus for identifying a target protein corresponding to an object protein includes one or more modules (for example, the modules described w.r.t. FIG. 4) for performing the method of the first embodiment, optionally in combination with one or more of the features described above.
- a typical embodiment device is a computer (and the computer can be a personal computer) , a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email receiving and sending device, a game console, a tablet computer, a wearable device, or any combination of these devices.
- an apparatus embodiment basically corresponds to a method embodiment; for related parts, reference can be made to the related descriptions in the method embodiment.
- the previously described apparatus embodiment is merely an example.
- the modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one position, or may be distributed on a number of network modules. Some or all of the modules can be selected based on actual demands to achieve the objectives of the solutions of the specification. A person of ordinary skill in the art can understand and implement the embodiments of the present application without creative efforts.
- the computing implementation apparatus can be an example of a computing system configured to identify a target protein corresponding to an object protein.
- An execution body in essence can be an electronic device, and the electronic device includes the following: one or more processors; and one or more computer-readable memories configured to store executable instructions for the one or more processors.
- the one or more computer-readable memories are coupled to the one or more processors and have programming instructions stored thereon that are executable by the one or more processors to perform algorithms, methods, functions, processes, flows, and procedures, as described in this specification.
- This specification also provides one or more non-transitory computer-readable storage media coupled to one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with embodiments of the methods provided herein.
- the system includes one or more processors, and a computer-readable storage medium coupled to the one or more processors having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with embodiments of the methods provided herein.
- Embodiments of the subject matter and the actions and operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
- Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, e.g., one or more modules of computer program instructions, encoded on a computer program carrier, for execution by, or to control the operation of, data processing apparatus.
- a computer program carrier can include one or more computer-readable storage media that have instructions encoded or stored thereon.
- the carrier may be a tangible non-transitory computer-readable medium, such as a magnetic, magneto optical, or optical disk, a solid state drive, a random access memory (RAM) , a read-only memory (ROM) , or other types of media.
- the carrier may be an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
- the computer storage medium can be or be part of a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
- a computer storage medium is not a propagated signal.
- a computer program which may also be referred to or described as a program, software, a software application, an app, a module, a software module, an engine, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, engine, subroutine, or other unit suitable for executing in a computing environment, which environment may include one or more computers interconnected by a data communication network in one or more locations.
- a computer program may, but need not, correspond to a file in a file system.
- a computer program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code.
- processors for execution of a computer program include, by way of example, both general- and special-purpose microprocessors, and any one or more processors of any kind of digital computer.
- a processor will receive the instructions of the computer program for execution as well as data from a non-transitory computer-readable medium coupled to the processor.
- data processing apparatus encompasses all kinds of apparatuses, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
- Data processing apparatus can include special-purpose logic circuitry, e.g., an FPGA (field programmable gate array) , an ASIC (application specific integrated circuit) , or a GPU (graphics processing unit) .
- the apparatus can also include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
- the processes and logic flows described in this specification can be performed by one or more computers or processors executing one or more computer programs to perform operations by operating on input data and generating output.
- the processes and logic flows can also be performed by special-purpose logic circuitry, e.g., an FPGA, an ASIC, or a GPU, or by a combination of special-purpose logic circuitry and one or more programmed computers.
- Computers suitable for the execution of a computer program can be based on general or special-purpose microprocessors or both, or any other kind of central processing unit.
- a central processing unit will receive instructions and data from a read-only memory or a random access memory or both.
- Elements of a computer can include a central processing unit for executing instructions and one or more memory devices for storing instructions and data.
- the central processing unit and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.
- a computer will also include, or be operatively coupled to receive data from or transfer data to one or more storage devices.
- the storage devices can be, for example, magnetic, magneto optical, or optical disks, solid state drives, or any other type of non-transitory, computer-readable media.
- a computer need not have such devices.
- a computer may be coupled to one or more storage devices, such as, one or more memories, that are local and/or remote.
- a computer can include one or more local memories that are integral components of the computer, or the computer can be coupled to one or more remote memories that are in a cloud network.
- a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA) , a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
- Components can be “coupled to” each other by being communicatively connected to one another, such as electrically or optically, either directly or via one or more intermediate components. Components can also be “coupled to” each other if one of the components is integrated into the other. For example, a storage component that is integrated into a processor (e.g., an L2 cache component) is “coupled to” the processor.
- embodiments of the subject matter described in this specification can be implemented on, or configured to communicate with, a computer having a display device, e.g., an LCD (liquid crystal display) monitor, for displaying information to the user, and an input device by which the user can provide input to the computer, e.g., a keyboard and a pointing device, e.g., a mouse, a trackball, or touchpad.
- Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
- a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser, or by interacting with an app running on a user device, e.g., a smartphone or electronic tablet.
- a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.
Abstract
Provided herein are methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying a target protein corresponding to an object protein. In some implementations, an object protein and a baseline protein are received by a data processing apparatus. A generative model is identified, which receives the baseline protein and outputs a generated protein, wherein the generative model is differentiable with respect to the baseline protein. A target protein is solved for by solving an optimization problem to minimize a binding energy between the object protein and the generated protein output from the generative model, wherein the binding energy between the object protein and the generated protein is calculated by querying a pre-trained binding energy model that predicts a binding energy between two proteins that are input into the pre-trained binding energy model. The target protein is output.
Description
This specification relates to the design of drugs (e.g., antibodies) based on a differential approach.
Bioinformatics has been a popular research area for many biologists. However, due to the limitations of equipment and technologies, researchers have mainly been exploring physiological responses of living organisms based on laboratory experiments with sophisticated instruments. Such laboratory experiments often require significant labor and time costs. Conventional pharmaceutical drug design can be cumbersome and can require extensive manual intervention, such as continuously rotating and shifting proteins and drugs during drug screening in order to obtain protein-drug compounds (e.g., amino acids or molecules), and ultimately finding compounds with the best interfacing effect through virtual screening or other means for drug research and development.
Techniques for more efficient and effective drug design are desirable, especially with the rapid development of artificial intelligence (AI) and big data technologies.
SUMMARY
Described embodiments of the subject matter can include one or more features, alone or in combination.
For example, in one embodiment, a computer-implemented method for identifying a target protein corresponding to an object protein includes receiving, by a data processing apparatus, an object protein S1; receiving, by the data processing apparatus, a baseline protein S0; identifying, by the data processing apparatus, a generative model G that receives the baseline protein S0 and outputs a generated protein S2, wherein the generative model G is differentiable with respect to the baseline protein S0; solving, by the data processing apparatus, for a target protein S2* by solving an optimization problem to minimize a binding energy between the object protein S1 and the generated protein S2 output from the generative model G, wherein the generated protein S2 is subject to a constraint set S, and the binding energy between the object protein S1 and the generated protein S2 is calculated by querying a pre-trained binding energy model F that predicts a binding energy between two proteins that are input into the pre-trained binding energy model F; and outputting, by the data processing apparatus, the target protein S2*.
In some embodiments, these general and specific aspects may be implemented using a system, a method, or a computer program, or any combination of systems, methods, and computer programs. The foregoing and other described embodiments can each, optionally, include one or more of the following aspects:
In some embodiments, the computer-implemented method further includes performing testing on the target protein S2*; and designing an antibody drug corresponding to an antigen based on the target protein S2*, wherein the object protein S1 includes the antigen.
In some embodiments, the object protein S1 includes an amino acid sequence, and the baseline protein S0 includes one or more amino acid sequences.
In some embodiments, solving the optimization problem to minimize the binding energy between the object protein S1 and the generated protein S2 output from the generative model G includes solving the optimization problem in an iterative manner to update the baseline protein S0 based on a derivative of the pre-trained binding energy model F with respect to the baseline protein S0.
In some embodiments, the computer-implemented method further includes training the generative model G according to differentiable generative modeling by configuring a loss function of the differentiable generative modeling to include a binding energy between two proteins.
In some embodiments, training the generative model according to differentiable generative modeling includes: calculating the loss function by querying the pre-trained binding energy model F; and updating parameters of the generative model G based on the loss function.
In some embodiments, training the generative model according to differentiable generative modeling includes training the generative model G according to a generative adversarial network (GAN) to minimize a binding energy between an input protein and a generated protein output from the generative model G, wherein the generative model G is a generator of the GAN.
In some embodiments, training the generative model according to differentiable generative modeling includes training the generative model G according to a variational autoencoder (VAE) to minimize the loss function that includes the binding energy between an input protein and a generated protein output from the VAE, wherein the generative model G is the VAE.
In some embodiments, training the generative model according to differentiable generative modeling includes training a normalizing flow to minimize the loss function including the binding energy between the two proteins.
It is appreciated that methods in accordance with this specification may include any combination of the aspects and features described herein. That is, methods in accordance with this specification are not limited to the combinations of aspects and features specifically described herein but also include any combination of the aspects and features provided.
The details of one or more embodiments of this specification are set forth in the accompanying drawings and the description below. Other features and advantages of this specification will be apparent from the description and drawings, and from the claims.
FIG. 1 is a flowchart of an example of a process for drug design (e.g., antibody) based on a differential approach, in accordance with embodiments of this specification.
FIG. 2 is a diagram illustrating an example of a computer-implemented system configured to perform drug design based on a differential approach, in accordance with embodiments of this specification.
FIG. 3 is a block diagram illustrating an example of a computer-implemented System used to provide computational functionalities associated with described algorithms, methods, functions, processes, flows, and procedures, according to an embodiment of the present disclosure.
FIG. 4 depicts examples of modules of an apparatus in accordance with embodiments of this specification.
FIG. 5 is a table illustrating an example data representation of amino acids, in accordance with embodiments of this specification.
FIG. 6 is a table illustrating an example representation of training data for training a binding energy model, in accordance with embodiments of this specification.
Like reference numbers and designations in the various drawings indicate like elements.
This specification describes techniques for more efficient drug design (e.g., antibody) based on artificial intelligence (AI) technologies. An antigen can include a biological molecule or biological molecular structure or any foreign particulate matter that can bind to a specific antibody or T-cell receptor or ligand. An antibody can include, for example, a protein used by an immune system to identify and neutralize foreign objects such as pathogenic bacteria and viruses. The antibody recognizes or otherwise corresponds to an antigen. For example, an antibody can include one or more paratopes, wherein each paratope is specific for one particular epitope on an antigen, allowing these two structures to bind together with precision. In some embodiments, techniques are described for predicting, identifying, or otherwise generating antibodies that may have a low binding energy to an antigen using computer-implemented methods. The described techniques can be used in applications for identifying antibodies, for example, for more efficiently generating, synthesizing, screening, modifying, or otherwise designing antibodies. In this application, the term “antigen” or “antibody” can be broad enough to encompass one or more of a protein, a peptide, or another type of sequence or chain of amino acids (also referred to as an amino acid chain or an amino acid sequence); and the term “protein” can be broad enough to encompass one or more amino acid sequences (regardless of the length of each of the amino acid sequences, e.g., including long polypeptides, short polypeptides, or peptides) in two dimensions (2D), three dimensions (3D), or higher dimensions.
In some embodiments, the process of drug action is primarily the process of binding antigens and antibodies, which are used to screen drugs based on the affinity of the antigens and antibodies. In some embodiments, a low binding energy between two proteins can indicate a more stable binding or interaction between the two and a high affinity of the two.
In some embodiments, the described techniques can identify antibodies that may have a low binding energy to an antigen by directly solving an optimization problem using computer programs designed based on a differential approach (e.g., a gradient-based approach) using pre-trained models (e.g., a binding energy model and a protein generative model). As a result, the described techniques can identify desirable antibodies in an automatic, streamlined manner, and thus largely accelerate drug research and development (collectively referred to as drug design). In some embodiments, the described techniques can reduce or replace manual labor in conventional drug design, such as trial and error by manual alteration of an amino acid sequence of a potential protein, and also reduce the chance of error.
The techniques described in this specification can generate additional or different technical effects. In some embodiments, the described techniques can be implemented as a software-implemented application or package that can efficiently find potential antibody proteins corresponding to an antigen. Even compared to other computer-assisted drug design approaches, the described techniques can reduce computational load and improve computational efficiency by directly solving an optimization problem based on a differential approach (also referred to as differential drug design). In some embodiments, the described techniques innovatively formulate the optimization problem to find proteins that have a low binding energy to an antigen protein and make the optimization problem differentiable, and thus can use gradient-based algorithms to solve the optimization problem efficiently. In some embodiments, the described techniques further leverage differentiable generative modeling techniques to design a protein generative model that can generate candidate antibody proteins via computers, without time- and cost-consuming manual experiments. Moreover, by innovatively applying advanced generative modeling techniques, the trained protein generative model can generate better candidate antibody proteins that have desired properties to facilitate antibody design, improving the success rate of the overall drug design process.
Let function f_θ(s_1, s_2) be a function to evaluate a binding energy (e.g., dissociation constant (KD) or Gibbs free energy (delta_g)) between protein s_1 and protein s_2, where θ is the parameter of the binding energy function f_θ(s_1, s_2). In some embodiments, the function f_θ(s_1, s_2) can be based on an affinity or other interactions between the protein s_1 and the protein s_2. In some embodiments, the higher the affinity, the lower the binding energy. There can be twenty natural amino acids and other non-natural amino acids. In some embodiments, each of the proteins s_1 and s_2 can be represented as one or more amino acid sequences that include one or more of the twenty natural amino acids and other non-natural amino acids, respectively. As one example, each of the proteins s_1 and s_2 includes only one amino acid sequence, where n_1 and n_2 refer to the sequence lengths of s_1 and s_2, respectively. As one example, s_1 can be represented as a sequence of length n_1 and s_2 can be represented as a sequence of length n_2. In some embodiments, each amino acid in an amino acid sequence can be represented, for example, by a letter, a number, or another character, for example, as discussed with respect to FIG. 5. In some embodiments, to enable and facilitate computer processing, each amino acid can be represented, for example, as a numeric vector, based on word embedding or other types of encoding or conversion techniques. As an example, each amino acid in an amino acid sequence can be represented, for example, by one-hot encoding as a vector of binary numbers, e.g., {0, 1}. In this case, s_1 can be represented as n_1 one-hot vectors and s_2 can be represented as n_2 one-hot vectors, each of length d. Accordingly, f_θ(s_1, s_2) can be represented as

f_θ: {0, 1}^(d×n_1) × {0, 1}^(d×n_2) → ℝ,

which receives two sequences s_1 and s_2 as inputs and outputs a real value as a binding energy between the two proteins s_1 and s_2, where d represents the total number of possible amino acids at each position of the amino acid sequence. For example, d can be 21 when there are 20 known amino acids with an additional one representing an unknown/uncertain amino acid (e.g., represented by x). In some embodiments, each of s_1 and s_2 can include multiple amino acid sequences (e.g., one or more of a heavy chain and a light chain), represented in a vector, matrix, tensor, or another form or data structure. For example, s_1 may include two amino acid sequences, represented as s_1 = [s_11, s_12], while s_2 may include three amino acid sequences, represented as s_2 = [s_21, s_22, s_23]. In some embodiments, the binding energy function f_θ(s_1, s_2) can take s_1 and s_2 represented in any form or data structure as inputs and return a real value as the output. In some embodiments, the binding energy function f_θ(s_1, s_2) can be learned, for example, by machine learning algorithms, to infer, predict, estimate, or otherwise output a binding energy of the inputs s_1 and s_2 based on training data, such that the dimensions of the inputs and outputs of f_θ(s_1, s_2) match the dimensions of s_1 and s_2 in the training data.
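For illustration only, the following is a minimal sketch of the representation described above: one-hot encoding an amino acid sequence and querying a neural binding energy model f_θ. The 21-letter alphabet, the model architecture, and all names (one_hot_encode, BindingEnergyModel, f_theta) are assumptions of this sketch, not elements taken from this disclosure.

```python
# Illustrative sketch only; the alphabet, model architecture, and all
# names are assumptions, not this disclosure's actual implementation.
import torch
import torch.nn as nn

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 amino acids + "X" for unknown, d = 21
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot_encode(seq: str) -> torch.Tensor:
    """Encode an amino acid sequence as an (n, d) one-hot matrix."""
    idx = torch.tensor([AA_INDEX[aa] for aa in seq])
    return nn.functional.one_hot(idx, num_classes=len(ALPHABET)).float()

class BindingEnergyModel(nn.Module):
    """f_theta: maps a (protein, protein) pair to a scalar binding energy."""
    def __init__(self, d: int = 21, hidden: int = 128):
        super().__init__()
        self.encoder = nn.LSTM(d, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, s1: torch.Tensor, s2: torch.Tensor) -> torch.Tensor:
        # Summarize each variable-length sequence with its final hidden state.
        _, (h1, _) = self.encoder(s1.unsqueeze(0))
        _, (h2, _) = self.encoder(s2.unsqueeze(0))
        return self.head(torch.cat([h1[-1], h2[-1]], dim=-1)).squeeze()

f_theta = BindingEnergyModel()
energy = f_theta(one_hot_encode("MKTAYIAK"), one_hot_encode("GAVLIPFW"))
```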
Given one protein, e.g., an object protein s_1 (e.g., an antigen), it is desirable to design another protein s_2 that minimizes its binding energy with s_1. Typically, when addressing drug design problems, a valid drug, e.g., a monoclonal antibody (mAb), should satisfy some constraints. Let S refer to a constraint set that includes one or more constraints (e.g., having to include or exclude some specific amino acids, meeting a certain total number of amino acids, having a certain 2D or 3D structure, or complying with a biological condition/constraint). In some embodiments, the one or more constraints can be modeled mathematically by the constraint set S. An optimization problem for solving for a proposed or target protein s_2* that minimizes a binding energy between proteins s_1 and s_2 can be defined based on Equation (1):

s_2* = argmin_{s_2 ∈ S} f_θ(s_1, s_2)    (1)
To solve Equation (1), the function f_θ(s_1, s_2) that computes the binding energy between s_1 and s_2 needs to be known. In some embodiments, f_θ(s_1, s_2) can be a binding energy prediction model that is learned based on one or more function approximation approaches, such as machine learning (e.g., deep learning) from a labeled training dataset. For example, f_θ(s_1, s_2) can be trained to infer, estimate, or otherwise predict a binding energy between the two input proteins s_1 and s_2 based on one or more machine learning algorithms such as artificial neural networks. In some embodiments, the binding energy prediction model can be based on an antigen-antibody affinity prediction model that is trained to predict affinity using available antigen-antibody data (e.g., including data representing pairs of antigen-antibody proteins and their respective affinities) according to deep learning methods. The binding energy prediction model can be stored and used in subsequent drug design based on a differential approach.
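As a hedged illustration of how such a predictor might be fit, the following sketch trains the hypothetical BindingEnergyModel above by regression against labeled affinities; the dataset layout and hyperparameters are assumptions, not values taken from this disclosure.

```python
# Illustrative training loop; dataset layout and hyperparameters are assumed.
import torch

def train_binding_model(f_theta, dataset, epochs=10, lr=1e-3):
    """dataset: iterable of (antibody_tensor, antigen_tensor, delta_g) triples."""
    optimizer = torch.optim.Adam(f_theta.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for s2, s1, delta_g in dataset:
            optimizer.zero_grad()
            pred = f_theta(s1, s2)                   # predicted binding energy
            loss = loss_fn(pred, torch.tensor(delta_g))
            loss.backward()                          # standard backpropagation
            optimizer.step()
    return f_theta
```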
In some embodiments, f_θ(s_1, s_2) is required to be a differentiable function with respect to (w.r.t.) s_2 so that Equation (1) can be solved efficiently using a gradient-based optimization algorithm such as Gradient Descent, Momentum or Nesterov (Accelerated) Gradient Descent, RMSprop, Adaptive Moment Estimation (ADAM), the Newton-Raphson algorithm, or another algorithm that updates the variable s_2 based on the derivative/gradient of the function f_θ(s_1, s_2) w.r.t. s_2. In some embodiments, given that s_2 is a sequence of discrete values, Gumbel softmax or another differentiable concrete distribution with reparameterization tricks can be used to make the optimization problem (1) differentiable w.r.t. s_2. In some embodiments, in a case where f_θ(s_1, s_2) is not differentiable w.r.t. s_2, f_θ(s_1, s_2) can be converted to an approximation of f_θ(s_1, s_2) that is differentiable w.r.t. s_2, for example, by Taylor expansion or another technique, so that Equation (1) can still be solved efficiently using a gradient-based optimization algorithm.
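One plausible way to realize the Gumbel-softmax relaxation mentioned above is sketched below, under assumptions: continuous logits are optimized so that their softened samples stand in for the discrete one-hot sequence. The temperature, sequence length, and all names are illustrative; f_theta and one_hot_encode come from the earlier sketch.

```python
# Sketch of a Gumbel-softmax relaxation for a discrete sequence variable.
# The temperature and parameterization are illustrative assumptions.
import torch
import torch.nn.functional as F

n2, d = 30, 21                                    # assumed length and alphabet
s1_onehot = one_hot_encode("MKTAYIAK")            # assumed antigen (first sketch)
logits = torch.randn(n2, d, requires_grad=True)   # continuous surrogate for s_2

def soft_sequence(logits, tau=0.5):
    """Differentiable approximation of a one-hot sequence sample."""
    return F.gumbel_softmax(logits, tau=tau, hard=False)

optimizer = torch.optim.Adam([logits], lr=0.05)
for _ in range(200):
    optimizer.zero_grad()
    s2_soft = soft_sequence(logits)           # (n2, d), rows near one-hot
    energy = f_theta(s1_onehot, s2_soft)      # binding energy of relaxed s_2
    energy.backward()                         # gradients flow through relaxation
    optimizer.step()

s2_discrete = F.one_hot(logits.argmax(dim=-1), num_classes=d)  # final decode
```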
In some embodiments, after obtaining f_θ(s_1, s_2), a protein generation function ρ_τ(s_0) can be designed to generate the protein s_2 based on an input protein s_0, that is, s_2 = ρ_τ(s_0), where τ is the parameter of the protein generation function ρ_τ(s_0). In this case, Equation (1) can be rewritten into Equation (2) with the protein generation function ρ_τ(s_0):

τ* = argmin_{τ : ρ_τ(s_0) ∈ S} f_θ(s_1, ρ_τ(s_0))    (2)

The optimization problem defined according to Equation (1) is thus transformed into an optimization problem defined according to Equation (2) to find a desired protein generation function ρ_τ(s_0) that generates the target protein s_2 that minimizes a binding energy between proteins s_1 and s_2. In some embodiments, s_0 refers to a baseline protein, which can provide a good starting point to accelerate solving the optimization problem defined according to Equation (2).
In some embodiments, Equation (2) can be optimized from scratch based on a differentiable generative model such as a variational autoencoder (VAE), a generative adversarial network (GAN), or a normalizing flow. For example, the protein generation function ρ_τ(s_0) can be a generative model that is trained based on differentiable generative modeling (including adversarial training) based on Equation (2) to minimize the binding energy f_θ(s_1, ρ_τ(s_0)).
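In code, training ρ_τ against Equation (2) might look like the following sketch. The generator architecture and all names are assumptions; f_theta and one_hot_encode are the hypothetical components from the earlier sketches, and a production loss would typically also include the generative-modeling terms discussed later.

```python
# Sketch: fit generator parameters tau by descending the binding energy
# predicted by the frozen, pre-trained f_theta (Equation (2)).
import torch
import torch.nn as nn

class ProteinGenerator(nn.Module):
    """rho_tau: maps a baseline protein s_0 to a generated protein s_2."""
    def __init__(self, d: int = 21, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                 nn.Linear(hidden, d))

    def forward(self, s0: torch.Tensor) -> torch.Tensor:
        # Position-wise softmax keeps each row close to a one-hot choice.
        return torch.softmax(self.net(s0), dim=-1)

s0_onehot = one_hot_encode("DIQMTQSPS")   # assumed baseline antibody sequence
s1_onehot = one_hot_encode("MKTAYIAK")    # assumed antigen sequence

rho_tau = ProteinGenerator()
optimizer = torch.optim.Adam(rho_tau.parameters(), lr=1e-3)
for _ in range(500):
    optimizer.zero_grad()
    s2 = rho_tau(s0_onehot)               # generated protein rho_tau(s_0)
    loss = f_theta(s1_onehot, s2)         # objective of Equation (2)
    loss.backward()
    optimizer.step()
```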
Using VAE as an example, given a specific antigen s_1, multiple binding antibodies s_0 can be collected, for example, based on experiments or empirical knowledge, and form a training data set. A VAE model can be trained based on the training data set to generate antibodies (e.g., s_2) similar to the known antibodies s_0. The trained VAE model can then be used as the protein generation function ρ_τ(s_0) in Equation (2).
In some embodiments, with known models for the binding energy f_θ(s_1, s_2) and the protein generation function ρ_τ(s_0), Equation (1) can be rewritten into Equation (3) to find a desired s_0* that gives rise to the desired s_2* = ρ_τ(s_0*) that minimizes its binding energy with s_1:

s_0* = argmin_{s_0 : ρ_τ(s_0) ∈ S} f_θ(s_1, ρ_τ(s_0))    (3)

In some embodiments, ρ_τ(s_0) is required to be differentiable w.r.t. s_0 so that Equation (3) can be solved efficiently using a gradient-based optimization algorithm such as Gradient Descent, Momentum or Nesterov (Accelerated) Gradient Descent, RMSprop, Adaptive Moment Estimation (ADAM), the Newton-Raphson algorithm, or another gradient-based algorithm. For example, Equation (3) can be solved in an iterative manner to update s_0 until a convergence or a terminating condition is reached:

s_0^(t+1) = s_0^(t) − β · ∇_{s_0} f_θ(s_1, ρ_τ(s_0^(t)))    (4)

where t represents the t-th iteration and β represents a learning rate. The terminating condition can be configured to include, for example, a predetermined number of iterations, or a different threshold.
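A minimal sketch of the iterative update of Equation (4) follows, assuming the hypothetical f_theta, rho_tau, s0_onehot, and s1_onehot from the earlier sketches; automatic differentiation applies the chain rule described below implicitly.

```python
# Sketch of Equation (4): gradient descent on a relaxed, continuous copy of
# the baseline protein s_0. All model and variable names are assumptions
# carried over from the earlier sketches.
import torch

s0 = s0_onehot.clone().requires_grad_(True)    # continuous relaxation of s_0
beta = 0.05                                    # learning rate
for t in range(1000):
    energy = f_theta(s1_onehot, rho_tau(s0))   # f_theta(s_1, rho_tau(s_0))
    grad, = torch.autograd.grad(energy, s0)    # chain rule handled by autograd
    with torch.no_grad():
        s0 -= beta * grad                      # s_0^(t+1) = s_0^(t) - beta*grad
    if grad.norm() < 1e-4:                     # simple terminating condition
        break

s2_star = rho_tau(s0)                          # candidate target protein S2*
```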
In some embodiments, the gradient ∇_{s_0} f_θ(s_1, ρ_τ(s_0)) can be calculated based on the chain rule:

∇_{s_0} f_θ(s_1, ρ_τ(s_0)) = ∇_{ρ_τ(s_0)} f_θ(s_1, ρ_τ(s_0)) · ∇_{s_0} ρ_τ(s_0),

if f_θ(s_1, ρ_τ(s_0)) is differentiable w.r.t. ρ_τ(s_0). If f_θ(s_1, ρ_τ(s_0)) is not differentiable w.r.t. ρ_τ(s_0), an approximation (e.g., a Taylor expansion) of f_θ(s_1, ρ_τ(s_0)) can be used to compute an approximated gradient in each iteration.

In some embodiments, given that s_0 is a sequence of discrete values, Gumbel softmax or any other differentiable concrete distribution with reparameterization tricks can be used to make the optimization problem (3) differentiable w.r.t. s_0.
FIG. 1 is a flowchart of an example process 100 for drug design (e.g., antibody) based on a differential approach, in accordance with embodiments of this specification. The process 100 can be an example of algorithms performed by a data processing apparatus, such as a computer-implemented system 200 in FIG. 2 or computer-implemented system 300 in FIG. 3. In some embodiments, a data processing apparatus can be a system of one or more computers, located in one or more locations, and programmed appropriately in accordance with this specification. For example, a computer-implemented system 300 of FIG. 3, appropriately programmed, can perform the example process 100.
In some embodiments, the example process 100 shown in FIG. 1 can be modified or reconfigured to include additional, fewer, or different operations, which can be performed in the order shown or in a different order. In some instances, one or more of the operations can be repeated or iterated, for example, until a terminating condition is reached. In some implementations, one or more of the individual operations shown in FIG. 1 can be executed as multiple separate operations, or one or more subsets of the operations shown in FIG. 1 can be combined and executed as a single operation.
At 110, an object protein S1 (e.g., s_1) can be input, configured, identified, obtained, or otherwise received by the data processing apparatus. In some embodiments, receiving the object protein S1 includes receiving data representing the object protein S1. In some embodiments, the object protein S1 includes one or more object amino acid sequences of respective lengths. The object protein S1 (e.g., s_1) can be represented as a vector, matrix, tensor, or another form or data structure. As an example, the object protein S1 includes one or more sequences of length n1. In some embodiments, the object protein S1 includes additional data such as embedding data (e.g., one-hot encoding data) associated with the object amino acid sequence, for example, as described w.r.t. FIGS. 2 and 5. In some embodiments, the object protein S1 includes an antigen, for which the process 100 is performed to design an antibody. In this case, the object protein S1 includes an antigen amino acid sequence, represented by one or more sequences, or another type of sequence, with or without embedding data.
At 120, a baseline or initial protein S0 (e.g., s_0) can be input, configured, identified, obtained, or otherwise received by the data processing apparatus. The baseline protein S0 can serve as a starting point for finding a target protein corresponding to the object protein. In some embodiments, the baseline protein S0 can be of default or pre-configured values. For example, the process 100 can be performed without an explicit user input of the baseline protein S0. In some embodiments, the baseline protein S0 can be generated, for example, randomly or according to a certain algorithm. In some embodiments, the baseline protein S0 can be a starting point based on empirical knowledge learned, for example, from previous training or experiments, that could lead to a faster or better solution (e.g., one that approaches a global, rather than a local, optimal solution to the optimization problem in 140).

In some embodiments, receiving the baseline protein S0 includes receiving data representing the baseline protein S0. In some embodiments, the baseline protein S0 includes one or more baseline amino acid sequences of respective lengths. For example, the baseline protein S0 can include a heavy chain of amino acids and a light chain of amino acids. The heavy chain of amino acids and the light chain of amino acids may have the same or different lengths. In some embodiments, the heavy chain of amino acids and the light chain of amino acids can be from an antibody or a portion thereof. In some embodiments, the baseline protein S0 includes one or more sequences of length n0. In some embodiments, the baseline protein S0 includes additional data such as embedding data associated with the baseline amino acid sequence, in a similar or different manner as described w.r.t. FIGS. 2 and 5. The baseline protein S0 can be represented as a vector, matrix, tensor, or another form or data structure. The baseline protein S0 can be represented in a manner similar to or different from that of the object protein S1.
At 130, a generative model G is inputted, configured, received, obtained, or otherwise identified by the data processing apparatus. The generative model G receives an input protein (e.g., the baseline protein S0) and outputs a generated protein S2 (e.g., a generated protein that is similar to the baseline protein) . In some embodiments, similarity between two proteins (e.g., the input protein and the output protein of the generative model G) can be defined, for example, based on a binding energy, a structural resemblance or difference, a reaction to a common stimulant between the two proteins, or a combination of these or other measures. For example, two proteins may be considered similar to each other if the binding energy between the two proteins is lower than a threshold, if the number of common component amino acids between the two proteins is higher than a threshold, if a difference between structures of amino acid sequences of the two proteins is lower than a threshold, or if the two proteins interact or react similarly with another protein or stimulant.
In some embodiments, the generative model G can be the protein generation function ρ_τ(s_0) or a variant thereof, and the generated protein S2 can be, for example, s_2. The input protein and the generated protein may have the same dimension or different dimensions. As an example, the generative model G can take an input protein of a 2D sequence and output a generated protein of a 3D structure based on the input protein. In this case, the generative model G can learn and model the mapping or transition of the 2D sequence into a corresponding 3D structure of the protein. More generally, the generative model G can learn and model mappings or any other relationships (e.g., in 3D or higher dimensions) between the input protein and the output protein embedded in the training data set.

In some embodiments, the generative model G can be a pre-configured function such as an identity function, ρ_τ(s_0) = s_0, or another predefined function of the input protein. In some embodiments, the generative model G can be configured as a default function (e.g., a pre-configured function). Identifying the generative model G can be performed automatically by the data processing apparatus without a user input. In some embodiments, the generative model G can be selected from multiple candidate functions. Identifying the generative model G can be performed by the data processing apparatus by receiving an input that specifies the generative model G and selecting the specified generative model G based on the input accordingly.

In some embodiments, the generative model G can be a more complex function, such as a generative model that is trained from a labeled training dataset based on differentiable generative modeling techniques such as a variational autoencoder (VAE), a generative adversarial network (GAN), a normalizing flow, or other generative modeling techniques. For example, the generative model G can be a VAE, a generator of a GAN, or a normalizing flow that is trained on the labeled training dataset to generate an output protein based on an input protein. In some embodiments, the generative model G is trained for a specific input protein (e.g., a baseline protein s_0) to generate an output protein similar to the input protein. For example, the labeled training dataset can include the baseline protein s_0 and multiple known proteins similar to the baseline protein s_0.
In some embodiments, a user input can specify a type of algorithm for deriving (or a type of) the generative model G (e.g., a VAE, GAN, or normalizing flow) . Identifying the generative model G can include receiving an input that specifies the type of algorithm for deriving the generative model G and selecting a trained generative model G based on the specified type of algorithm for deriving the generative model G accordingly. In some embodiments, multiple types of the generative model G can be pre-trained according to different types of algorithms for deriving the generative model G. In some embodiments, the generative model G can be trained upon receiving the input that specifies the type of algorithm for deriving the generative model G.
In some embodiments, training the generative model G can include training the generative model G according to differentiable generative modeling by configuring a loss function of the differentiable generative modeling to include a binding energy between two proteins. In some embodiments, training the generative model G according to differentiable generative modeling can include calculating a loss function by querying a pre-trained binding energy model F; and updating parameters of the generative model G based on the loss function.
Use GAN as an example of the type of algorithm for deriving the generative model G. A GAN includes two models, a generator and a discriminator. The generator is a model that generates data, for example, to mimic a given distribution of input data. For example, the generator takes samples from a latent space as its input and generates data resembling the data in the training set. The discriminator distinguishes fake data (e.g., data generated by the generative model) from real data (e.g., the input data) . For example, the discriminator is a model that receives a sample from the real data in the training set or from the generator and outputs a probability that the sample belongs to the real training data.
The generative model G can be the generator of the GAN. The generative model G can be, for example, a neural network such as a multilayer perceptron (MLP), a convolutional neural network (CNN), or any other structure, as long as the dimensions of the input and output match the dimensions of the latent space and the real data. During the training process, parameters of the generator are adjusted or updated with the goal of minimizing a loss function. The loss function can be set as the binding energy between an input protein and a generated protein output from the generator, so that the generator generates a protein similar to the input protein (e.g., having a low binding energy with the input protein). The loss function can be computed by querying a pre-trained binding energy model F, and the parameters of the generative model G are updated based on the computed loss function.
In some embodiments, in one epoch of training, the parameters of the discriminator are updated while the parameters of the generator remain fixed, and the parameters of the generator are updated while the parameters of the discriminator remain fixed. After training, it is expected that the generated samples (e.g., the generated proteins S2 output from the generative model G) given by the generator will more closely resemble the real data (e.g., the training data that include the baseline protein) .
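For illustration, a hedged sketch of this alternating scheme follows, reusing the hypothetical rho_tau generator and pre-trained f_theta from the earlier sketches; the discriminator architecture, the iterable training_proteins, and the weight lam are assumptions, not elements of this disclosure.

```python
# Sketch of GAN-style training in which the binding energy predicted by the
# pre-trained f_theta contributes to the generator loss. The discriminator,
# data iterable, and loss weight `lam` are illustrative assumptions.
import torch
import torch.nn as nn

disc = nn.Sequential(nn.Linear(21, 64), nn.ReLU(),
                     nn.Linear(64, 1), nn.Sigmoid())   # per-position realness
g_opt = torch.optim.Adam(rho_tau.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCELoss()
lam = 0.1                                              # binding-energy weight

for real_s0 in training_proteins:                      # assumed (n, 21) tensors
    # 1) Update the discriminator while the generator stays fixed.
    fake = rho_tau(real_s0).detach()
    d_loss = (bce(disc(real_s0).mean(), torch.tensor(1.0)) +
              bce(disc(fake).mean(), torch.tensor(0.0)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2) Update the generator while the discriminator stays fixed; the
    #    binding-energy term encourages low energy with the input protein.
    fake = rho_tau(real_s0)
    g_loss = (bce(disc(fake).mean(), torch.tensor(1.0)) +
              lam * f_theta(real_s0, fake))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```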
Use VAE as another example of the type of algorithm for deriving the generative model G. A VAE is an autoencoder whose encoding distribution is regularized during training to ensure that its latent space has good properties allowing the generation of new data. The generative model G can be a VAE trained from a training data set that includes the baseline protein and multiple proteins similar to the baseline protein. In variational autoencoders, the loss function typically includes a reconstruction term (that measures an error or a distance between generated data and input data to make the encoding-decoding scheme efficient) and a regularization term (that makes the latent space regular). Unlike typical choices of the reconstruction term, such as a norm (e.g., l_0, l_1, l_2, or l_∞) between the generated data and the input data, in this example, the reconstruction term of the VAE for the generative model G can be set as the binding energy between an input protein and a generated protein output from the VAE, so that the VAE can generate a protein similar to the input protein (e.g., having a low binding energy). As such, the loss function of the VAE includes the binding energy, which can be computed by querying a pre-trained binding energy model F, and the parameters of the generative model G are updated based on the computed loss function.
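The following sketch shows one plausible loss of this form, with the predicted binding energy from the hypothetical f_theta standing in for the usual norm-based reconstruction term; the encoder/decoder shapes and the KL weight are assumptions.

```python
# Sketch: VAE whose reconstruction term is the predicted binding energy
# rather than a norm. Shapes and the KL weight are illustrative assumptions.
import torch
import torch.nn as nn

class ProteinVAE(nn.Module):
    def __init__(self, n: int = 30, d: int = 21, z: int = 16):
        super().__init__()
        self.enc = nn.Linear(n * d, 2 * z)           # outputs mean and log-var
        self.dec = nn.Linear(z, n * d)
        self.n, self.d = n, d

    def forward(self, s0):
        mu, logvar = self.enc(s0.flatten()).chunk(2)
        zs = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        s2 = torch.softmax(self.dec(zs).view(self.n, self.d), dim=-1)
        return s2, mu, logvar

def vae_loss(s0, s2, mu, logvar, kl_weight=0.01):
    # Reconstruction term = binding energy between input and generated protein.
    recon = f_theta(s0, s2)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kl
```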
Similarly, a normalizing flow can be used as another example of differentiable generative modeling to train a generative model G, by configuring a loss function of the normalizing flow to incorporate the binding energy between two proteins.
At 140, a target protein S2* (e.g., s_2*) is solved for by solving an optimization problem to minimize a binding energy between the object protein S1 and the generated protein S2 output from the generative model G. In some embodiments, the generated protein S2 is subject to a constraint set S. For example, the constraint set S can include representations of one or more biological or other constraints imposed on the generated protein S2, for example, for the generated protein S2 to be used as a valid drug, e.g., a monoclonal antibody (mAb). The constraints can include, for example, an inclusion or exclusion of one or more specific amino acids, a certain total number of amino acids, a certain 2D or 3D structure, other conditions besides sequence structure (e.g., certain physicochemical or biochemical properties of molecules), or a combination of these and other conditions.

The binding energy between the object protein S1 and the generated protein S2 can be calculated by querying a pre-trained binding energy model F that predicts a binding energy between two proteins (e.g., the object protein S1 and the generated protein S2) that are input into the pre-trained binding energy model F. As an example, the pre-trained binding energy model F can be the function f_θ(s_1, s_2) or a variant thereof. The pre-trained binding energy model F can be a binding energy prediction model that is learned based on one or more function approximation techniques, such as machine learning (e.g., deep learning) from a labeled training dataset. For example, the pre-trained binding energy model F can be f_θ(s_1, s_2), a neural network trained to infer, estimate, or otherwise predict a binding energy between the two input proteins s_1 and s_2. The labeled training dataset can include multiple pairs of proteins and respective known binding energies of the multiple pairs of proteins.
As an example, the optimization problem can be based on the optimization problem defined according to Equation (3) . In some embodiments, solving the optimization problem to minimize the binding energy between the object protein S1 and the generated protein S2 output from the generative model G includes solving the optimization problem in an iterative manner to update the baseline protein S0 based on a derivative of the binding energy model F with respect to the baseline protein S0, for example, according to the techniques described w.r.t. Equations (3) and (4) . In some embodiments, the generative model G is differentiable with respect to the baseline protein S0. Various gradient-based optimization algorithms (e.g., Gradient Descent, Momentum or Nesterov (Accelerated) Gradient Descent, RMSprop, Adaptive Moment Estimation (ADAM) , Newton-Raphson algorithm) can be used to solve the optimization problem efficiently, especially with the pre-trained binding energy model F and generative model G.
At 150, the target protein S2* is output, for example, as the solution to the optimization problem. The target protein S2* can be derived given the baseline protein S0. The target protein S2* can be represented by an amino acid sequence, referred to as a target amino acid sequence.
At 160, testing of the target protein S2* is performed. In some embodiments, the testing includes evaluating properties of the target protein S2*. In some embodiments, the testing includes synthesizing, modifying, or performing additional or different operations on the target protein S2*.
At 170, a drug is designed based on the target protein S2*. In some embodiments, the drug can be an antibody drug corresponding to an antigen that is included in the object protein S1.
In some embodiments, the above steps 120-160 can be repeated (for example, as shown in FIG. 1, after step 160, the example process 100 may go back to step 120, for example, to receive another baseline protein S0) to identify different target proteins given different baseline proteins until a termination condition is reached. The termination condition can include, for example, reaching a predetermined number of available baseline proteins, reaching a testing result that meets or passes a predetermined threshold, etc. As an example, more than one baseline protein S0 can be used to find local or global solutions to the optimization problem, which can result in different target proteins S2* given different baseline proteins S0. In some embodiments, the different target proteins S2* can be collected, evaluated, compared, synthesized, modified, or otherwise manipulated for designing the drug for the object protein S1 at 170.
FIG. 2 is a diagram illustrating an example of a computer-implemented system 200 configured to perform drug design based on a differential approach, in accordance with embodiments of this specification. The computer-implemented system 200 can include a data preparation and preprocessing subsystem 210, an optimization solver 220, a testing/evaluation subsystem 230, and a drug design subsystem 240. In some embodiments, the computer-implemented system 200 can include additional or different components.
The data preparation and preprocessing subsystem 210 can prepare and pre-process data for the drug design based on a differential approach, for example, according to the example process 100. The data preparation and preprocessing subsystem 210 can encode, sort, label or otherwise prepare and pre-process data to implement the computer-implemented drug design process.
In some embodiments, the data preparation and preprocessing subsystem 210 can prepare and pre-process input data 215 for inputting into the optimization solver 220. In some embodiments, the input data 215 can include an object protein 217 (e.g., protein s_1), which can include an antigen, and a baseline protein (e.g., protein s_0). In some embodiments, a protein can be represented as a sequence of amino acids (also referred to as an amino acid sequence). The object protein 217 can include an object amino acid sequence (e.g., an antigen amino acid sequence). In some embodiments, each amino acid can be represented by a letter (e.g., A, B, C, D, etc.) or another character. In some embodiments, an amino acid sequence can include embedding data that is based on, for example, word2vec vectors or other types of numeric representations, encoding, or word embedding techniques (e.g., one-hot encoding) for each amino acid in the amino acid sequence.
FIG. 5 is a table illustrating an example data representation 500 of amino acids, in accordance with embodiments of this specification. The example data representation 500 includes multiple amino acids represented by respective rows. In some embodiments, different amino acids can be represented by different letters, e.g., A to Z, as shown in the first column 510. For each amino acid, the example data representation 500 also includes corresponding embedding data, as shown in the other columns of the table. In some embodiments, the embedding data can be word2vec vectors or another type of embedding code. Accordingly, a protein composed of amino acids can be represented by the respective letter representations and/or embedding data representations of its amino acids. In some embodiments, amino acids and the protein can be represented in another manner or data structure for computer processing.
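As an informal sketch of the letter-plus-embedding representation of FIG. 5, a lookup table can map each amino acid letter to its embedding vector; the vector values below are placeholders invented for this sketch, not the actual values of the example data representation 500.

```python
import numpy as np

# Placeholder embedding table: one row per amino acid letter, as in
# the first column 510 of FIG. 5; the values are invented for this sketch.
EMBED = {
    "A": np.array([0.12, -0.54, 0.33]),
    "C": np.array([-0.41, 0.08, 0.90]),
    # ... remaining amino acid letters
}

def encode_protein(sequence):
    """Map an amino acid sequence such as 'ACA' to a matrix of embeddings."""
    return np.stack([EMBED[aa] for aa in sequence])
```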
In some embodiments, the data preparation and preprocessing subsystem 210 can prepare and pre-process training data A 260 for training a binding energy model F 250. In some embodiments, the training data A 260 can include, for example, multiple pairs of proteins and their respective binding energies (e.g., based on affinity) .
FIG. 6 is a table illustrating an example representation 600 of training data (e.g., training data A 260) for training a binding energy model (e.g., binding energy model F 250), in accordance with embodiments of this specification. The example representation 600 of training data includes multiple data entries/instances represented by respective rows with respective numbers 610. Each data entry includes a pair of an antibody protein and an antigen protein, and a corresponding binding energy or affinity (represented by the delta_g value in column 660) between the pair. More specifically, the antibody protein includes two amino acid sequences, a heavy chain (antibody_seq_a) 630 and a light chain (antibody_seq_b) 640. The antigen protein includes a single amino acid sequence (antigen_seq) 650. The binding energy values 660 are used as the labels for training the binding energy model. The training data may include additional attributes, such as the ID (pdb) 620 of the protein.
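For illustration only, assuming the training data A 260 is exported as a CSV file with the column names shown in FIG. 6, the supervised training pairs for the binding energy model could be assembled as follows; the file name is hypothetical.

```python
import pandas as pd

# Hypothetical CSV export of training data A with the columns of FIG. 6.
df = pd.read_csv("training_data_a.csv")
pairs = []
for _, row in df.iterrows():
    antibody = (row["antibody_seq_a"], row["antibody_seq_b"])  # heavy, light
    antigen = row["antigen_seq"]        # single antigen amino acid sequence
    label = row["delta_g"]              # binding energy used as the label
    pairs.append(((antibody, antigen), label))  # one supervised example for F
```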
In some embodiments, the data preparation and preprocessing subsystem 210 can prepare and pre-process training data B 270 for training a protein generative model G 280. The training data B 270 can include, for example, a baseline protein s0 and multiple similar proteins that have low binding energies with the baseline protein s0. The training data B 270 can be represented in a manner similar to the data representations shown in FIGS. 5 and 6. In some embodiments, the training data B 270 can be represented in a different manner.
The input data can be input into the optimization solver 220 for solving an optimization problem to find a target protein, for example, according to the example process 100. The optimization solver 220 can be implemented, for example, by one or more processors or by a specialized processor for performing accelerated optimization algorithms such as the gradient-based algorithms. The optimization solver 220 interacts with the pre-trained binding energy model F 250, trained based on the training data A 260, and the protein generative model G 280, trained based on the training data B 270, to solve the optimization problem. For example, the optimization solver 220 can query the pre-trained binding energy model F 250 with any two input proteins and receive a predicted binding energy from the pre-trained binding energy model F 250 efficiently, without the need to compute the binding energy itself on the fly. Similarly, the optimization solver 220 can query the protein generative model G 280 with an input protein and receive a generated protein that is similar to the input protein. The overall computational efficiency of the drug design process can thereby be improved.
The optimization solver 220 can generate output data 225 that includes a target protein 227. The target protein can be represented in a manner (e.g., with an amino acid sequence with embedding data) similar to the object protein, or can be represented in a different manner.
The target protein can be sent to the testing/evaluation subsystem 230 for performing testing and evaluation, for example, according to the operation 160 of the example process 100. The target protein and testing results can be sent to the drug design subsystem 240 for designing an antibody drug based on the target protein.
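Putting the subsystems of FIG. 2 together, a hypothetical end-to-end pass might look like the following sketch; preprocess, evaluate, and design_drug are illustrative stand-ins for the subsystems 210, 230, and 240, and solve_for_target refers to the earlier sketch.

```python
def run_pipeline(s1_raw, s0_raw, F, G):
    """Hypothetical end-to-end pass through the subsystems of FIG. 2."""
    s1, s0 = preprocess(s1_raw), preprocess(s0_raw)  # subsystem 210
    s2_star = solve_for_target(F, G, s1, s0)         # optimization solver 220
    results = evaluate(s2_star)                      # testing subsystem 230
    return design_drug(s2_star, results)             # drug design subsystem 240
```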
FIG. 3 is a block diagram illustrating an example of a computer-implemented system 300 used to provide computational functionalities associated with described algorithms, methods, functions, processes, flows, and procedures, according to an embodiment of the present disclosure. In the illustrated embodiment, System 300 includes a Computer 302 and a Network 330.
The illustrated Computer 302 is intended to encompass any computing device such as a server, desktop computer, laptop/notebook computer, wireless data port, smart phone, personal data assistant (PDA) , tablet computer, one or more processors within these devices, another computing device, or a combination of computing devices, including physical or virtual instances of the computing device, or a combination of physical or virtual instances of the computing device. Additionally, the Computer 302 can include an input device, such as a keypad, keyboard, touch screen, another input device, or a combination of input devices that can accept user information, and an output device that conveys information associated with the operation of the Computer 302, including digital data, visual, audio, another type of information, or a combination of types of information, on a graphical-type user interface (UI) (or GUI) or other UI.
The Computer 302 can serve in a role in a distributed computing system as a client, network component, a server, a database or another persistency, another role, or a combination of roles for performing the subject matter described in the present disclosure. The illustrated Computer 302 is communicably coupled with a Network 330. In some embodiments, one or more components of the Computer 302 can be configured to operate within an environment, including cloud-computing-based, local, global, another environment, or a combination of environments.
At a high level, the Computer 302 is an electronic computing device operable to receive, transmit, process, store, or manage data and information associated with the described subject matter. According to some embodiments, the Computer 302 can also include or be communicably coupled with a server, including an application server, e-mail server, web server, caching server, streaming data server, another server, or a combination of servers.
The Computer 302 can receive requests over Network 330 (for example, from a client software application executing on another Computer 302) and respond to the received requests by processing the received requests using a software application or a combination of software applications. In addition, requests can also be sent to the Computer 302 from internal users (for example, from a command console or by another internal access method) , external or third-parties, or other entities, individuals, systems, or computers.
Each of the components of the Computer 302 can communicate using a System Bus 303. In some embodiments, any or all of the components of the Computer 302, including hardware, software, or a combination of hardware and software, can interface over the System Bus 303 using an application programming interface (API) 312, a Service Layer 313, or a combination of the API 312 and Service Layer 313. The API 312 can include specifications for routines, data structures, and object classes. The API 312 can be either computer-language independent or dependent and refer to a complete interface, a single function, or even a set of APIs. The Service Layer 313 provides software services to the Computer 302 or other components (whether illustrated or not) that are communicably coupled to the Computer 302. The functionality of the Computer 302 can be accessible for all service consumers using the Service Layer 313. Software services, such as those provided by the Service Layer 313, provide reusable, defined functionalities through a defined interface. For example, the interface can be software written in JAVA, C++, another computing language, or a combination of computing languages providing data in extensible markup language (XML) format, another format, or a combination of formats. While illustrated as an integrated component of the Computer 302, alternative embodiments can illustrate the API 312 or the Service Layer 313 as stand-alone components in relation to other components of the Computer 302 or other components (whether illustrated or not) that are communicably coupled to the Computer 302. Moreover, any or all parts of the API 312 or the Service Layer 313 can be implemented as a child or a sub-module of another software module, enterprise application, or hardware module without departing from the scope of the present disclosure.
The Computer 302 includes an Interface 304. Although illustrated as a single Interface 304, two or more Interfaces 304 can be used according to particular needs, desires, or particular embodiments of the Computer 302. The Interface 304 is used by the Computer 302 for communicating with another computing system (whether illustrated or not) that is communicatively linked to the Network 330 in a distributed environment. Generally, the Interface 304 is operable to communicate with the Network 330 and includes logic encoded in software, hardware, or a combination of software and hardware. More specifically, the Interface 304 can include software supporting one or more communication protocols associated with communications such that the Network 330 or hardware of Interface 304 is operable to communicate physical signals within and outside of the illustrated Computer 302.
The Computer 302 includes a Processor 305. Although illustrated as a single Processor 305, two or more Processors 305 can be used according to particular needs, desires, or particular embodiments of the Computer 302. Generally, the Processor 305 executes instructions and manipulates data to perform the operations of the Computer 302 and any algorithms, methods, functions, processes, flows, and procedures as described in the present disclosure.
The Computer 302 also includes a Database 306 that can hold data for the Computer 302, another component communicatively linked to the Network 330 (whether illustrated or not) , or a combination of the Computer 302 and another component. For example, Database 306 can be an in-memory, conventional, or another type of database storing data consistent with the present disclosure. In some embodiments, Database 306 can be a combination of two or more different database types (for example, a hybrid in-memory and conventional database) according to particular needs, desires, or particular embodiments of the Computer 302 and the described functionality. Although illustrated as a single Database 306, two or more databases of similar or differing types can be used according to particular needs, desires, or particular embodiments of the Computer 302 and the described functionality. While Database 306 is illustrated as an integral component of the Computer 302, in alternative embodiments, Database 306 can be external to the Computer 302. As an example, Database 306 can include the above-described training data 316 (e.g., training data A 260 and training data B 270) , one or more pre-trained binding energy models 318 (e.g., the binding energy model F 250) and protein generative models 322 (e.g., the protein generative model G 280) , one or more target proteins 323 (e.g., the target protein 227) , and one or more testing results 328 of the target proteins 323.
The Computer 302 also includes a Memory 307 that can hold data for the Computer 302, another component or components communicatively linked to the Network 330 (whether illustrated or not), or a combination of the Computer 302 and another component. Memory 307 can store any data consistent with the present disclosure. In some embodiments, Memory 307 can be a combination of two or more different types of memory (for example, a combination of semiconductor and magnetic storage) according to particular needs, desires, or particular embodiments of the Computer 302 and the described functionality. Although illustrated as a single Memory 307, two or more Memories 307 of similar or differing types can be used according to particular needs, desires, or particular embodiments of the Computer 302 and the described functionality. While Memory 307 is illustrated as an integral component of the Computer 302, in alternative embodiments, Memory 307 can be external to the Computer 302.
The Application 308 is an algorithmic software engine providing functionality according to particular needs, desires, or particular embodiments of the Computer 302, particularly with respect to functionality described in the present disclosure. For example, Application 308 can serve as one or more components, modules, or applications. Further, although illustrated as a single Application 308, the Application 308 can be implemented as multiple Applications 308 on the Computer 302. In addition, although illustrated as integral to the Computer 302, in alternative embodiments, the Application 308 can be external to the Computer 302.
The Computer 302 can also include a Power Supply 314. The Power Supply 314 can include a rechargeable or non-rechargeable battery that can be configured to be either user- or non-user-replaceable. In some embodiments, the Power Supply 314 can include power-conversion or management circuits (including recharging, standby, or another power management functionality). In some embodiments, the Power Supply 314 can include a power plug to allow the Computer 302 to be plugged into a wall socket or another power source to, for example, power the Computer 302 or recharge a rechargeable battery.
There can be any number of Computers 302 associated with, or external to, a computer system containing Computer 302, each Computer 302 communicating over Network 330. Further, the terms “client, ” “user, ” or other appropriate terminology can be used interchangeably, as appropriate, without departing from the scope of the present disclosure. Moreover, the present disclosure contemplates that many users can use one Computer 302, or that one user can use multiple Computers 302.
FIG. 4 is a diagram of an example of modules of an apparatus 400 in accordance with embodiments of this specification. The apparatus 400 can be an example embodiment of a data processing apparatus for drug design (e.g., antibody) based on a differential approach, in accordance with embodiments of this specification. The apparatus 400 can correspond to the embodiments described above, and the apparatus 400 includes the following: a first receiving module 402 that receives an object protein S1; a second receiving module 403 that receives a baseline protein S0; an identifying module 404 that identifies a generative model G that receives the baseline protein S0 and outputs a generated protein S2, wherein the generative model G is differentiable with respect to the baseline protein S0; a solving module 405 that solves for a target protein S2* by solving an optimization problem to minimize a binding energy between the object protein S1 and the generated protein S2 output from the generative model G, wherein the generated protein S2 is subject to a constraint set S, and the binding energy between the object protein S1 and the generated protein S2 is calculated by querying a pre-trained binding energy model F that predicts a binding energy between two proteins that are input into the pre-trained binding energy model F; and an outputting module 406 that outputs the target protein S2*.
In some embodiments, the apparatus 400 further includes the following: a testing module 407 that performs testing on the target protein S2*; and a designing module 408 that designs an antibody drug corresponding to an antigen based on the target protein S2*, wherein the object protein S1 includes the antigen.
In some embodiments, the object protein S1 includes an amino acid sequence, and the baseline protein S0 includes one or more amino acid sequences.
In some embodiments, solving the optimization problem to minimize the binding energy between the object protein S1 and the generated protein S2 output from the generative model G includes solving the optimization problem in an iterative manner to update the baseline protein S0 based on a derivative of the pre-trained binding energy model F with respect to the baseline protein S0.
In some embodiments, the apparatus 400 further includes a training module 401 that trains the generative model G according to differentiable generative modeling by configuring a loss function of the differentiable generative modeling to include a binding energy between two proteins.
In some embodiments, training the generative model according to differentiable generative modeling includes: calculating the loss function by querying the pre-trained binding energy model F; and updating parameters of the generative model G based on the loss function.
In some embodiments, training the generative model according to differentiable generative modeling includes training the generative model G according to generative adversarial network (GAN) to minimize a binding energy between an input protein and a generated protein output from the generative model G, and the generative model G is a generator of the GAN.
In some embodiments, training the generative model according to differentiable generative modeling includes training the generative model G according to a variational autoencoder (VAE) to minimize the loss function that includes the binding energy between an input protein and a generated protein output from the VAE, and the generative model G is the VAE.
In some embodiments, training the generative model according to differentiable generative modeling includes training a normalizing flow to minimize the loss function including the binding energy between the two proteins.
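As one concrete illustration of the VAE variant above, the following minimal sketch shows how a binding energy term queried from the pre-trained model F might be folded into a standard VAE loss. It assumes PyTorch, tensor-valued protein representations, and a scalar output from F; the weighting factor lam and the use of mean-squared-error reconstruction are assumptions for illustration, not part of the original disclosure.

```python
import torch
import torch.nn.functional as nnf

def vae_binding_loss(F, s1, s0, s2, mu, logvar, lam=1.0):
    """Sketch of a VAE training loss augmented with a binding energy term."""
    recon = nnf.mse_loss(s2, s0)        # stay similar to the input protein S0
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    energy = F(s1, s2)                  # queried from the pre-trained model F
    return recon + kl + lam * energy    # lower predicted energy is favored
```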
Described embodiments of the subject matter can include one or more features, alone or in combination. For example, in a first embodiment, a computer-implemented method for identifying a target protein corresponding to an object protein includes receiving, by a data processing apparatus, an object protein S1; receiving, by the data processing apparatus, a baseline protein S0; identifying, by the data processing apparatus, a generative model G that receives the baseline protein S0 and outputs a generated protein S2, wherein the generative model G is differentiable with respect to the baseline protein S0; solving, by the data processing apparatus, for a target protein S2* by solving an optimization problem to minimize a binding energy between the object protein S1 and the generated protein S2 output from the generative model G, wherein the generated protein S2 is subject to a constraint set S, and the binding energy between the object protein S1 and the generated protein S2 is calculated by querying a pre-trained binding energy model F that predicts a binding energy between two proteins that are input into the pre-trained binding energy model F; and outputting, by the data processing apparatus, the target protein S2*.
The foregoing and other described embodiments can each, optionally, include one or more of the following features:
A first feature, combinable with any of the following features, specifies that the method further includes: performing testing on the target protein S2*; and designing an antibody drug corresponding to an antigen based on the target protein S2*, wherein the object protein S1 includes the antigen.
A second feature, combinable with any of the following features, specifies that the object protein S1 includes an amino acid sequence, and the baseline protein S0 includes one or more amino acid sequences.
A third feature, combinable with any of the following features, specifies that solving the optimization problem to minimize the binding energy between the object protein S1 and the generated protein S2 output from the generative model G includes solving the optimization problem in an iterative manner to update the baseline protein S0 based on a derivative of the pre-trained binding energy model F with respect to the baseline protein S0.
A fourth feature, combinable with any of the following features, specifies that the method further includes training the generative model G according to differentiable generative modeling by configuring a loss function of the differentiable generative modeling to include a binding energy between two proteins.
A fifth feature, combinable with any of the following features, specifies that training the generative model according to differentiable generative modeling includes: calculating the loss function by querying the pre-trained binding energy model F; and updating parameters of the generative model G based on the loss function.
A sixth feature, combinable with any of the following features, specifies that training the generative model according to differentiable generative modeling includes training the generative model G according to generative adversarial network (GAN) to minimize a binding energy between an input protein and a generated protein output from the generative model G, and the generative model G is a generator of the GAN.
A seventh feature, combinable with any of the following features, specifies that training the generative model according to differentiable generative modeling includes training the generative model G according to a variational autoencoder (VAE) to minimize the loss function that includes the binding energy between an input protein and a generated protein output from the VAE, and the generative model G is the VAE.
An eighth feature, combinable with any of the following features, specifies that training the generative model according to differentiable generative modeling includes training a normalizing flow to minimize the loss function including the binding energy between the two proteins.
In a second embodiment, a system, including: one or more processors; and one or more computer-readable memories coupled to the one or more processors and having instructions stored thereon which are executable by the one or more processors to perform the method of the first embodiment and its optional combinations of one or more of the features described above.
In a third embodiment, an apparatus for identifying a target protein corresponding to an object protein. The apparatus includes one or more modules (for example, the modules as described w.r.t. FIG. 4) for performing the method of the first embodiment and its optional combinations of one or more of the features described above.
The system, apparatus, module, or unit illustrated in the previous embodiments can be implemented by using a computer chip or an entity, or can be implemented by using a product having a certain function. A typical embodiment device is a computer (which can be a personal computer), a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email receiving and sending device, a game console, a tablet computer, a wearable device, or any combination of these devices.
For an embodiment process of functions and roles of each module in the apparatus, references can be made to an embodiment process of corresponding steps in the previous method. Details are omitted here for simplicity.
Because an apparatus embodiment basically corresponds to a method embodiment, for related parts, references can be made to related descriptions in the method embodiment. The previously described apparatus embodiment is merely an example. The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules; they may be located in one position or distributed across a number of network modules. Some or all of the modules can be selected based on actual demands to achieve the objectives of the solutions of the specification. A person of ordinary skill in the art can understand and implement the embodiments of the present application without creative efforts.
Referring again to FIG. 4, it can be interpreted as illustrating internal functional modules and a structure of a computing implementation apparatus. The computing implementation apparatus can be an example of a computing system configured to identify a target protein corresponding to an object protein. An execution body in essence can be an electronic device, and the electronic device includes the following: one or more processors; and one or more computer-readable memories configured to store an executable instruction of the one or more processors. In some embodiments, the one or more computer-readable memories are coupled to the one or more processors and have programming instructions stored thereon that are executable by the one or more processors to perform algorithms, methods, functions, processes, flows, and procedures, as described in this specification. This specification also provides one or more non-transitory computer-readable storage media coupled to one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with embodiments of the methods provided herein.
This specification further provides a system for implementing the methods provided herein. The system includes one or more processors, and a computer-readable storage medium coupled to the one or more processors having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with embodiments of the methods provided herein.
Embodiments of the subject matter and the actions and operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, e.g., one or more modules of computer program instructions, encoded on a computer program carrier, for execution by, or to control the operation of, data processing apparatus. For example, a computer program carrier can include one or more computer-readable storage media that have instructions encoded or stored thereon. The carrier may be a tangible non-transitory computer-readable medium, such as a magnetic, magneto optical, or optical disk, a solid state drive, a random access memory (RAM) , a read-only memory (ROM) , or other types of media. Alternatively, or in addition, the carrier may be an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be or be part of a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. A computer storage medium is not a propagated signal.
A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, an engine, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, engine, subroutine, or other unit suitable for executing in a computing environment, which environment may include one or more computers interconnected by a data communication network in one or more locations.
A computer program may, but need not, correspond to a file in a file system. A computer program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code.
Processors for execution of a computer program include, by way of example, both general- and special-purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive the instructions of the computer program for execution as well as data from a non-transitory computer-readable medium coupled to the processor.
The term “data processing apparatus” encompasses all kinds of apparatuses, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. Data processing apparatus can include special-purpose logic circuitry, e.g., an FPGA (field programmable gate array) , an ASIC (application specific integrated circuit) , or a GPU (graphics processing unit) . The apparatus can also include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more computers or processors executing one or more computer programs to perform operations by operating on input data and generating output. The processes and logic flows can also be performed by special-purpose logic circuitry, e.g., an FPGA, an ASIC, or a GPU, or by a combination of special-purpose logic circuitry and one or more programmed computers.
Computers suitable for the execution of a computer program can be based on general or special-purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. Elements of a computer can include a central processing unit for executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.
Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to one or more storage devices. The storage devices can be, for example, magnetic, magneto optical, or optical disks, solid state drives, or any other type of non-transitory, computer-readable media. However, a computer need not have such devices. Thus, a computer may be coupled to one or more storage devices, such as, one or more memories, that are local and/or remote. For example, a computer can include one or more local memories that are integral components of the computer, or the computer can be coupled to one or more remote memories that are in a cloud network. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA) , a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
Components can be “coupled to” each other by being communicatively connected to one another, such as electrically or optically, either directly or via one or more intermediate components. Components can also be “coupled to” each other if one of the components is integrated into the other. For example, a storage component that is integrated into a processor (e.g., an L2 cache component) is “coupled to” the processor.
To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on, or configured to communicate with, a computer having a display device, e.g., an LCD (liquid crystal display) monitor, for displaying information to the user, and an input device by which the user can provide input to the computer, e.g., a keyboard and a pointing device, e.g., a mouse, a trackball, or touchpad. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser, or by interacting with an app running on a user device, e.g., a smartphone or electronic tablet. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.
This specification uses the term “configured to” in connection with systems, apparatus, and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions. For special-purpose logic circuitry to be configured to perform particular operations or actions means that the circuitry has electronic logic that performs the operations or actions.
While this specification contains many specific embodiment details, these should not be construed as limitations on the scope of what is being claimed, which is defined by the claims themselves, but rather as descriptions of features that may be specific to particular embodiments. Certain features that are described in this specification in the context of separate embodiments can also be realized in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be realized in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claim may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.
Claims (11)
- A computer-implemented method for identifying a target protein corresponding to an object protein, the method comprising:
receiving, by a data processing apparatus, an object protein S1;
receiving, by the data processing apparatus, a baseline protein S0;
identifying, by the data processing apparatus, a generative model G that receives the baseline protein S0 and outputs a generated protein S2, wherein the generative model G is differentiable with respect to the baseline protein S0;
solving, by the data processing apparatus, for a target protein S2* by solving an optimization problem to minimize a binding energy between the object protein S1 and the generated protein S2 output from the generative model G, wherein the generated protein S2 is subject to a constraint set S, and the binding energy between the object protein S1 and the generated protein S2 is calculated by querying a pre-trained binding energy model F that predicts a binding energy between two proteins that are input into the pre-trained binding energy model F; and
outputting, by the data processing apparatus, the target protein S2*.
- The method of claim 1, further comprising:
performing testing on the target protein S2*; and
designing an antibody drug corresponding to an antigen based on the target protein S2*, wherein the object protein S1 comprises the antigen.
- The method of any preceding claim, wherein the object protein S1 comprises an amino acid sequence, and the baseline protein S0 comprises one or more amino acid sequences.
- The method of any preceding claim, wherein solving the optimization problem to minimize the binding energy between the object protein S1 and the generated protein S2 output from the generative model G comprises solving the optimization problem in an iterative manner to update the baseline protein S0 based on a derivative of the pre-trained binding energy model F with respect to the baseline protein S0.
- The method of any preceding claim, further comprising:
training the generative model G according to differentiable generative modeling by configuring a loss function of the differentiable generative modeling to comprise a binding energy between two proteins.
- The method of claim 5, wherein training the generative model according to differentiable generative modeling comprises:
calculating the loss function by querying the pre-trained binding energy model F; and
updating parameters of the generative model G based on the loss function.
- The method of claim 5, wherein training the generative model according to differentiable generative modeling comprises training the generative model G according to generative adversarial network (GAN) to minimize a binding energy between an input protein and a generated protein output from the generative model G, and the generative model G is a generator of the GAN.
- The method of claim 5, wherein training the generative model according to differentiable generative modeling comprises training the generative model G according to a variational autoencoder (VAE) to minimize the loss function that comprises the binding energy between an input protein and a generated protein output from the VAE, and the generative model G is the VAE.
- The method of claim 5, wherein training the generative model according to differentiable generative modeling comprises training a normalizing flow to minimize the loss function comprising the binding energy between the two proteins.
- A system for performing a software-implemented application for identifying a target protein corresponding to an object protein, the system comprising:
one or more processors; and
one or more computer-readable memories coupled to the one or more processors and having instructions stored thereon that are executable by the one or more processors to perform the method of any of claims 1 to 9.
- An apparatus for identifying a target protein corresponding to an object protein, the apparatus comprising multiple modules for performing the method of any one of claims 1 to 9.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2022/091721 WO2023216065A1 (en) | 2022-05-09 | 2022-05-09 | Differentiable drug design |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023216065A1 true WO2023216065A1 (en) | 2023-11-16 |
Family
ID=88729487
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2022/091721 Ceased WO2023216065A1 (en) | 2022-05-09 | 2022-05-09 | Differentiable drug design |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2023216065A1 (en) |